Note that text shown in this style documents a feature which isn't in the current release but will be in the next release, and text shown thus indicates a feature which is being removed in the next release.
If you find anything in this documentation which is wrong or unclear, please use the link at the bottom of the page to comment and we will update the page to correct it or make it clearer.
When uploading a document, Opus checks what sort of document it is by examining its file extension. It will only accept certain document types. The default set is MS/Word, Adobe PDF and Postscript. Opus uses helper applications to extract the raw text from these documents. It tries to use antiword for MS/Word, but if it can't find that it uses strings. For PDF and Postscript it uses pdftotext, part of the Xpdf package, or, failing that, ps2ascii, part of the ghostscript package.
If you are using shared hosting and your ISP doesn't provide these binaries, you can put them in ./php/ext and Opus will pick them up from there in preference to any other binary. Be careful when doing this: use statically linked binaries to avoid problems when the ISP changes their shared libraries. For example, a statically linked version of pdftotext can be found here.
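The lookup order described above can be sketched roughly as follows. This is an illustration in Python, not Opus's own code: the function name find_helper is hypothetical, and it assumes that each candidate is checked in ./php/ext first and on the system PATH second before moving on to the next candidate. Opus's actual search order across candidates may differ.

```python
import os
import shutil

def find_helper(candidates, local_dir="./php/ext"):
    """Return the path of the first usable helper binary, or None.

    Hypothetical sketch: for each candidate name, a copy in local_dir
    is preferred over one found on the system PATH.
    """
    for name in candidates:
        # A binary placed in the local directory wins over any other copy.
        local = os.path.join(local_dir, name)
        if os.path.isfile(local) and os.access(local, os.X_OK):
            return local
        # Otherwise fall back to whatever the system PATH provides.
        found = shutil.which(name)
        if found:
            return found
    # None of the listed helpers could be found.
    return None
```

For PDF uploads, for instance, a call such as find_helper(["pdftotext", "ps2ascii"]) would return a copy of pdftotext in ./php/ext if one is present and executable.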
You can override the list of binaries used for the different document types by creating a file ./php/cfg/document_types and listing the valid file types in there, one per line. The fields on each line are separated by commas: the first is the file extension in upper case, the second is a description of the file format, and the remaining fields are a list of zero or more helper applications Opus can use to extract the raw text. Opus will attempt to use the first helper listed and then fall back through any remaining entries if it can't find the first.
Here's an example which emulates the default behaviour:
DOC,MS/Word,antiword,strings
PDF,Adobe PDF,pdftotext,ps2ascii
PS,Postscript,pdftotext,ps2ascii
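To make the file format concrete, here is a minimal sketch of how such a file could be read. It is written in Python for illustration and is not Opus's own parser: the names parse_document_types and is_accepted are hypothetical, and skipping blank or malformed lines and matching extensions case-insensitively are assumptions rather than documented behaviour.

```python
import os

def parse_document_types(text):
    """Map each upper-case extension to (description, [helpers])."""
    types = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2:
            continue  # assumption: lines without a description are skipped
        extension, description = fields[0], fields[1]
        # Zero or more helpers, kept in fallback order.
        helpers = [helper for helper in fields[2:] if helper]
        types[extension.upper()] = (description, helpers)
    return types

def is_accepted(filename, types):
    """Check a file name's extension against the configured types."""
    extension = os.path.splitext(filename)[1].lstrip(".").upper()
    return extension in types

EXAMPLE = """DOC,MS/Word,antiword,strings
PDF,Adobe PDF,pdftotext,ps2ascii
PS,Postscript,pdftotext,ps2ascii
"""

types = parse_document_types(EXAMPLE)
```

With the example configuration, types["PDF"] holds the description "Adobe PDF" and the helper list pdftotext, ps2ascii in fallback order, and a file named report.pdf would be accepted while image.png would not.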