NAME

       ocrodjvu - OCR for DjVu files

SYNOPSIS

       ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file

       ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file

       ocrodjvu --save-script script-file [option...] djvu-file

       ocrodjvu --in-place [option...] djvu-file

       ocrodjvu --dry-run [option...] djvu-file

       ocrodjvu {--version | --help | -h | --list-engines | --list-languages}

DESCRIPTION

       ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on
       DjVu files.

       The following OCR engines are supported:

       ·   OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or
           rec-tess) command, so that ultimately Tesseract acts as the OCR
           backend);

       ·   Cuneiform for Linux[2].

OPTIONS

   OCR engine options
       --engine=engine-id
           Use this OCR engine. The default is ‘ocropus’ (OCRopus).

       --list-engines
           Print list of available OCR engines.

   Options controlling output
       It is mandatory to use exactly one of the following options:

       -o, --save-bundled=output-djvu-file
           Save OCR results as a bundled multi-page document into
           output-djvu-file.

       -i, --save-indirect=index-djvu-file
           Save OCR results as an indirect multi-page document. Use
           index-djvu-file as the index file name; put the component files
           into the same directory. The directory must exist and be writable.

       --save-script=script-file
           Save a djvused script with OCR results into script-file.

       --in-place
           Save OCR results in place.

           (Use this option to retain compatibility with ocrodjvu < 0.2.)

       --dry-run
           Don't change any files; discard the OCR results.
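
       The "exactly one output option" rule can be sketched in Python as
       follows. This is only an illustration: build_command and its mode
       names ("bundled", "indirect", and so on) are hypothetical helpers, not
       part of ocrodjvu; only the option names come from this page. The
       function builds an argument list and does not run anything.

```python
def build_command(djvu_file, output_mode, target=None, extra_options=()):
    """Build an ocrodjvu argument list with exactly one output option.

    Hypothetical helper for illustration; the mode names are assumptions.
    """
    # Output options that take a file name.
    with_target = {
        "bundled": "--save-bundled",
        "indirect": "--save-indirect",
        "script": "--save-script",
    }
    # Output options that take no argument.
    without_target = {
        "in-place": "--in-place",
        "dry-run": "--dry-run",
    }
    if output_mode in with_target:
        if target is None:
            raise ValueError(f"mode {output_mode!r} needs a target file")
        output = [f"{with_target[output_mode]}={target}"]
    elif output_mode in without_target:
        if target is not None:
            raise ValueError(f"mode {output_mode!r} takes no target file")
        output = [without_target[output_mode]]
    else:
        raise ValueError(f"unknown output mode: {output_mode!r}")
    # Order follows the SYNOPSIS: output option, other options, input file.
    return ["ocrodjvu", *output, *extra_options, djvu_file]
```

       The resulting list could be passed to subprocess.run() on a system
       where ocrodjvu is installed.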

   Text segmentation options
       -t lines, --details=lines
           Record location of every line. Don't record locations of particular
           words or characters.

           This is the default for OCRopus 0.2.

       -t words, --details=words
           Record location of every line and every word. Don't record
           locations of particular characters.

           This is the default for OCRopus ≥ 0.3.1 and for Cuneiform.

           This option is ineffective with OCRopus 0.2.

       -t chars, --details=chars
           Record location of every line, every word and every character.

           This option is ineffective with OCRopus 0.2.

       --word-segmentation=simple
           Consider each non-empty sequence of non-whitespace characters a
           single word.

           This is the default, despite being linguistically incorrect.

       --word-segmentation=uax29
           Use the Unicode Text Segmentation[3] algorithm to break lines into
           words.

           This option breaks assumptions of some DjVu tools that words are
           separated by spaces, and is therefore not recommended.
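
       The ‘simple’ word segmentation rule described above can be approximated
       in Python with str.split(). This is a sketch, not ocrodjvu's actual
       code: simple_words is a hypothetical name, and Python's definition of
       whitespace is assumed to match the one ocrodjvu uses.

```python
def simple_words(line):
    """Split a line into non-empty runs of non-whitespace characters.

    Hypothetical helper; assumes Python's notion of whitespace.
    """
    # str.split() with no arguments splits on runs of whitespace and
    # never yields empty strings.
    return line.split()


# Punctuation stays attached to the neighbouring word, which is why the
# rule is described as linguistically incorrect.
print(simple_words("Hello, world!  (test)"))
```

       A UAX #29 word breaker would instead separate punctuation from the
       adjacent letters, so the resulting words would no longer be delimited
       by spaces alone.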

   Other options
       --clear-text
           Remove any existing hidden text from pages not selected for OCR.

           (Use this option to retain compatibility with ocrodjvu < 0.2.)

       --ocr-only
           Don't save pages that were not processed.

       --language=language-id
           Set recognition language.  language-id is typically an ISO 639-2
           three-letter code.

           For OCRopus, the default is ‘eng’ (English), unless the
           tesslanguage environment variable is set. For other OCR engines,
           the default is always ‘eng’.

       --list-languages
           Print list of available languages for the currently selected OCR
           engine.

       --render=mask
           Render only masks of page images.

           This is the default.

       --render=foreground
           Render only foreground layers of page images.

       --render=all
           Render all layers of page images.

           This option is necessary to OCR DjVu files with invalid
           foreground/background separation.

       -p, --pages=page-range
           Specifies pages to process.  page-range is a comma-separated list
           of sub-ranges. Each sub-range is either a single page (e.g. 17) or
           a contiguous range of pages (e.g. 37-42). Pages are numbered from
           1.

           The default is to process all pages.

       -j, --jobs=n
           Start up to n OCR processes.

       -D, --debug
           To ease debugging, don't delete intermediate files.

       --version
           Output version information and exit.

       -h, --help
           Display help and exit.
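
       The page-range syntax accepted by --pages can be illustrated with a
       small Python parser. This is a sketch under stated assumptions, not
       ocrodjvu's implementation: parse_page_range is a hypothetical name,
       and rejecting reversed sub-ranges such as 5-3 is an assumption not
       stated on this page.

```python
def parse_page_range(page_range):
    """Expand a string such as '17,37-42' into a list of page numbers.

    Hypothetical helper; pages are numbered from 1.
    """
    pages = []
    for sub_range in page_range.split(","):
        # A sub-range is either a single page or 'first-last'.
        first, sep, last = sub_range.partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Assumption: page 0 and reversed sub-ranges are errors.
        if start < 1 or end < start:
            raise ValueError(f"invalid sub-range: {sub_range!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_range("17,37-42"))
```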

ENVIRONMENT

       The following environment variables affect ocrodjvu:

       tesslanguage
           Recognition language for Tesseract.

           (Use of this variable is deprecated in favor of the --language
           option.)

       TMPDIR
           Directory for temporary files. The default is /tmp.
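
       The defaults described for --language, tesslanguage and TMPDIR can be
       summarised in Python. This is a sketch only: the function names are
       hypothetical, the engine identifier ‘cuneiform’ used in the example
       call is an assumption, and so is the rule that an explicit --language
       value takes precedence over tesslanguage.

```python
def effective_language(engine, cli_language, environ):
    """Return the recognition language that would be used.

    Hypothetical helper; 'environ' is any mapping, e.g. os.environ.
    """
    # Assumption: an explicit --language value always wins.
    if cli_language is not None:
        return cli_language
    # Only OCRopus consults the deprecated tesslanguage variable.
    if engine == "ocropus":
        return environ.get("tesslanguage", "eng")
    return "eng"


def temp_dir(environ):
    """Return the directory for temporary files (default: /tmp)."""
    return environ.get("TMPDIR", "/tmp")


env = {"tesslanguage": "pol"}
print(effective_language("ocropus", None, env))
print(effective_language("cuneiform", None, env))  # engine id is assumed
print(temp_dir(env))
```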

SEE ALSO

       djvu(1), ocroscript(1), tesseract(1)

AUTHOR

       Jakub Wilk <jwilk@jwilk.net>
           Author.

COPYRIGHT

       Copyright © 2008, 2009, 2010 Jakub Wilk

NOTES

        1. OCRopus
           http://ocropus.googlecode.com/

        2. Cuneiform for Linux
           http://launchpad.net/cuneiform-linux

        3. Unicode Text Segmentation
           http://unicode.org/reports/tr29/