NAME
ocrodjvu - OCR for DjVu files
SYNOPSIS
ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}
DESCRIPTION
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on
DjVu files.
The following OCR engines are supported:
· OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or
rec-tess) command, so that ultimately Tesseract acts as the OCR
backend);
· Cuneiform for Linux[2].
OPTIONS
OCR engine options
--engine=engine-id
Use this OCR engine. The default is ‘ocropus’ (OCRopus).
--list-engines
Print list of available OCR engines.
Options controlling output
It is mandatory to use exactly one of the following options:
-o, --save-bundled=output-djvu-file
Save OCR results as a bundled multi-page document into
output-djvu-file.
-i, --save-indirect=index-djvu-file
Save OCR results as an indirect multi-page document. Use
index-djvu-file as the index file name; put the component files
into the same directory. The directory must exist and be writable.
--save-script=script-file
Save a djvused script with OCR results into script-file.
--in-place
Save OCR results in place.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--dry-run
Don't change any files; throw OCR results away.
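The "exactly one output option" rule can be illustrated with a small Python sketch that builds an ocrodjvu command line. The function name build_command and its keyword arguments are hypothetical and chosen only for this example; the option names are the ones documented above, and the check is an illustration of the rule, not ocrodjvu's own implementation.

```python
def build_command(djvu_file, save_bundled=None, save_indirect=None,
                  save_script=None, in_place=False, dry_run=False):
    """Return an argv list for ocrodjvu with exactly one output option.

    Hypothetical helper: it only assembles arguments, it does not run anything.
    """
    chosen = []
    if save_bundled is not None:
        chosen.append(["--save-bundled", save_bundled])
    if save_indirect is not None:
        chosen.append(["--save-indirect", save_indirect])
    if save_script is not None:
        chosen.append(["--save-script", save_script])
    if in_place:
        chosen.append(["--in-place"])
    if dry_run:
        chosen.append(["--dry-run"])
    # Mirror the documented rule: one output option, no more and no fewer.
    if len(chosen) != 1:
        raise ValueError(
            "exactly one output option is required, got %d" % len(chosen))
    return ["ocrodjvu"] + chosen[0] + [djvu_file]
```

The resulting list could be passed to subprocess.run(argv, check=True) on a system where ocrodjvu is installed.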
Text segmentation options
-t lines, --details=lines
Record location of every line. Don't record locations of particular
words or characters.
This is the default for OCRopus 0.2.
-t words, --details=words
Record location of every line and every word. Don't record
locations of particular characters.
This is the default for OCRopus ≥ 0.3.1 and for Cuneiform.
This option is ineffective with OCRopus 0.2.
-t chars, --details=chars
Record location of every line, every word and every character.
This option is ineffective with OCRopus 0.2.
--word-segmentation=simple
Consider each non-empty sequence of non-whitespace characters a
single word.
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
Use the Unicode Text Segmentation[3] algorithm to break lines into
words.
This option breaks the assumption made by some DjVu tools that words
are separated by spaces, and it is therefore not recommended.
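As a rough illustration of the ‘simple’ word segmentation described above, the following Python sketch treats every maximal run of non-whitespace characters as one word. The function name simple_words is hypothetical, and Python's notion of whitespace (the \S class on Unicode strings) is an assumption; the exact definition used by ocrodjvu is not stated here.

```python
import re

def simple_words(line):
    """Return (start, end, text) for each run of non-whitespace characters."""
    # \S+ matches a non-empty sequence of non-whitespace characters.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]

# Punctuation stays attached to the neighbouring word, which is why
# this scheme is described as linguistically incorrect.
words = simple_words("Hello, world!  foo-bar")
# words == [(0, 6, "Hello,"), (7, 13, "world!"), (15, 22, "foo-bar")]
```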
Other options
--clear-text
Remove any existing hidden text from the pages not selected for
OCR.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--ocr-only
Don't save pages that were not processed.
--language=language-id
Set recognition language. language-id is typically an ISO 639-2
three-letter code.
For OCRopus, the default is ‘eng’ (English), unless the
tesslanguage environment variable is set. For other OCR engines,
the default is always ‘eng’.
--list-languages
Print list of available languages for the currently selected OCR
engine.
--render=mask
Render only masks of page images.
This is the default.
--render=foreground
Render only foreground layers of page images.
--render=all
Render all layers of page images.
This option is necessary to OCR DjVu files with invalid
foreground/background separation.
-p, --pages=page-range
Specifies pages to process. page-range is a comma-separated list
of sub-ranges. Each sub-range is either a single page (e.g. 17) or
a contiguous range of pages (e.g. 37-42). Pages are numbered from
1.
The default is to process all pages.
-j, --jobs=n
Start up to n OCR processes.
-D, --debug
To ease debugging, don't delete intermediate files.
--version
Output version information and exit.
-h, --help
Display help and exit.
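The page-range syntax accepted by -p, --pages can be sketched in Python as follows. The function name parse_page_range is hypothetical. Merging duplicate pages, sorting the result and rejecting reversed sub-ranges such as 42-37 are assumptions made for this example; the text above does not say how ocrodjvu treats those cases.

```python
def parse_page_range(spec):
    """Expand a spec such as '1,17,37-42' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        # A sub-range is either 'N' or 'N-M'; int() rejects anything else.
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1; reversed ranges are rejected (assumption).
        if start < 1 or end < start:
            raise ValueError("invalid sub-range: %r" % part)
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For example, parse_page_range("17,37-42") yields pages 17 and 37 through 42.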
ENVIRONMENT
The following environment variables affect ocrodjvu:
tesslanguage
Recognition language for Tesseract.
(Use of this variable is deprecated in favor of the --language
option.)
TMPDIR
Directory for temporary files. The default is /tmp.
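The way the default recognition language depends on the engine and on tesslanguage, as described under --language, can be sketched in Python. The function name default_language is hypothetical, and treating an empty tesslanguage as unset is an assumption; the sketch only restates the documented rule and is not taken from ocrodjvu's source.

```python
import os

def default_language(engine, environ=None):
    """Return the language used when --language is not given."""
    if environ is None:
        environ = os.environ
    if engine == "ocropus":
        # Assumption: an empty tesslanguage value is treated as unset.
        return environ.get("tesslanguage") or "eng"
    # For every other engine the default is always 'eng'.
    return "eng"
```

Passing an explicit mapping as environ makes the rule easy to try out without touching the real environment.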
SEE ALSO
djvu(1), ocroscript(1), tesseract(1)
AUTHOR
Jakub Wilk <jwilk@jwilk.net>
Author.
COPYRIGHT
Copyright © 2008, 2009, 2010 Jakub Wilk
NOTES
1. OCRopus
http://ocropus.googlecode.com/
2. Cuneiform for Linux
http://launchpad.net/cuneiform-linux
3. Unicode Text Segmentation
http://unicode.org/reports/tr29/