NAME
ocropus - command line OCR tool
SYNOPSIS
ocroscript <script> <arguments>
DESCRIPTION
The available scripts are the files in the directory named by
$OCROSCRIPTS (/usr/share/ocropus/scripts/ by default).
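As a minimal sketch, the directory can be listed with a small shell function. The function name list_ocroscripts is hypothetical, and the sketch assumes that every script is a file ending in .lua, as the names listed later on this page suggest:

```shell
# Hypothetical helper: print the names of the scripts in the OCRopus
# script directory, one per line, without the .lua suffix.
# Uses $OCROSCRIPTS when set, otherwise the default path from this page.
list_ocroscripts() {
    dir="${OCROSCRIPTS:-/usr/share/ocropus/scripts}"
    for f in "$dir"/*.lua; do
        # Skip the unexpanded pattern when the directory has no .lua files.
        [ -e "$f" ] || continue
        basename "$f" .lua
    done
}
```

Calling list_ocroscripts with no arguments prints whatever is installed on the local system; the result depends on the OCRopus version.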
The ‘recognize’ script uses Tesseract for recognition and writes the
HTML-based hOCR output to stdout. Tesseract is probably the most mature
text recognizer within OCRopus at the moment. Tesseract does not do
layout analysis natively, but combined with OCRopus it makes for a
pretty good OCR system:
$ ocroscript recognize page.png > page.html
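For a multi-page document, the same command can be wrapped in a loop. The following is only a sketch: the function name recognize_dir is hypothetical, and it assumes the pages are PNG files in a single directory and that ocroscript returns a non-zero exit status on failure:

```shell
# Hypothetical helper: run the recognize script on every PNG in a
# directory, writing the hOCR output next to each image as <name>.html.
recognize_dir() {
    dir="$1"
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no PNG files.
        [ -e "$img" ] || continue
        out="${img%.png}.html"
        # Same invocation as the single-page example on this page.
        ocroscript recognize "$img" > "$out" || echo "failed: $img" >&2
    done
}
```

For example, recognize_dir scans/ would write scans/page1.html for scans/page1.png. A page that fails is reported on stderr and the loop continues with the next one.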
Here is a brief summary of the remaining scripts. Their command line
arguments are not documented here; you will need to look at each script
to see what they are:
degrade.lua
Simple document image degradation.
hocr-to-text.lua
Convert hOCR output to plain text.
line-clean.lua
Given a line image, remove marginal noise and fix some other
problems.
sauvola.lua
Perform Sauvola thresholding.
SEE ALSO
tesseract(1)
AUTHOR
ocroscript was written by Thomas Breuel.
This manual page was written by Jeffrey Ratcliffe
<Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be used
by others).
June 06, 2008 ocroscript(1)