NAME
unicharset_extractor - generate the unicharset data file from box files
SYNOPSIS
unicharset_extractor fontfile_1.box fontfile_2.box ...
DESCRIPTION
This manual page briefly documents the unicharset_extractor command,
which is one step in training tesseract for a new language. Tesseract
needs to know the set of possible characters it can output. To generate
the unicharset data file, run unicharset_extractor on the bounding box
files of the training pages, as shown in the synopsis.
Tesseract is a commercial-quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top three
evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.
Tesseract needs access to four character properties: isalpha, isdigit,
isupper and islower. This data must be encoded in the unicharset data
file. Each line of this file corresponds to one character: the
character in UTF-8, followed by a hexadecimal number representing a
binary mask that encodes the properties. Each bit corresponds to one
property, and a bit set to 1 means that the property is true. The bit
ordering, from least significant bit to most significant bit, is:
isalpha, islower, isupper, isdigit. For example, a lowercase letter has
mask 3 (isalpha and islower) and a digit has mask 8 (isdigit only).
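The mask layout described above can be illustrated with a short Python
sketch. It is not part of tesseract: the names PROPERTY_BITS,
decode_properties and encode_properties are hypothetical, and
encode_properties assumes that Python's own str predicates are an
acceptable stand-in for the classification tesseract expects, which may
not hold for every script. Only the mask field is modelled, not the
layout of a complete unicharset line.

```python
# Bit values in the order given above, least significant bit first.
PROPERTY_BITS = (
    ("isalpha", 0x1),
    ("islower", 0x2),
    ("isupper", 0x4),
    ("isdigit", 0x8),
)


def decode_properties(mask_hex):
    """Return the names of the properties whose bit is set in mask_hex."""
    mask = int(mask_hex, 16)
    return [name for name, bit in PROPERTY_BITS if mask & bit]


def encode_properties(ch):
    """Build the hexadecimal mask for one character.

    Assumption: Python's str.isalpha(), str.islower(), str.isupper() and
    str.isdigit() are used to decide each property.
    """
    mask = 0
    for name, bit in PROPERTY_BITS:
        if getattr(ch, name)():
            mask |= bit
    return format(mask, "x")


if __name__ == "__main__":
    for ch in "aA7.":
        mask = encode_properties(ch)
        print(ch, mask, decode_properties(mask))
```

Under these assumptions an uppercase letter sets the isalpha and isupper
bits, and a punctuation mark sets none of them.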
SEE ALSO
feh(1), convert(1), mftraining(1), cntraining(1), tesseract(1),
wordlist2dawg(1).
AUTHOR
tesseract was written by Ray Smith.
This manual page was written by Jeffrey Ratcliffe
<Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be used
by others).
August 21, 2007