NAME

       unicharset_extractor - generate the unicharset data file used to train
       the tesseract OCR engine

SYNOPSIS

       unicharset_extractor fontfile_1.box fontfile_2.box ...

DESCRIPTION

       This manual page briefly documents the unicharset_extractor command.

       unicharset_extractor is used as part of the process of training
       tesseract for a new language. Tesseract needs to know the set of
       possible characters it can output. To generate the unicharset data
       file, run unicharset_extractor on the bounding box files of the
       training pages, as shown in the synopsis.

       tesseract is a commercial-quality OCR engine originally developed at HP
       between 1985 and 1995. In 1995, it was among the top three engines
       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.

       Tesseract needs access to four properties of each character: isalpha,
       isdigit, isupper and islower. This data must be encoded in the
       unicharset data file. Each line of the file corresponds to one
       character: the character in UTF-8, followed by a hexadecimal number
       representing a binary mask that encodes the properties. Each bit
       corresponds to one property, and a bit set to 1 means that the property
       is true. The bit ordering, from least significant bit to most
       significant bit, is: isalpha, islower, isupper, isdigit. For example, a
       lowercase letter has isalpha and islower set, which gives the mask 3.
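
       The bit layout described above can be illustrated with a short Python
       sketch. It is not part of tesseract. The helper names (decode_mask,
       encode_mask, parse_line) are hypothetical, the use of Python's str
       predicates only approximates the properties tesseract expects, and
       parse_line assumes that the character and its mask are separated by
       whitespace, which this page does not state.

```python
# Bit values follow the ordering given above, least significant bit first.
IS_ALPHA = 0x1
IS_LOWER = 0x2
IS_UPPER = 0x4
IS_DIGIT = 0x8

PROPERTY_BITS = (
    ("isalpha", IS_ALPHA),
    ("islower", IS_LOWER),
    ("isupper", IS_UPPER),
    ("isdigit", IS_DIGIT),
)


def decode_mask(mask):
    """Return a dict mapping each property name to True or False."""
    return {name: bool(mask & bit) for name, bit in PROPERTY_BITS}


def encode_mask(ch):
    """Build a property mask for one character.

    Assumption: Python's str predicates stand in for the character
    properties; they may differ from what tesseract expects for some
    scripts.
    """
    mask = 0
    if ch.isalpha():
        mask |= IS_ALPHA
    if ch.islower():
        mask |= IS_LOWER
    if ch.isupper():
        mask |= IS_UPPER
    if ch.isdigit():
        mask |= IS_DIGIT
    return mask


def parse_line(line):
    """Split one line into (character, mask).

    Assumption: the character and the hexadecimal mask are separated
    by whitespace.
    """
    char, hex_mask = line.rsplit(None, 1)
    return char, int(hex_mask, 16)


if __name__ == "__main__":
    for ch in "aA7!":
        print(ch, format(encode_mask(ch), "x"), decode_mask(encode_mask(ch)))
```

       Under these assumptions, an uppercase letter sets isalpha and isupper,
       a digit sets only isdigit, and punctuation sets no bits at all.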

SEE ALSO

       feh(1),   convert(1),   mftraining(1),   cntraining(1),   tesseract(1),
       wordlist2dawg(1).

AUTHOR

       tesseract was written by Ray Smith.

       This    manual    page    was    written    by    Jeffrey     Ratcliffe
       <Jeffrey.Ratcliffe@gmail.com>,  for the Debian project (but may be used
       by others).

                                August 21, 2007