NAME

       wordlist2dawg - make DAWG dictionary files for tesseract from a wordlist

SYNOPSIS

       wordlist2dawg frequent_words_list freq-dawg

       wordlist2dawg words_list word-dawg

DESCRIPTION

       This manual page briefly documents the wordlist2dawg command.

       wordlist2dawg is used as part of the process of training tesseract
       for a new language. Tesseract uses three dictionary files for each
       language. Two of the files are coded as a Directed Acyclic Word
       Graph (DAWG), and the other is a plain UTF-8 text file. To make the
       DAWG dictionary files, you first need a wordlist for your language,
       formatted as a UTF-8 text file with one word per line. Split the
       wordlist into two sets, the frequent words and the rest of the
       words, and then run wordlist2dawg once on each set to make the two
       DAWG files, as shown in the SYNOPSIS.
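
       This page does not say how to split the wordlist into frequent
       words and the rest. The following is a minimal sketch of one
       possible approach, not part of tesseract: it assumes you have a
       list of words taken from some text corpus, counts how often each
       word occurs, and treats the top_n most common words as the
       frequent set. The function names split_wordlist and
       write_wordlist, and the top_n cutoff, are hypothetical.

```python
from collections import Counter
from pathlib import Path


def split_wordlist(corpus_words, top_n):
    """Return (frequent, rest): two disjoint lists of distinct words.

    Assumption: "frequent" means the top_n most common words in
    corpus_words; the man page does not define a cutoff.
    """
    counts = Counter(corpus_words)
    # Rank by descending count; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    frequent = ranked[:top_n]
    rest = sorted(ranked[top_n:])
    return frequent, rest


def write_wordlist(path, words):
    """Write words as a UTF-8 text file with one word per line."""
    text = "".join(word + "\n" for word in words)
    Path(path).write_text(text, encoding="utf-8")
```

       Writing the two lists with write_wordlist to files named
       frequent_words_list and words_list would produce inputs in the
       one-word-per-line UTF-8 format that the two wordlist2dawg
       invocations expect.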

       tesseract is a commercial-quality OCR engine originally developed
       at HP between 1985 and 1995. In 1995, this engine was among the top
       three evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.

SEE ALSO

       feh(1),        convert(1),        mftraining(1),         cntraining(1),
       unicharset_extractor(1), tesseract(1).

AUTHOR

       tesseract was written by Ray Smith.

       This     manual    page    was    written    by    Jeffrey    Ratcliffe
       <Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be  used
       by others).

                                August 21, 2007