NAME
wordlist2dawg - convert a wordlist to a DAWG dictionary file for tesseract
SYNOPSIS
wordlist2dawg wordlist dawg-file
DESCRIPTION
This manual page briefly documents the wordlist2dawg command, which is
part of the process of training tesseract for a new language.
Tesseract uses three dictionary files for each language. Two of the
files are encoded as a Directed Acyclic Word Graph (DAWG), and the
third is a plain UTF-8 text file. To make the DAWG dictionary files,
you first need a wordlist for your language, formatted as a UTF-8 text
file with one word per line. Split the wordlist into two sets, the
frequent words and the rest of the words, and then run wordlist2dawg
on each set to make the DAWG files:
wordlist2dawg frequent_words_list freq-dawg
wordlist2dawg words_list word-dawg
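This page does not say how to decide which words count as frequent.
The following is only a minimal sketch of one possible approach,
assuming you have a tokenized text corpus for your language. The
function names split_wordlist and write_wordlist and the top_n cutoff
are hypothetical and are not part of tesseract.

```python
from collections import Counter
from pathlib import Path


def split_wordlist(tokens, top_n):
    """Split corpus tokens into (frequent, rest) lists of unique words.

    top_n is an arbitrary cutoff chosen for illustration (assumption).
    """
    counts = Counter(tokens)
    # Rank by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:top_n], sorted(ranked[top_n:])


def write_wordlist(path, words):
    """Write one word per line as UTF-8, the wordlist format described here."""
    Path(path).write_text("".join(w + "\n" for w in words), encoding="utf-8")


if __name__ == "__main__":
    # Toy corpus standing in for real text in the target language.
    tokens = "the cat and the dog and the bird".split()
    frequent, rest = split_wordlist(tokens, top_n=2)
    print(frequent)
    print(rest)
```

Writing the two lists with write_wordlist("frequent_words_list",
frequent) and write_wordlist("words_list", rest) would produce input
files with the names used in the wordlist2dawg examples on this page.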
tesseract is a commercial-quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top three
engines evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.
SEE ALSO
feh(1), convert(1), mftraining(1), cntraining(1),
unicharset_extractor(1), tesseract(1).
AUTHOR
tesseract was written by Ray Smith.
This manual page was written by Jeffrey Ratcliffe
<Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be used
by others).
August 21, 2007