NAME
swath - General-purpose Thai word segmentation utility
SYNOPSIS
swath [options] < infile > outfile
DESCRIPTION
Thai script has no word delimiter. Applications need a Thai word list
to recognize word boundaries before they can do useful things with
Thai text, such as line wrapping.
Swath provides a word analysis filter that inserts word delimiters into
a text stream. It reads text from standard input, analyzes it for word
boundaries by consulting a Thai word list, and writes the same text to
standard output with the predefined word delimiters inserted.
Currently, it can read plain text, HTML, RTF, LaTeX and Lambda (Unicode
version of LaTeX with the Omega typesetter kernel) documents and inserts
the commonly used word delimiter for each format (pipe ‘|’ for plain
text). The user can always override this with a preferred delimiter.
OPTIONS
-b [delimiter]
Define a string to be used as the word delimiter code in the
output text.
-d [dict-dir]
Specify an alternative dictionary location. dict-dir must contain
the swath dictionary files ‘swathdic.br’ and ‘swathdic.tl’.
-f [format]
Specify the format of the input. Possible formats are: html, rtf,
latex, lambda.
-m [scheme]
Choose the word matching scheme used when analyzing word boundaries.
Possible schemes are ‘long’ (for longest or greedy matching) and
‘max’ (for maximal matching, which prefers the fewest words).
Maximal matching is the default.
-u input-enc,output-enc
Specify the encodings of input and output. input-enc and output-enc
can each be either ‘u’ (for UTF-8 encoding) or ‘t’ (for TIS-620
encoding). Swath will convert the character encoding as
necessary. If omitted, TIS-620 is assumed for both input and
output.
-v, --verbose
Turn on verbose mode.
-help, --help
Show help.
EXAMPLES
For LaTeX (to be used with the thailatex package):
$ swath -f latex < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex
For HTML (to provide web pages to web browsers that cannot wrap Thai
lines properly, but support the <wbr> tag):
$ swath -f html < myweb.html > myweb-wbr.html
To preprocess a Thai UTF-8 encoded LaTeX file for thailatex, which
always works with TIS-620:
$ swath -f latex -u u,t < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex
This is equivalent to filtering with iconv(1):
$ iconv -f UTF-8 -t TIS-620 thaifile.tex | swath -f latex > thaifile.ttex
$ latex thaifile.ttex
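As a rough illustration of what the UTF-8 to TIS-620 conversion in the
examples above amounts to, the following Python sketch re-encodes a short
Thai string. It is not part of swath; the helper name utf8_to_tis620 and
the sample string are assumptions made for illustration, and it relies
only on Python's standard "tis-620" codec.

```python
def utf8_to_tis620(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as TIS-620 (one byte per Thai character)."""
    return data.decode("utf-8").encode("tis-620")


# Sample text (hypothetical input): the three-character word "ไทย".
utf8_bytes = "ไทย".encode("utf-8")
tis_bytes = utf8_to_tis620(utf8_bytes)

# UTF-8 needs three bytes per Thai character, TIS-620 needs one.
print(len(utf8_bytes), len(tis_bytes))
```

The conversion fails with UnicodeEncodeError if the text contains
characters that TIS-620 cannot represent.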
To use the longest matching scheme with a LaTeX document:
$ swath -f latex -m long < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex
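The difference between the ‘long’ and ‘max’ schemes of the -m option can
be seen on a toy example. The Python sketch below is not swath's
implementation: the function names longest_match and maximal_match, the
small English word list, and the choice to emit any unmatched character
as a one-character token are all assumptions made for illustration.

```python
def longest_match(text, words):
    """Greedy scan: at each position take the longest dictionary word."""
    max_len = max((len(w) for w in words), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character when nothing matches.
            if text[i:i + n] in words or n == 1:
                out.append(text[i:i + n])
                i += n
                break
    return out


def maximal_match(text, words):
    """Dynamic programming: pick a segmentation with the fewest tokens."""
    n = len(text)
    max_len = max((len(w) for w in words), default=1)
    best = [0] + [None] * n   # best[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)      # back[i]: start of the last token in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if text[j:i] in words or i - j == 1:
                cand = best[j] + 1
                if best[i] is None or cand < best[i]:
                    best[i], back[i] = cand, j
    out, i = [], n
    while i > 0:
        out.append(text[back[i]:i])
        i = back[i]
    return out[::-1]


WORDS = {"be", "beat", "atom"}       # toy dictionary (assumption)
print("|".join(longest_match("beatom", WORDS)))   # beat|o|m
print("|".join(maximal_match("beatom", WORDS)))   # be|atom
```

Greedy matching commits to "beat" and is left with two stray characters,
while the fewest-words criterion finds the two-word reading.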
AUTHOR
This manual page was written by Theppitak Karoonboonyanan
<thep@linux.thai.net>.
January 2008