fst-compiler-utf8 - Two compilers for SFST programs

NAME

       fst-compiler-utf8 - Two compilers for SFST programs

SYNOPSIS

       fst-compiler grammar-file [ output-file ]
       fst-compiler-utf8 grammar-file [ output-file ]

OPTIONS

       -c     Store  the  transducer  in  compact format which is used by fst-
              infl2 and.

       -l     Store the transducer in lowmem format.

       -s     Switch upper and lower level of the transducer. You have to  use
              this  switch in order to use fst-infl (fst-infl2, fst-infl3) for
              generation rather than analysis.

DESCRIPTION

       fst-compiler is a compiler for  finite-state  transducer  programs.  It
       generates  a  minimized  finite state transducer which can be used with
       fst-mor, fst-infl, fst-print, fst-compare, fst-parse, and  fst-lattice.
       The  compact  transducer  representation which is generated with the -c
       flag, is supported by fst-infl2, fst-train, and fst-match.  The memory-
       efficient  transducer  representation  which  is  generated with the -l
       flag, is only supported by fst-infl3.

       The first program argument is the name of a  file  which  contains  the
       transducer  program.  The  programming language is described below. The
       second argument is  the  name  of  the  file  to  which  the  resulting
       transducer  will  be  written  in  binary form. If a second argument is
       missing, the output will be written to stdout.

       fst-compiler-utf8 differs  from  fst-compiler  only  in  the  character
       encoding.  fst-compiler-utf8 supports UTF8 encoding of the source files
       whereas fst-compiler is to be  used  for  8-Bit  character  codes  like
       latin1  which are an extension of the ASCII code. Information about the
       encoding is stored in the transducer files and used by the  other  SFST
       programs.

FILE FORMATS

       A transducer program consists of an (optional) sequence of alphabet and
       variable definitions followed by a single transducer  expression  which
       defines the result transducer.

       Alphabet

       An  alphabet  definition consists of the keyword ALPHABET followed by =
       and some transducer expression e.g.

           ALPHABET = [a-z]:[A-Z]

       This command  redefines  the  alphabet  as  the  set  of  symbol  pairs
       occurring  on  the  transitions  of the transducer. Occurrences of two-
       level operators, negation operators and unquoted periods always have to
       be preceded by an alphabet definition.

       Variables

       There  are  two different types of variables.  Symbol set variables are
       enclosed by hash signs (#) and take symbol  sequences  (see  below)  as
       values:

           #UC# = A-Z
           #LC# = a-z

       Transducer  variables  are enclosed by dollar signs and take transducer
       expressions as values:

           $MAP$ = [a-z]:[A-Z]+
           $MAP$ = [#LC#]:[#UC#]+

       Variables whose name starts with the symbol ‘=’ are  special  agreement
       variables.  If  an  agreement  variable  occurs  more  than  once  in a
       transducer expression, it will always have the same value. Consider the
       following transducer program:

           $=1$ = [abc]
           $=1$ X $=1$

       The  result  transducer  recognizes the strings aXa, bXb, and cXc. Only
       acyclic transducers (i.e. transducers  with  a  finite  set  of  string
       mappings) can be assigned to agreement variables.

       Symbols

       A symbol is either

       - a single character like A s 5,

       - a quoted character like \* or \_,

       - a multi-character symbol like <X> or <ab.c5> (which is always
         enclosed in angle brackets) or

       - a backslash followed by a number which is the numeric code of the
         designated character

       - the null symbol <>.

       Symbol sequence

       A  symbol sequence is a sequence of characters, multi-character symbols
       and character ranges, e.g. a-z \. <x>.

       symbol range

       A symbol range is either

       - a single symbol

       - a symbol sequence enclosed in square brackets like [A-Za-z] or

       - a symbol sequence starting with ^ and  enclosed  in  square  brackets
       like [^A-Za-z] (designating the complement of [a-zA-Z]) or

       - the period (which represents any symbol from the alphabet)

       Transducer expressions

       A transducer expression (TE) is recursively defined as follows:

       - A pair of two symbol ranges separated by a colon is a TE.

         [a-z]:[a-Z]

       - A single symbol range like [a-z] is a TE.
         It is a short form for [a-z]:[a-z].

       - Two symbol sequences enclosed in braces and separated by a colon are
        a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d c:e <>:f.

       - X Y is a TE if X and Y are TEs.
         (Blanks are ignored unless they are quoted.)

       - (X) is a TE if X is a TE.

       -  X  op  is  a  TE  is  X  is  a  TE and op is either * (Kleene’s star
       operator), +
        (Kleene’s plus operator), or ? (optionality operator)

       - op X is a TE is X is a TE and op is either ! (negation operator), ^
        (target  language  extraction operator), _ (source language extraction
        operator), or ^_ (source and target switch operator).

       - X op Y is a TE is X and Y are TEs and op is either & (conjunction
        operator), | (disjunction operator), || (composition operator),  or  -
        (subtraction operator)

       - L x op y R is a TE if L and R are TEs, x and y are symbol ranges and
        op  is  either => (two-level restriction), <= (two-level coercion), or
        <=> (two-level restriction and coercion).

       - X op L__R is a TE if X, L and R are TEs and op is either ^-> (upward
        replacement), _-> (downward replacement), /->  (leftward  replacement)
        or  \->  (rightward  replacement).  Furthermore,  L  and R must define
        automata  (i.e.  which  map  their  strings  onto  themselves).  These
        operators correspond to Karttunen’s replace operators. If the arrow is
        followed by a question mark (?), the replacement becomes optional.

       - X << l is a TE if X is a TE, and l is either of the form
        a or the form a:b where a and b are single characters or symbols.  The
        result  is  a  transducer  where  l  was  freely  inserted into X. The
        transducer ab << c for instance is equivalent to c*ac*bc*.

       - X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN and R1
        through RN are TEs, and op is  either  =>  (general  restriction),  <=
        (general  coercion),  ^=>  (general surface restriction), ^<= (general
        surface coercion), ^<=> (general surface  restriction  and  coercion),
        _=>  (general  deep  restriction),  _<=  (general deep coercion), _<=>
        (general  deep  restriction  and  coercion).  (These  operators   were
        implemented following a suggestion by Anssi Yli-Jyra.)

       - "fname" is a TE. The compiler reads the file named fname and turns
        it  into a transducer of the form line1|line2|line3|... where linex is
        the x-th line of the file. All characters  other  than  :  and  \  are
        interpreted  literally  (i.e.  not as operators). This TE is typically
        used e.g. to read morpheme list from a file.

       - "<fname>" is a TE. The compiler reads a pre-compiled transducer from
        the file named fname. This

       Further Features

       Comments start with the symbol % and extend up to the end of the  line.
       Blanks are ignored unless they are quoted. Expressions terminate at the
       end of a line unless the end of line is preceded by a  backslash.   The
       command

       #include "fname"

       can be used to insert source code from a file named fname.  The command

       RE >> "fname"

       stores the regular expression RE in the file fname.

EXAMPLE

       Here is an example of a simple transducer program.  Assuming  that  the
       file "adj-stems" contains the two lines

          easy
          late
          big

       this  transducer  will  correctly  analyse  the  adjective  forms easy,
       easier, easiest and late, later, and latest.

       ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>

       $R$ = y<=>i (<ADJ>:<> e)

       $R2$ = e<=><> (<ADJ>:<> e)

       $R$ = $R$ & $R2$

       $Stems$ = "adj-stems"

       $S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

       $S$ || $R$

EXIT STATUS

       fst-compiler returns 0 unless some error occurs.

BUGS

       The compiler gets the operator precedence wrong in  case  of  two-level
       rules and interprets the expression "ab c<=>d ef" as "a(b c<=>d (ef))".
       Therefore, you should always surround the  left  context  of  two-level
       rules with parenthesis: (ab) c<=>d (ef)

AUTHOR

       Helmut Schmid, Institute for Computational Linguistics,  University  of
       Stuttgart,   Email:   schmid@ims.uni-stuttgart.de,   This  software  is
       available under the GNU Public License.

                                 December 2004                 fst-compiler(1)

NAME

SYNOPSIS

OPTIONS

DESCRIPTION

FILE FORMATS

EXAMPLE

EXIT STATUS

BUGS

SEE ALSO

AUTHOR