konwert - interface for various character encoding conversions

NAME

       konwert - interface for various character encoding conversions

SYNOPSIS

       konwert FILTER [FILE]... [-o DEST | -O]

DESCRIPTION

       Konwert  allows  filtering multiple files through multiple filters.  It
       filters the specified FILEs, or stdin if none are given.

       A simple FILTER is the name of an executable file from the directory
       ~/.konwert/filters or the system-wide one, normally
       /usr/share/konwert/filters. Such a program itself filters stdin to
       stdout.

       The filtering rule can be more complex:

       konwert FILTER1+FILTER2 means konwert FILTER1 | konwert FILTER2.

       konwert FORMAT1-FORMAT2, unless such a filter exists, tries to find a
       common FORMAT3 such that both filters FORMAT1-FORMAT3 and
       FORMAT3-FORMAT2 exist.
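
       The lookup of an intermediate format can be sketched as follows. This
       is only an illustration of the rule, not konwert's own code; the
       function name resolve and the set of installed filters are
       hypothetical, and format names are assumed to contain no hyphen.

```python
def resolve(spec, available):
    """Return the simple filters needed to satisfy a FORMAT1-FORMAT2 spec.

    `available` is a hypothetical set of installed filter names.
    """
    if spec in available:
        return [spec]  # a direct filter exists, use it as is
    src, dst = spec.split("-", 1)
    # Look for a common FORMAT3 with both src-FORMAT3 and FORMAT3-dst present.
    for name in sorted(available):
        head, sep, mid = name.partition("-")
        if sep and head == src and f"{mid}-{dst}" in available:
            return [name, f"{mid}-{dst}"]
    raise LookupError(f"no conversion path for {spec}")


# Hypothetical installation with no direct cp437-iso2 filter.
filters = {"cp437-utf8", "utf8-iso2", "rot13"}
print(resolve("cp437-iso2", filters))
```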

       konwert  FILTER/ARG/...  passes  arguments to the filter. Arguments can
       also be specified here: FORMAT1/ARGS-FORMAT2.  The meaning of arguments
       depends on the particular filter.

       konwert '(COMMAND ARGS...)' executes this arbitrary shell command. This
       is useful with the -o or -O options. The command cannot contain the
       string )+, which will terminate this filter’s specification.
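
       To see why )+ is special, here is a minimal sketch of how a filter
       specification could be split into its stages. It assumes only the
       rules stated above (stages are joined with +, and a parenthesised
       command ends at the first )+); the function name split_stages is
       hypothetical and this is not konwert's actual parser.

```python
def split_stages(spec):
    """Split a filter specification on '+', keeping '(COMMAND)' stages whole."""
    stages = []
    i = 0
    while i < len(spec):
        if spec[i] == "(":
            # A shell command runs up to the first ')+' or to the end of the spec.
            end = spec.find(")+", i)
            stop = len(spec) if end == -1 else end + 1
        else:
            # A plain filter name runs up to the next '+'.
            end = spec.find("+", i)
            stop = len(spec) if end == -1 else end
        stages.append(spec[i:stop])
        i = stop + 1  # skip the '+' separator
    return stages


print(split_stages("iso2-utf8+(expr 1 + 2)+rot13"))
```

       A + inside the command is harmless, but a )+ inside it would cut the
       command short.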

   OPTIONS
       -o DEST   output goes to this file/directory instead of stdout

       -O        every input file is replaced with its translation

       --help    display help and exit

       --version output version information and exit

       Redirecting  output  to  one  of  the  source files with either -o or >
       instead of -O will corrupt it! Option -O creates a  temporary  file  in
       /tmp and later copies it back onto the source.
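
       The temporary-file approach described for -O can be illustrated with
       the sketch below. It is an assumption-laden illustration, not
       konwert's implementation: filter_in_place and transform are
       hypothetical names, and the temporary file is created wherever the
       tempfile module decides.

```python
import os
import shutil
import tempfile


def filter_in_place(path, transform):
    """Replace the file at `path` with its filtered contents, line by line."""
    fd, tmp_path = tempfile.mkstemp()
    try:
        # The result goes to a separate file, so the source is never
        # truncated while it is still being read.
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
            for line in src:
                tmp.write(transform(line))
        shutil.copyfile(tmp_path, path)  # copy the result back onto the source
    finally:
        os.remove(tmp_path)
```

       For example, filter_in_place("notes.txt", bytes.upper) would
       upper-case a (hypothetical) file notes.txt without ever opening it
       for reading and writing at the same time.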

CHARACTER ENCODING CONVERSIONS

       You  can  convert  text  between  any two charsets, for example konwert
       cp437-iso2.

       Characters unavailable in the target charset are replaced with
       approximations built from available characters. An approximation need
       not be a single character.
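
       As a rough illustration of such substitution, the sketch below keeps
       every character the target charset can encode and replaces the rest
       from a small table. The table APPROX and the function to_charset are
       hypothetical; konwert ships its own, far larger tables.

```python
# Hypothetical approximation table, for illustration only.
APPROX = {"ł": "l", "ó": "o", "ź": "z", "ß": "ss", "©": "(c)", "→": "->"}


def to_charset(text, charset):
    """Keep encodable characters; approximate the others (or use '?')."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(APPROX.get(ch, "?"))
    return "".join(out)


print(to_charset("Straße → łódź ©", "ascii"))      # Strasse -> lodz (c)
print(to_charset("Straße → łódź ©", "iso8859_1"))  # Straße -> lódz ©
```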

       The following character sets are currently supported:

       ascii  7bit ASCII

       utf8 = unicode  Unicode UTF-8

       iso1 = isolatin1
              ISO-8859-1 aka ISO Latin 1 (Western European)
       iso2 = isolatin2
              ISO-8859-2 aka ISO Latin 2 (Central European)
       iso3 = isolatin3
              ISO-8859-3 aka ISO Latin 3 (Esperanto)
       iso4 = isolatin4
              ISO-8859-4 aka ISO Latin 4 (Baltic)
       iso5 = isolatincyr
              ISO-8859-5 (Cyrillic)
       iso6 = isolatinarabic
              ISO-8859-6 (Arabic)
       iso7 = isolatingreek
              ISO-8859-7 (Greek)
       iso8 = isolatinhebrew
              ISO-8859-8 (Hebrew)
       iso9 = isolatin5 = isolatintur
              ISO-8859-9 aka ISO Latin 5 (Turkish)
       iso10 = isolatin6 = isolatinnordic
              ISO-8859-10 aka ISO Latin 6 (Nordic)
       iso12 = isolatin7 = isolatinceltic
              ISO-8859-12 aka ISO Latin 7 (Celtic) - Draft
       iso13 = isolatin8 = isolatinbaltic
              ISO-8859-13 aka ISO Latin 8 (Baltic) - Draft
       iso14 = isolatin9 = isolatinsami
              ISO-8859-14 aka ISO Latin 9 (Sámi) - Draft
       iso15  ISO-8859-15 - Draft

       koi8r    KOI8-R (Russian)
       koi8u    KOI8-U (Ukrainian, Byelorussian)
       koi8uni  KOI8-Uni (Cyrillic)

       cp1250 = wince = winlatin2    Windows CP-1250 aka Win Latin 2  (Central
                                     European)
       cp1251 = wincyr               Windows CP-1251 (Cyrillic)
       cp1252 = winwest = winlatin1  Windows  CP-1252 aka Win Latin 1 (Western
                                     European)
       cp1253 = wingr                Windows CP-1253 (Greek)
       cp1254 = wintur               Windows CP-1254 (Turkish)
       cp1255 = winhebrew            Windows CP-1255 (Hebrew)
       cp1256 = winarabic            Windows CP-1256 (Arabic)
       cp1257 = winbaltic            Windows CP-1257 (Baltic)
       cp1258 = winviet              Windows CP-1258 (Vietnamese)

       cp437 = icmeng               DOS CP-437 (English)
       cp737 = dosgreek             DOS CP-737 (Greek)
       cp775 = dosbaltic            DOS CP-775 (Baltic)
       cp850 = doswest = doslatin1  DOS  CP-850  aka  DOS  Latin  1   (Western
                                    European)
       cp852 = dosce = doslatin2    DOS   CP-852  aka  DOS  Latin  2  (Central
                                    European)
       cp855 = doscyr               DOS CP-855 (Cyrillic)
       cp857 = dostur               DOS CP-857 (Turkish)
       cp860 = dosportugal          DOS CP-860 (Portugal)
       cp861 = dosiceland           DOS CP-861 (Icelandic)
       cp862 = doshebrew            DOS CP-862 (Hebrew)
       cp863 = doscanadfr           DOS CP-863 (Canadian French)
       cp864 = dosarabic            DOS CP-864 (Arabic)
       cp865 = dosnordic            DOS CP-865 (Nordic)
       cp866 = dosrussian           DOS CP-866 (Russian)
       cp869 = dosgreek2            DOS CP-869 (Greek2)
       cp874 = dosthai              DOS CP-874 (Thai)

       mac         Macintosh Roman (Western European)
       macce       Macintosh Central European
       maccyr      Macintosh Cyrillic
       macgreek    Macintosh Greek
       maciceland  Macintosh Icelandic
       mactur      Macintosh Turkish

       csk,
       cyfromat,
       dhn,
       fidomazovia,
       iea,
       logic,
       mazovia,
       microvex     DOS charsets for Polish

       amigapl,
       fat,
       xjp      Amiga charsets for Polish

       kamenicky  DOS charset for Czech and Slovak

       wingreek  WinGreek (Windows font-based encoding for ancient Greek)

       babelpl  TeX [polish]{babel}: "a"c"e"l"n"o"s"z"r
       ciachy   TeX /-prefixing: /a/c/e/l/n/o/s/x/z

       xmetodo        Esperanto: cx gx hx jx sx ux (vx w)
       hmetodo        Esperanto: ch gh hh jh sh u
       antauxcxap     Esperanto: ^c ^g ^h ^j ^s ^u (~u)
       postcxap       Esperanto: c^ g^ h^ j^ s^ u^ (u~)
       apostrofoj     Esperanto: c g h j s u
       malapostrofoj  Esperanto: c g h j s u

       viscii  VISCII (Vietnamese)
       viqri   Vietnamese Quoted Readable Implicit

       htmldec  SGML/HTML character references (decimal): &#198; &#283;
                &#8594;
       htmlhex  SGML/HTML  character  references  (hexadecimal): &#xC6;
                &#x11B; &#x2192;
       htmlent  SGML/HTML character entities (names): &AElig; &ecaron;
                &rarr;
       html     All three above (only as input format)

       tex    TeX  with  some  LaTeX or AMS-TeX extensions. There is no
              distinction between normal  and  math  mode  -  you  will
              probably have to insert some $’s manually.

       mnemonic   RFC 1345 mnemonics preceded by &
       mnemonic1  RFC 1345 mnemonics preceded by 

       any/LANGUAGE (e.g. any/pl-iso2)
              This special input format detects the encoding
              automatically, based on the frequencies of characters
              found in the text. Every language is associated with a set of
              possible encodings used for it and average frequencies of
              its  letters  (excluding ASCII letters). The best fitting
              encoding is  used  for  conversion.  Currently  supported
              languages  are  cs  (Czech),  de (German), el (Greek), eo
              (Esperanto), es (Spanish), fr (French), he  (Hebrew),  it
              (Italian),  pl  (Polish),  pt (Portuguese), ru (Russian),
              and sv (Swedish).

       varpl  Mixed Polish ISO-8859-2, CP-1250, and UTF-8. If you read
              Polish newsgroups, I suggest installing it as a filter in
              your newsreader (for speed, it’s better to call it
              directly rather than through konwert).

       vareo  Mixed various Esperanto encodings.
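
       The idea behind the any/LANGUAGE input format can be sketched as
       follows. This is a simplified illustration, not konwert's algorithm:
       instead of real letter frequencies it gives every non-ASCII Polish
       letter the same weight, and the names POLISH, CANDIDATES, score and
       detect are all hypothetical.

```python
POLISH = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")
CANDIDATES = ["utf_8", "iso8859_2", "cp1250"]  # encodings tried for Polish


def score(text):
    """Reward non-ASCII Polish letters, penalise any other non-ASCII character."""
    total = 0
    for ch in text:
        if ord(ch) >= 128:
            total += 1 if ch in POLISH else -1
    return total


def detect(data):
    """Return the candidate encoding whose decoding looks most like Polish."""
    best, best_score = None, None
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        s = score(text)
        if best_score is None or s > best_score:
            best, best_score = enc, s
    return best


sample = "zażółć gęślą jaźń"
print(detect(sample.encode("cp1250")))
```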

OPTIONS CONTROLLING THE ABOVE CONVERSIONS

       /1 (e.g. konwert iso2-ascii/1)
              Each unavailable character will be replaced with a single
              approximate character, not a string. This is useful with
              the filterm program or with preformatted text. This
              option is automatically turned on when a filter is used
              as output for filterm.

       /html  Text is assumed to be  HTML.  The  characters  "  &  <  >
              resulting  from  other characters’ approximations will be
              properly escaped as &quot; &amp; &lt;  &gt;.   The  <META
              http-equiv="content-type"             content="text/html;
              charset=..."> header will be fixed if present.

       /htmldec
              Convert META as above.  Unavailable  characters  will  be
              encoded in &#Unicode;.

       /htmlhex
              Convert  META  as  above.  Unavailable characters will be
              encoded in hexadecimal &#xUnicode;.

       /tex   Unavailable  characters  will  be   described   in   TeX.
              Characters  #  $  %  &  \ ^ _ { | } ~ resulting from some
              characters’ approximations will be properly escaped  into
              \# \$ \% \& $\backslash$ \^{} \_ \{ $|$ \} \~{}.

       /asciichar
              Recognizes some ASCII representations of characters, e.g.
              (c) ... 1/2 >=.

       /rosyjski
              Russian text will be replaced with  its  Polish  phonetic
              transcription.

       Some output filters can use language information to choose
       better approximations of unavailable letters, for
       example /de (German): ä → ae instead of a.
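
       The effect of the /htmldec option described above, where unavailable
       characters become decimal character references, resembles what
       Python's standard xmlcharrefreplace error handler does. The sketch
       below only illustrates that output form; it does not rewrite any META
       header, and the function name to_html_charset is hypothetical.

```python
def to_html_charset(text, charset):
    """Encode `text`, writing characters missing from `charset` as &#N;."""
    return text.encode(charset, errors="xmlcharrefreplace")


print(to_html_charset("Æ ě →", "ascii"))      # b'&#198; &#283; &#8594;'
print(to_html_charset("Æ ě →", "iso8859_1"))  # b'\xc6 &#283; &#8594;'
```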

OTHER FILTERS

       any/LANGUAGE-test
              Detects the encoding, but instead of text conversion only
              shows the encoding’s name.  The  additional  option  /all
              shows all possible encodings, sorted from better to worse
              ones.

       cr
       lf
       crlf   Force  specific  end-of-line  marker  convention.   cr  =
              Macintosh,  lf  = Unix and Amiga, crlf = Windows and DOS.
              The input convention is detected automatically.

       expand Expands tabs into  spaces  (uses  the  textutils  program
              expand).

       unexpand
              Compresses  spaces  into tabs (uses the textutils program
              unexpand).

       rmspacesateol
              Removes spaces and tabs at end of line.

       qp-8bit
       8bit-qp
              MIME Quoted Printable encoding: =A3=F3d=BC.

       rtf-8bit
       8bit-rtf
              Rich Text Format: \'a3\'f3d\'9f.

       txt-htmlchar
              Escapes " & < > into SGML/HTML entities &quot; &amp; &lt;
              &gt;.  Useful for including a text file inside HTML <PRE>
              </PRE> tags.

       htmlchar-txt
              Reverse.

       rot13  Guvf vf n qrzbafgengvba bs ebg13.

       toupper
       tolower
              Self-explanatory. Currently ASCII only.

       prn7pl Converts Polish characters to control sequences for
              EPSON-compatible printers. Using only 7-bit characters,
              backspacing of the print head, and vertical positioning
              of the characters ,.’‘ it creates pseudo-Polish glyphs.
              Two options are available: /nlq (default), which
              optimizes output for better-quality printers, and /draft,
              useful e.g. for 9-pin printers.
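
       The sample string given for the qp-8bit and 8bit-qp filters above can
       be reproduced with Python's standard quopri module. Read as
       ISO-8859-2, its bytes spell the Polish word "Łódź". This only
       illustrates the Quoted Printable encoding itself, not how konwert
       implements these filters.

```python
import quopri

raw = "Łódź".encode("iso8859_2")        # the 8-bit form: b'\xa3\xf3d\xbc'
encoded = quopri.encodestring(raw)      # the 8bit-qp direction
decoded = quopri.decodestring(encoded)  # the qp-8bit direction

print(encoded)
print(quopri.decodestring(b"=A3=F3d=BC").decode("iso8859_2"))
```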

FILES

       /usr/share/konwert/filters/*
       ~/.konwert/filters/*

BUGS

       APPLE character in mac* charsets, and CH and  ch  characters  in
       koi8cs  are  not  preserved  in  conversion  even  when they are
       available. Also they don’t respect the /1 option.  Reason:  they
       are not in Unicode.

COPYRIGHT

       Konwert  is  a  package for conversion between various character
       encodings.

       Copyright (c) 1998 Marcin ’Qrczak’ Kowalczyk

       This program is free software; you can  redistribute  it  and/or
       modify  it  under the terms of the GNU General Public License as
       published by the Free Software Foundation; either version  2  of
       the License, or (at your option) any later version.

       This  program is distributed in the hope that it will be useful,
       but WITHOUT ANY WARRANTY; without even the implied  warranty  of
       MERCHANTABILITY  or  FITNESS  FOR A PARTICULAR PURPOSE.  See the
       GNU General Public License for more details.

       You should have received  a  copy  of  the  GNU  General  Public
       License  along  with  this  program;  if  not, write to the Free
       Software Foundation, Inc., 59 Temple Place, Suite  330,  Boston,
       MA  02111-1307  USA

AUTHOR

        __("<   Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.home.ml.org/
        \__/       GCS/M d- s+:-- a21 C+++>+++$ UL++>++++$ P+++ L++>++++$ E->++
         ^^                W++ N+++ o? K? w(---) O? M- V? PS-- PE++ Y? PGP->+ t
       QRCZAK                  5? X- R tv-- b+>++ DI D- G+ e>++++ h! r--%>++ y-
