NAME

       uniconv - convert text to native formats through unicode

SYNOPSIS

       uniconv  -out  output-file [ -decode input-encoding ] [ -encode output-
       encoding ] [ input-file ] [ -todos ] [ -fromdos ] [ -tomac ] [ -frommac
       ]

DESCRIPTION

       The uniconv program decodes text in one encoding and encodes it in
       another.  The text is a stream of 16-bit, 8-bit or 7-bit units.  The
       converted text is sent to the standard output, even for 16-bit
       encodings, unless an output file is specified with the -out option.

       The -decode and -encode options are optional; the default converter
       is utf-8.  The program reads the Unicode map helper files (*.my) from
       the default directory /usr/share/data.  Simple 1-to-1 encodings can be
       added on the fly by adding a my-file, or by setting the yudit.datapath
       property in ~/.yudit/yudit.properties or
       /usr/share/yudit/config/yudit.properties.  By default
       /usr/share/yudit/data is searched.
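
       As an illustration of how these options combine, the sketch below
       builds a uniconv argument list in Python, following the order shown
       in the SYNOPSIS.  The helper name uniconv_argv and the file names are
       hypothetical; only the option names come from this page.  It fills in
       utf-8 when an encoding is left out, which mirrors the documented
       default rather than anything the program requires.

```python
def uniconv_argv(out_file, in_file=None, decode="utf-8", encode="utf-8"):
    """Build an argument list in the order given in the SYNOPSIS.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    argv = ["uniconv", "-out", out_file, "-decode", decode, "-encode", encode]
    if in_file is not None:
        argv.append(in_file)
    return argv


# Example: a Shift-JIS file converted to utf-8 (file names are made up).
command = uniconv_argv("notes.utf8.txt", "notes.sjis.txt", decode="shift-jis")
```

       On a system where uniconv is installed, the resulting list could be
       handed to subprocess.run(command, check=True).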

       My-files can be created by a program called makeumap.

       Files can be converted between dos/unix/mac line-ending variants with
       the -fromdos, -frommac, -todos and -tomac options.  The default (when
       none is specified) is Unix.
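
       The line-ending options can be pictured with the following Python
       sketch.  It assumes the usual conventions (dos = CR LF, unix = LF,
       mac = CR as on classic Mac OS); the function name is hypothetical and
       this page does not say how uniconv itself treats mixed input.

```python
# Assumed line terminators for each variant.
ENDINGS = {"dos": b"\r\n", "unix": b"\n", "mac": b"\r"}


def convert_line_endings(data, source="unix", target="unix"):
    """Rewrite the line terminators in data from one variant to another."""
    if source not in ENDINGS or target not in ENDINGS:
        raise ValueError("variant must be one of: dos, unix, mac")
    # Normalise to the Unix form first, then expand to the target form.
    normalised = data.replace(ENDINGS[source], b"\n")
    return normalised.replace(b"\n", ENDINGS[target])


unix_text = convert_line_endings(b"one\r\ntwo\r\n", source="dos")
mac_text = convert_line_endings(b"one\ntwo\n", target="mac")
```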

ENCODING

       If you received this program through the Yudit distribution, then as of
       today you can convert between the encodings below.

       utf-8  Yudit recommends this format for international information
              exchange.  ASCII text will get through intact, while other
              Unicode characters are encoded as multi-byte sequences in
              which every byte has its 8th bit set; the length of the
              sequence depends on how far away the character is in the
              Unicode space.  This is the only transformation format that
              can encode both 16-bit (ucs-2) and 31-bit (ucs-4) Unicode.
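
       The relation between code point and encoded length can be checked
       with Python's built-in utf-8 codec, used here only as a stand-in for
       uniconv.  Note that Python's codec stops at U+10FFFF, so the sketch
       does not cover the longer 31-bit forms.

```python
# One sample each from the 1-, 2-, 3- and 4-byte ranges.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}

# ASCII passes through unchanged.
ascii_intact = "A".encode("utf-8") == b"A"

# Every byte of a non-ASCII character has its 8th bit set.
high_bits_set = all(byte & 0x80 for byte in "\u20ac".encode("utf-8"))
```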

       utf-8-s
              Hacker's utf-8 format.  It does not give an error message
              when a surrogate pair is decoded, and it can encode a
              surrogate pair ’as is’.  This is not a recommended encoding
              format, although it is used to encode/decode clipboard data
              in order to preserve input.
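
       Python's "surrogatepass" error handler behaves in a comparable way
       and can serve as an illustration.  This is an analogy only; it is
       not a claim that utf-8-s produces byte-for-byte the same output.

```python
# The surrogate pair for U+1F600, kept as two separate code units.
pair = "\ud83d\ude00"

# A strict utf-8 encoder refuses surrogates.
try:
    pair.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# With surrogatepass each surrogate is encoded 'as is' as three bytes.
as_is = pair.encode("utf-8", "surrogatepass")
round_trip = as_is.decode("utf-8", "surrogatepass")
```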

       utf-16 Although 16 is bigger than 8, this is still a compromise
              required by OSes like Windows that cannot handle ucs-4.
              This encoding produces 16-bit Unicode streams.  In
              addition to the BMP it can convert 16 planes using the
              Unicode Surrogate Area.  This encoding cannot convert
              anything above U+10FFFF (Plane 16).  The input byte order
              is recognized from the first two bytes, which hold the
              BOM (byte order mark) U+FEFF.  This format is used in
              Windows NT for documents like notepad .txt files.
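
       The surrogate arithmetic and the byte-order-mark check described
       for utf-16 can be sketched in Python as follows.  The function names
       are hypothetical, and the sketch shows the standard UTF-16 rules
       rather than uniconv's own source code.

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-encoded range")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def detect_byte_order(data):
    """Return 'big' or 'little' from a leading U+FEFF, or None without one."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None


high, low = to_surrogates(0x1F600)
order = detect_byte_order(b"\xff\xfeA\x00")
```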

       utf-16-be
              Big endian utf-16 converter.

       utf-16-le
              Little endian utf-16 converter.

       utf-7  This is the recommended format for international
              information exchange when only 7-bit data can be used.  It
              can only handle 16-bit (utf-16) Unicode; for ucs-4 (above
              U+10FFFF) you should use utf-8 encoding.
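
       A quick way to see the 7-bit property is Python's built-in utf-7
       codec, used here as a stand-in for uniconv; the sample string is
       arbitrary.

```python
text = "caf\u00e9"

encoded = text.encode("utf-7")

# Every output byte fits in 7 bits, so the data survives 7-bit channels.
seven_bit_clean = all(byte < 0x80 for byte in encoded)

decoded = encoded.decode("utf-7")
```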

       iso-8859-1
              This  is the ISO 8859-1 character  encoding format. It is
              also known as "Latin-1" encoding.

       iso-8859-2
              This  is  the ISO 8859-2 character encoding format. It is
              also known as "Central European" encoding.

       iso-8859-5
              This  is  the ISO 8859-5 character encoding format. It is
              also known as "Cyrillic" encoding.

       iso-8859-7
              This is the ISO 8859-7 character encoding format.  It  is
              also known as "Greek" encoding.

       iso-8859-9
              This  is  the ISO 8859-9 character encoding format. It is
              also known as "Turkish" encoding.

       koi8-r This is the  KOI8-R  character  encoding  format.  It  is
              mainly used in Russia.

       cp-1251
              This is the CP1251 Cyrillic character encoding format. It
              is mainly used in Microsoft Windows and some web sites.

       iso-2022-jp
              This is a Japanese character encoding  format.  It  is  a
              7-bit encoding format.

       iso-2022-jp-3
              This is a Japanese character encoding format.  It is a
              7-bit encoding format.  It is based upon the JIS X 0213
              standard.

       euc-jp This  is  a  Japanese character encoding format. It is an
              8-bit encoding format.  Mainly used in UNIX systems.

       euc-jp-3
              The official name is EUC-JISX0213 - I just could not read
              this.  This is a Japanese character encoding format.  It
              is an 8-bit encoding format.  It is based upon the JIS X
              0213 standard.

       shift-jis
              This  is  a Japanese character encoding format.  It is an
              8-bit encoding format. Mainly used in MSDOS/Windows.

       shift-jis-3
              The official name is Shift_JISX0213 - I  just  could  not
              read this.  This is a Japanese character encoding format.
              It  is  an  8-bit  encoding  format.   Mainly   used   in
              MSDOS/Windows.

       iso-2022-jp
              This is a Japanese 7-bit character encoding format.
              Email messages in iso-2022-jp can be decoded and encoded
              with this format.

       iso-2022-x11
              This   is  a  Japanese  character encoding format.  It is
              also known as "COMPOUND_TEXT" encoding for the X   Window
              System.  This  is  a  7-bit  encoding  format.  It can be
              derived  from  the   ISO   2022-JP   format   with   some
              differences.

       ksc-5601-x11
              This is a Korean character encoding format used by the X
              Window System (COMPOUND_TEXT encoding) to encode Korean
              (KS X 1001) and US-ASCII.  This is a 7-bit encoding
              format compliant with the ISO-2022 specification for
              encoding multiple character sets.  Please note that this
              is DIFFERENT from ISO-2022-KR (defined in IETF RFC 1557).

       euc-kr This is an 8-bit multibyte encoding for Korean.  It
              encodes US-ASCII (7-bit) in the single-byte range and
              characters in KS X 1001 (formerly KS C 5601) in the
              double-byte range with the MSB on (8-bit).  It is used on
              Unix and the Internet.  Korean versions of MS-DOS, MacOS
              and MS-Windows use a compatible (in most cases identical)
              variant of this encoding.
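
       The single-byte and double-byte ranges can be observed with Python's
       built-in euc-kr codec, which stands in for uniconv here; the sample
       characters are arbitrary.

```python
ascii_bytes = "A".encode("euc-kr")
hangul_bytes = "\ud55c".encode("euc-kr")  # HANGUL SYLLABLE HAN, U+D55C

# US-ASCII stays a single 7-bit byte.
ascii_single = len(ascii_bytes) == 1 and ascii_bytes[0] < 0x80

# A KS X 1001 character takes two bytes, both with the MSB on.
hangul_double = len(hangul_bytes) == 2 and all(b & 0x80 for b in hangul_bytes)
```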

       johab  This is a Korean encoding specified in KS X 1001 (KS C
              5601-1992), Annex 3, as a supplementary encoding.  It was
              widely used in Korean MS-DOS until the mid-1990s.  It can
              encode all 11,172 Hangul syllables of modern Korean as
              well as all the special symbols and Hanja (Chinese
              ideograms used in Korea) defined in KS X 1001.

       uhc    A variant of EUC-KR used in Korean MS-Windows 95/98 (a
              proprietary Microsoft encoding, CP949).  Its character
              repertoire includes all modern syllables of Hangul, the
              Korean script, as well as all the special symbols and
              Hanja (Chinese ideograms used in Korea) defined in KS X
              1001.

       gb-18030
              This is a Chinese character encoding format based upon GB
              18030.   It  encodes  the  whole  U+0000..U+10FFFF range,
              while being compatible with gb-2312.
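
       Both properties, compatibility with gb-2312 and coverage up to
       U+10FFFF, can be checked with Python's built-in codecs, which stand
       in for uniconv here; the sample characters are arbitrary.

```python
han = "\u4e2d"           # a character that is present in GB 2312
emoji = "\U0001F600"     # a character outside the BMP

# A GB 2312 character keeps the same two bytes in GB 18030.
same_bytes = han.encode("gb18030") == han.encode("gb2312")

# Characters beyond the BMP are representable, using four bytes.
emoji_bytes = emoji.encode("gb18030")
emoji_round_trip = emoji_bytes.decode("gb18030")
```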

       gb-2312-x11
              This is a Chinese character encoding format based upon GB
              2312.  It is a 7-bit encoding format.

       gb-2312
              This is a Chinese character encoding format based upon GB
              2312.  It is an 8-bit encoding format.

       big-5  This is a Chinese character encoding  format  based  upon
              BIG5 encoding.  It is an 8-bit encoding format.

       hz     This  is  a  Chinese character encoding format based upon
              "Hanzi" encoding.  It is a 7-bit encoding format.

       viscii This is a Vietnamese character encoding format.

       ucs-2-be
              This converts 16-bit Unicode (ucs-2) streams.  This format
              handles the big-endian variant.  Yudit does not recommend
              this format.

       ucs-2-le
              This converts 16-bit Unicode (ucs-2) streams.  This format
              handles the little-endian variant.  Yudit does not
              recommend this format.

       ucs-2  This converts 16-bit Unicode (ucs-2) streams.  The input
              byte order is recognized from the first two bytes, which
              hold the BOM (byte order mark) U+FEFF.  Yudit does not
              recommend this format.

       java   This converts \uxxxx character escapes.  When encoding,
              all characters above U+0080 will be escaped with a string
              like ’\u0080’.  When decoding, the same format is decoded
              but, in addition, utf-8 format is also recognized, so it
              can also be used to recover data accidentally saved with
              the wrong encoding.  The U+10000..U+10FFFF area is
              converted to surrogates and vice versa.
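
       A minimal Python sketch of the encoding direction follows.  It
       assumes that U+0080 itself is escaped (as the ’\u0080’ example
       suggests) and that lower-case hex digits are written; neither detail
       is confirmed by this page, and the function name is hypothetical.

```python
def java_escape(text):
    """Escape non-ASCII characters as \\uxxxx, using surrogates above U+FFFF."""
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is copied through unchanged.
            pieces.append(ch)
        elif cp < 0x10000:
            pieces.append(f"\\u{cp:04x}")
        else:
            # U+10000..U+10FFFF becomes a high and a low surrogate escape.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            pieces.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(pieces)


escaped = java_escape("a\u00e9\U0001F600")
```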

       java-s This converts \uxxxx character escapes.  When encoding,
              all characters above U+0080 will be escaped with a string
              like ’\u0080’.  When decoding, the same format is decoded
              but, in addition, utf-8 format is also recognized, so it
              can also be used to recover data accidentally saved with
              the wrong encoding.  Surrogates are not treated specially
              during conversion, which is why this is not a recommended
              conversion.

FILES

       ~/.yudit/yudit.properties or
       /usr/share/yudit/config/yudit.properties
              can have a yudit.datapath property.  This is where the map
              files are kept.  By default /usr/share/yudit/data is
              searched.

SEE ALSO

        makeumap

AUTHOR

       This program  was written by  gsinai@yudit.org  (Gaspar  Sinai),
       Tokyo, 2 January, 2001.