NAME
uniconv - convert text to native formats through unicode
SYNOPSIS
uniconv [ -out output-file ] [ -decode input-encoding ]
[ -encode output-encoding ] [ input-file ] [ -todos ] [ -fromdos ]
[ -tomac ] [ -frommac ]
DESCRIPTION
The uniconv program decodes text with a certain encoding and encodes
it with some other encoding. The text is a 16-, 8- or 7-bit byte stream.
The converted text is sent to the standard output, even in the case
of 16-bit encodings, unless an output file is specified with the -out
option.
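The decode-then-encode flow described above can be sketched in a few
lines of Python. This is only an illustration of the idea, not uniconv's
implementation: the function name convert is hypothetical, and the
encoding names are Python codec names, which are assumed to correspond
to the uniconv converters of the same name.

```python
def convert(data: bytes, decode: str = "utf-8", encode: str = "utf-8") -> bytes:
    """Decode bytes with one encoding, then encode the result with another.

    Both parameters default to utf-8, mirroring the default converter
    described in the text. The names are Python codec names (assumption).
    """
    text = data.decode(decode)   # input bytes -> Unicode code points
    return text.encode(encode)   # Unicode code points -> output bytes


# A Hungarian word containing letters that exist in ISO 8859-2.
word = "\u00e1rv\u00edzt\u0171r\u0151"
latin2 = word.encode("iso-8859-2")
utf8 = convert(latin2, decode="iso-8859-2", encode="utf-8")
```

Each of the four accented letters takes one byte in iso-8859-2 and two
bytes in utf-8, so the converted stream is longer than the input.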
The -decode and -encode options are optional; the default converter is
utf-8. The program reads the Unicode map helper files (*.my) from the
default directory /usr/share/yudit/data. Simple 1-to-1 encodings can be
added on the fly by adding a my-file there, or by setting your
yudit.datapath property in ~/.yudit/yudit.properties or
/usr/share/yudit/config/yudit.properties to the directory that holds
the my-file.
My-files can be created by a program called makeumap. The files can be
converted between dos/unix/mac line-ending variants with the -fromdos,
-frommac, -todos and -tomac options. The default (when none is
specified) is Unix.
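The line-ending variants differ only in the terminator: LF for Unix,
CR LF for DOS and CR for the classic Mac format. A minimal Python sketch
of such a conversion follows. The helper names to_unix and from_unix are
hypothetical, and the sketch does not claim to reproduce the exact
behaviour of the -fromdos, -frommac, -todos and -tomac options.

```python
# Line terminators for the three variants named in the text.
ENDINGS = {"unix": "\n", "dos": "\r\n", "mac": "\r"}


def to_unix(text: str) -> str:
    """Normalise DOS (CR LF) and classic Mac (CR) line endings to LF."""
    # Replace CR LF first so that its CR is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def from_unix(text: str, style: str) -> str:
    """Convert LF-terminated text to the requested line-ending style."""
    return text.replace("\n", ENDINGS[style])


mixed = "one\r\ntwo\rthree\n"
unix_text = to_unix(mixed)
dos_text = from_unix(unix_text, "dos")
```

Normalising to Unix first and then converting to the target keeps the
two directions independent of each other.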
ENCODING
If you received this program through the Yudit distribution, then as of
today you can convert between the encodings below.
utf-8 Yudit recommends this format for international information
exchange. ASCII text will get through intact, while other
unicode characters will get their 8th bit set and the length of
the code will depend on how far away they are in the Unicode
space. This is the only transformation format that can encode
both 16-bit (ucs-2) and 31-bit (ucs-4) unicode.
utf-8-s Hacker's utf-8 format: it does not give an error message
when a surrogate pair is decoded, and it can encode a surrogate
pair ’as is’. This format is not recommended, although it is
used to encode and decode clipboard data in order to preserve
input.
utf-16 Although 16 is bigger than 8, this is still a compromise
required by OSes like Windows that cannot handle ucs-4:
this encoding produces 16-bit unicode streams. In
addition to the BMP it can convert 16 planes using the
Unicode Surrogate Area. This encoding cannot convert
anything above U+10FFFF (Plane 16). The input byte order
is recognized by the first two bytes, the BOM (byte-order
mark) U+FEFF. This format is used in Windows NT for
documents like notepad .txt files.
utf-16-be
Big endian utf-16 converter.
utf-16-le
Little endian utf-16 converter.
utf-7 This is the recommended format for international
information exchange when only 7-bit data can be used. It
can only handle 16-bit (utf-16) unicode; for ucs-4 (above
U+10FFFF) you should use utf-8 encoding.
iso-8859-1
This is the ISO 8859-1 character encoding format. It is
also known as "Latin-1" encoding.
iso-8859-2
This is the ISO 8859-2 character encoding format. It is
also known as "Central European" encoding.
iso-8859-5
This is the ISO 8859-5 character encoding format. It is
also known as "Cyrillic" encoding.
iso-8859-7
This is the ISO 8859-7 character encoding format. It is
also known as "Greek" encoding.
iso-8859-9
This is the ISO 8859-9 character encoding format. It is
also known as "Turkish" encoding.
koi8-r This is the KOI8-R character encoding format. It is
mainly used in Russia.
cp-1251
This is the CP1251 cyrillic character encoding format. It
is mainly used in Microsoft Windows and some web sites.
iso-2022-jp
This is a Japanese character encoding format. It is a
7-bit encoding format.
iso-2022-jp-3
This is a Japanese character encoding format. It is a
7-bit encoding format. It is based upon the JIS X 0213
standard.
euc-jp This is a Japanese character encoding format. It is an
8-bit encoding format. Mainly used in UNIX systems.
euc-jp-3
The official name is EUC-JISX0213 - I just could not read
this. This is a Japanese character encoding format. It
is an 8-bit encoding format. It is based upon the JIS X 0213
standard.
shift-jis
This is a Japanese character encoding format. It is an
8-bit encoding format. Mainly used in MSDOS/Windows.
shift-jis-3
The official name is Shift_JISX0213 - I just could not
read this. This is a Japanese character encoding format.
It is an 8-bit encoding format. Mainly used in
MSDOS/Windows.
iso-2022-jp
This is a Japanese 7-bit character encoding format.
Email messages in iso-2022-jp can be decoded and encoded
with this format.
iso-2022-x11
This is a Japanese character encoding format. It is
also known as "COMPOUND_TEXT" encoding for the X Window
System. This is a 7-bit encoding format. It can be
derived from the ISO 2022-JP format with some
differences.
ksc-5601-x11
This is a Korean character encoding format used by the
X Window System (COMPOUND_TEXT encoding) to encode
Korean (KS X 1001) and US-ASCII. This is a 7-bit encoding
format compliant with the ISO-2022 specification for
encoding multiple character sets. Please note that this is
DIFFERENT from ISO-2022-KR (defined in IETF RFC 1557).
euc-kr This is an 8-bit multibyte encoding for Korean. It
encodes US-ASCII (7-bit) in the single-byte range and
characters in KS X 1001 (formerly KS C 5601) in the
double-byte range with the MSB on (8-bit). It is used on
Unix and on the Internet. The Korean versions of MS-DOS,
MacOS and MS-Windows use a compatible (in most cases,
identical) variant of this encoding.
johab This is a Korean encoding specified in KS X
1001 (KS C 5601-1992), Annex 3 as a supplementary
encoding. It was widely used in Korean MS-DOS until the
mid-1990s. It can encode all Hangul syllables (11,172) of
modern Korean as well as all the special symbols and Hanja
(Chinese ideograms used in Korea) defined in KS X 1001.
uhc A variant of EUC-KR used in Korean MS-Windows
95/98 (a proprietary encoding of Microsoft, CP949). Its
character repertoire includes all modern syllables of
Hangul, the Korean script, as well as all the special
symbols and Hanja (Chinese ideograms used in Korea)
defined in KS X 1001.
gb-18030
This is a Chinese character encoding format based upon GB
18030. It encodes the whole U+0000..U+10FFFF range,
while being compatible with gb-2312.
gb-2312-x11
This is a Chinese character encoding format based upon GB
2312. It is a 7-bit encoding format.
gb-2312
This is a Chinese character encoding format based upon GB
2312. It is an 8-bit encoding format.
big-5 This is a Chinese character encoding format based upon
BIG5 encoding. It is an 8-bit encoding format.
hz This is a Chinese character encoding format based upon
"Hanzi" encoding. It is a 7-bit encoding format.
viscii This is a Vietnamese character encoding format.
ucs-2-be
This converts 16-bit unicode (ucs-2) streams. The format
handles the big-endian variant. Yudit does not
recommend this format.
ucs-2-le
This converts 16-bit unicode (ucs-2) streams. The format
handles the little-endian variant. Yudit does not
recommend this format.
ucs-2 This converts 16-bit unicode (ucs-2) streams. The input
byte order is recognized by the first two bytes, the BOM
(byte-order mark) U+FEFF. Yudit does not recommend this
format.
java This converts \uxxxx character escapes. When encoding,
all characters above U+0080 will be escaped with a string
like ’\u0080’. When decoding, the same format is decoded
but, in addition, utf-8 format is also recognized, so it
can also be used to recover data accidentally saved with
the wrong encoding. The U+10000..U+10FFFF area is
converted to surrogates and vice versa.
java-s This converts \uxxxx character escapes. When encoding,
all characters above U+0080 will be escaped with a string
like ’\u0080’. When decoding, the same format is decoded
but, in addition, utf-8 format is also recognized, so it
can also be used to recover data accidentally saved with
the wrong encoding. Surrogates are not treated specially
during conversion, which is why it is not a recommended
conversion.
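Several properties mentioned in the list above can be checked with a
short Python sketch: the variable length of utf-8 codes, the utf-16
byte-order mark, the surrogate pairs used for U+10000..U+10FFFF, and
\uxxxx escapes. All function names here are hypothetical and the code
is not taken from uniconv. Two details are assumptions: U+0080 itself
is treated as escaped, and the hexadecimal digits are written in lower
case.

```python
from typing import Tuple


def utf8_length(cp: int) -> int:
    """Number of bytes utf-8 needs for one (non-surrogate) code point."""
    return len(chr(cp).encode("utf-8"))


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a utf-16 stream from a leading U+FEFF."""
    if data[:2] == b"\xfe\xff":
        return "be"
    if data[:2] == b"\xff\xfe":
        return "le"
    return "unknown"


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF")
    offset = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def java_escape(text: str) -> str:
    """Escape non-ASCII characters as \\uxxxx, using surrogate pairs
    above U+FFFF (assumption: U+0080 is escaped, hex is lower case)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            high, low = to_surrogates(cp)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


lengths = [utf8_length(cp) for cp in (0x41, 0xE9, 0x20AC, 0x1F600)]
pair = to_surrogates(0x1F600)
escaped = java_escape("a\u00e9\U0001F600")
```

The lengths list shows how the utf-8 code grows with the distance of the
character from the ASCII range, and pair shows the two 16-bit units that
utf-16 uses for a character outside the BMP.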
FILES
~/.yudit/yudit.properties or
/usr/share/yudit/config/yudit.properties
can have a yudit.datapath property. This is where the map
files are kept. By default /usr/share/yudit/data is
searched.
SEE ALSO
makeumap
AUTHOR
This program was written by gsinai@yudit.org (Gaspar Sinai),
Tokyo, 2 January, 2001.