NAME
kcc - Kanji code converter with automatic encoding detection
SYNOPSIS
kcc [ -IOchnvxz ] [ -b bufsize ] [ file ] ...
DESCRIPTION
kcc is a filter that reads each file sequentially, converts its kanji
encoding, and writes the result to stdout. If no file is specified, or
if - is given as a filename, it reads from stdin. You can specify the
kanji encodings for input and output. If you do not specify an input
encoding, kcc detects it automatically.
Available kanji encodings are JIS (7-bit and/or 8-bit), Shift JIS, EUC,
and DEC. The input may mix 7-bit JIS with any one of EUC, DEC, or
Shift JIS. Both SI/SO and ESC(I are recognized as designating JIS
halfwidth kana.
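To make the differences between these encodings concrete, the following Python sketch encodes one word three ways. It uses Python's standard codecs iso2022_jp, shift_jis, and euc_jp as stand-ins for kcc's 7-bit JIS, Shift JIS, and EUC. That pairing is an assumption made for illustration, and Python's codec tables may differ from kcc's in edge cases. DEC kanji is omitted because Python has no standard codec for it. The name encode_all is hypothetical.

```python
# Illustration only: Python's standard codecs stand in for kcc's encodings.
TEXT = "漢字"

CODECS = {
    "7-bit JIS": "iso2022_jp",
    "Shift JIS": "shift_jis",
    "EUC": "euc_jp",
}


def encode_all(text: str) -> dict:
    """Encode text in each encoding and return the raw bytes keyed by name."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}


if __name__ == "__main__":
    # Print one hex dump per encoding so the byte layouts can be compared.
    for name, raw in encode_all(TEXT).items():
        print(f"{name:10s} {raw.hex(' ')}")
```

In the dump, the 7-bit JIS bytes between the escape sequences and the EUC bytes differ only in the high bit, while Shift JIS arranges the same characters differently.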
OPTIONS
-O
-IO I selects the input kanji encoding and O the output kanji encoding.
When no input encoding is specified, it is detected automatically.
If neither an input nor an output encoding is specified, the output
encoding is 7-bit JIS.
Specify one of the following for the input encoding option, I:
e EUC (may be mixed with 7-bit JIS)
d DEC (may be mixed with 7-bit JIS)
s Shift JIS (may be mixed with 7-bit JIS)
j, 7, or k
7-bit JIS
8 8-bit JIS
Specify one of the following for the output encoding option, O:
e EUC
d DEC
s Shift JIS
jXY or 7XY
7-bit JIS (using SI/SO for JIS kana designation)
kXY 7-bit JIS (using ESC(I for JIS kana designation)
8XY 8-bit JIS
The XY suffix of the O option selects which escape sequences are
used in JIS output. The default is BJ. The designation for
supplemental kanji is fixed to ESC$(D.
X Kanji are designated by:
B ESC$B (JIS X0208-1983)
@ ESC$@ (JIS X0208-1978)
+ ESC&@ESC$B (JIS X0208-1990)
Y Alphanumerics are designated by:
B ESC(B (ASCII)
J ESC(J (JIS Roman; JIS X0201)
H ESC(H (Swedish; strongly deprecated)
-v Print the result of input encoding detection to stderr.
-x Extended mode. During automatic detection of the input encoding,
also recognize user-defined characters and extended character
regions (the user-defined area of EUC, undefined halfwidth kana
of EUC, the C1 control area of EUC, and the extended character
region of Shift JIS). DEC and EUC are distinguished in this mode.
-z Shrink mode. Do not recognize halfwidth kana (except in 7-bit JIS)
during input encoding detection. For files without halfwidth
kana, this option makes automatic detection of the input encoding
much more accurate.
-h Normally, halfwidth kana become fullwidth katakana when converted
to DEC. With this option, they become hiragana.
-n User-defined characters, extended characters, and supplemental
kanji are converted to a fullwidth white box, and the undefined
region of halfwidth kana is converted to a halfwidth centered
dot.
-b bufsize
Specify the buffer size. The default is 8 kbytes.
-c Do not convert; only check the input encoding and print the result
to stdout. Unlike normal automatic detection, the whole content
of the file is checked. However, when an inconsistency between
encodings is found, reading is aborted and "data" is printed.
Options other than -x and -z are ignored.
EXAMPLES
% kcc -e file
The input encoding is detected automatically, and the output is in
EUC.
% kcc -sj file1 file2
Two Shift JIS files are concatenated and converted to 7-bit JIS.
% command | kcc -k+J
The output of command is converted to 7-bit JIS (JIS X0208 kanji
designated by ESC&@ESC$B, alphanumerics as JIS Roman, and halfwidth
kana designated by ESC(I).
% kcc -c file
The encoding of file is detected and reported; no conversion is done.
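The XY suffix used in the -k+J example can be read as a small lookup. The Python sketch below is a hypothetical helper (the name jis_designations and its return shape are not part of kcc) that maps a two-letter suffix to the escape sequences listed under OPTIONS. It assumes both letters are always given; whether kcc accepts a partial suffix is not stated here.

```python
ESC = b"\x1b"

# X: kanji designation, as listed under OPTIONS.
KANJI = {
    "B": ESC + b"$B",
    "@": ESC + b"$@",
    "+": ESC + b"&@" + ESC + b"$B",
}

# Y: alphanumeric designation, as listed under OPTIONS.
ALNUM = {
    "B": ESC + b"(B",
    "J": ESC + b"(J",
    "H": ESC + b"(H",
}


def jis_designations(xy: str = "BJ") -> tuple:
    """Return (kanji escape, alphanumeric escape) for a two-letter XY suffix.

    Hypothetical helper for illustration; "BJ" mirrors the documented default.
    """
    if len(xy) != 2 or xy[0] not in KANJI or xy[1] not in ALNUM:
        raise ValueError(f"unknown XY suffix: {xy!r}")
    return KANJI[xy[0]], ALNUM[xy[1]]


if __name__ == "__main__":
    kanji, alnum = jis_designations("+J")
    print(kanji, alnum)
```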
BUG
Automatic detection of the input encoding works well in ordinary cases,
but it has the following problems.
7-bit JIS is recognized reliably by its escape sequences. EUC and DEC
are treated as the same (referred to as the EUC series). Halfwidth kana
in 8-bit JIS are the same as halfwidth kana in Shift JIS (referred to as
the Shift JIS series). However, the EUC series and the Shift JIS series,
which are both 8-bit encodings, share large code regions. The difficulty
in automatic detection is therefore telling these two apart.
Detection of EUC series versus Shift JIS series is done line by line.
The encoding is determined as soon as the input is found not to be
Shift JIS series, or not to be EUC series. When an inconsistency is
found, the input is treated as "data" and the contents of the output
are not guaranteed.
While the choice between EUC series and Shift JIS series is still
undecided after an 8-bit code has been found, conversion is deferred
and the input is held in a buffer. If the buffer becomes full, kcc
assumes EUC series and starts converting. The rationale is as follows.
Documents containing kanji usually include JIS non-kanji or JIS level 1
kanji, and in Shift JIS these do not share a region with EUC, so Shift
JIS can be detected with certainty. If the encoding cannot be
determined, it is therefore very likely to be EUC.
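The line-by-line procedure above can be sketched in Python. This is not kcc's implementation: it uses strict decoding with Python's euc_jp and shift_jis codecs as the validity test for each series, and it counts pending bytes against a bufsize parameter that mirrors the documented 8 kbyte default. The names decodes_as and guess_series, and the exact result strings, are hypothetical.

```python
def decodes_as(data: bytes, codec: str) -> bool:
    """Return True if data is valid under the given Python codec."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


def guess_series(lines, bufsize: int = 8192) -> str:
    """Guess EUC series vs Shift JIS series from an iterable of byte lines."""
    pending = 0  # bytes held back since the first undecided 8-bit line
    for line in lines:
        if pending == 0 and all(b < 0x80 for b in line):
            continue  # pure 7-bit line: nothing to decide yet
        euc_ok = decodes_as(line, "euc_jp")
        sjis_ok = decodes_as(line, "shift_jis")
        if not euc_ok and not sjis_ok:
            return "data"  # inconsistent input
        if euc_ok != sjis_ok:
            return "EUC series" if euc_ok else "Shift JIS series"
        # Valid in both series: keep the line pending and read on.
        pending += len(line)
        if pending >= bufsize:
            return "EUC series (assumed, buffer full)"
    return "undetermined"


if __name__ == "__main__":
    print(guess_series(["漢字\n".encode("shift_jis")]))
    print(guess_series(["漢字\n".encode("euc_jp")]))
```

A line made only of halfwidth kana bytes in even-length runs stays valid in both series, so the sketch keeps buffering until it falls back to EUC, as the text describes.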
If the input is 8-bit JIS and its halfwidth kana always occur in
sequences of even length, it will be wrongly detected as EUC kanji. Be
careful.
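The even-length pitfall can be reproduced with a few bytes. In the Python sketch below, halfwidth kana are encoded with Python's shift_jis codec (their single bytes are the same as in 8-bit JIS, as noted earlier in this section) and then read back with euc_jp. The codecs are stand-ins chosen for illustration, not kcc's own detection code, and reads_as_euc is a hypothetical helper.

```python
def reads_as_euc(raw: bytes):
    """Return the decoded text if raw is valid EUC-JP, otherwise None."""
    try:
        return raw.decode("euc_jp")
    except UnicodeDecodeError:
        return None


# Each halfwidth kana is a single byte in the range 0xA1-0xDF.
even = "ｱｲｳｴ".encode("shift_jis")  # four bytes
odd = "ｱｲｳ".encode("shift_jis")    # three bytes

print(even.hex(" "), "->", reads_as_euc(even))
print(odd.hex(" "), "->", reads_as_euc(odd))
```

The four-byte run pairs up into two valid EUC kanji, while the three-byte run leaves a dangling byte and is rejected, which is why only even-length sequences are misdetected.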
If the input contains no halfwidth kana, use -z and detection becomes
much more accurate, because the shared region is then restricted to the
area of JIS level 2 kanji.
The extended region of Shift JIS, the user-defined area of EUC, the C1
control characters of EUC, and the undefined region of EUC halfwidth
kana are outside the scope of automatic detection, so detection fails
if the input contains these characters. Use the -x option to select
extended mode, or specify the input encoding explicitly.
SEE ALSO
cat(1)
NOTES
Usually, user-defined characters, extended characters, and supplemental
kanji are each mapped to their counterparts. However, characters that
are out of the range of extended characters become FCFC (hexadecimal)
when converted to Shift JIS. The C1 control characters of EUC and DEC
are kept when converted to JIS, but are deleted when converted to
Shift JIS. The undefined area of halfwidth kana becomes a halfwidth
centered dot when converted to Shift JIS. Halfwidth kana become
fullwidth kana when converted to DEC.
When the output is JIS, control characters such as newline, TAB, and
DEL, as well as the (halfwidth) space, are output in ASCII mode.
When the input encoding is detected wrongly, or the input contains
characters undefined in the expected character set, the output is
undefined.
This manual was translated by Fumitoshi UKAI <ukai@debian.or.jp> for the
Debian system, but you can use it for any purpose.