NAME

       kcc - Kanji code converter with encoding auto detection

SYNOPSIS

       kcc [ -IOchnvxz ] [ -b bufsize ] [ file ] ...

DESCRIPTION

       kcc is a filter that reads each file sequentially, converts its kanji
       encoding, and writes the result to stdout.  If no file is specified,
       or - is given as a filename, it reads from stdin.  You can specify the
       kanji encodings for input and output.  If you don’t specify an input
       encoding, kcc detects it automatically.

       Available kanji encodings are JIS (7 bit and/or 8 bit), Shift JIS,
       EUC, and DEC.  Input may mix encodings, as long as the mix is one of
       EUC, DEC, or Shift JIS together with 7 bit JIS.  Both SI/SO and ESC(I
       are recognized as designating JIS halfwidth kana.
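
       To make the difference between these encodings concrete, the sketch
       below encodes the same two kanji with Python's standard codecs.  The
       codec names (iso2022_jp, euc_jp, shift_jis) are Python's, not kcc's,
       and are assumed here to correspond to 7 bit JIS, EUC, and Shift JIS.

```python
# The same two-kanji string in three of the encodings discussed above.
# Codec names are Python's own; kcc itself is not involved here.
text = "漢字"

encoded = {name: text.encode(name)
           for name in ("iso2022_jp", "euc_jp", "shift_jis")}

for name, data in encoded.items():
    # Print each byte as two hex digits, separated by spaces.
    print(f"{name:<11} {data.hex(' ')}")
```

       The 7 bit JIS form is bracketed by escape sequences (ESC$B ... ESC(B),
       while the EUC and Shift JIS forms are bare 8 bit bytes, which is why
       only the latter two need the heuristics described under BUG.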

OPTIONS

       -O
       -IO    I is the input kanji encoding, O is the output kanji encoding.
              When no input encoding is specified, it is detected
              automatically.  When neither the input nor the output encoding
              is specified, the output encoding is 7 bit JIS.

              You can specify one of the following for the input encoding
              option, I.

                 e      EUC (may be mixed with 7 bit JIS)
                 d      DEC (may be mixed with 7 bit JIS)
                 s      Shift JIS (may be mixed with 7 bit JIS)
                 j, 7 or k
                        7 bit JIS
                 8      8 bit JIS

              You can specify one of the following for the output encoding
              option, O.

                 e      EUC
                 d      DEC
                 s      Shift JIS
                 jXY or 7XY
                        7 bit JIS (using SI/SO for JIS kana designation)
                 kXY    7 bit JIS (using ESC(I for JIS kana designation)
                 8XY    8 bit JIS

              The XY suffix of the O option selects which escape sequences
              are used in JIS output.  The default is BJ.  The supplemental
              kanji designation is fixed to ESC$(D.

                 X      Kanji is designated by:
                      B      ESC$B (JIS X0208-1983)
                      @      ESC$@ (JIS X0208-1978)
                      +      ESC&@ESC$B (JIS X0208-1990)
                 Y      Alphanumerics are designated by:
                      B      ESC(B (ASCII)
                      J      ESC(J (JIS Roman; JIS X0201)
                      H      ESC(H (Swedish; strongly deprecated)
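
              As an illustration only, the XY table can be written as a
              lookup from each letter to its escape bytes.  The names KANJI,
              ALNUM and designations are hypothetical and are not part of
              kcc; the byte values are taken from the table above.

```python
from typing import Tuple

ESC = b"\x1b"

# X: kanji designation, Y: alphanumeric designation (from the table above).
KANJI = {"B": ESC + b"$B", "@": ESC + b"$@", "+": ESC + b"&@" + ESC + b"$B"}
ALNUM = {"B": ESC + b"(B", "J": ESC + b"(J", "H": ESC + b"(H"}


def designations(xy: str = "BJ") -> Tuple[bytes, bytes]:
    """Return (kanji escape, alphanumeric escape) for an XY suffix.

    Hypothetical helper: kcc parses XY internally; this only mirrors
    the documented mapping, with BJ as the documented default.
    """
    x, y = xy  # exactly two characters are expected
    return KANJI[x], ALNUM[y]


print(designations())      # default BJ
print(designations("+J"))  # as in the "kcc -k+J" example
```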

       -v     Print the result of input encoding detection to stderr.

       -x     Extension mode.  During auto detection of the input encoding,
              also recognize user-defined characters and extended regions
              (the user-defined area of EUC, undefined halfwidth kana, the C1
              control character area, and the extended character region of
              Shift JIS).  DEC and EUC are distinguished only in this mode.

       -z     Shrink mode.  Don’t recognize halfwidth kana (except in 7 bit
              JIS) during input encoding detection.  With this option, auto
              detection of the input encoding becomes much more accurate for
              files without halfwidth kana.

       -h     Normally, halfwidth kana become fullwidth katakana when
              converted to DEC.  With this option, they become hiragana.

       -n     User-defined characters, extended characters, and supplemental
              kanji characters are converted to a fullwidth white box, and
              the undefined region of halfwidth kana is converted to a
              halfwidth centered dot.

       -b bufsize
              Specify the buffer size.  The default is 8 kbytes.

       -c     Don’t convert; check the input encoding and print the result to
              stdout.  Unlike normal auto detection, the whole contents of
              the file are checked.  However, when an inconsistency between
              encodings is found, kcc stops reading and prints "data".  All
              options except -x and -z are ignored.

EXAMPLES

       % kcc -e file
              The input encoding is detected automatically, and the output
              is in EUC encoding.

       % kcc -sj file1 file2
              Two files in Shift JIS are concatenated and converted to JIS.

       % command | kcc -k+J
              The output of command is converted to 7 bit JIS (JIS X0208
              kanji designated by ESC&@ESC$B, JIS Roman alphanumerics, and
              halfwidth kana designated by ESC(I).

       % kcc -c file
              The encoding of file is detected; no conversion is done.

BUG

       Auto detection of the input encoding works well in normal cases, but
       it has the following problems.

       7 bit JIS is recognized with certainty by its escape sequences.  EUC
       and DEC are treated as the same (referred to as the EUC series).
       Halfwidth kana in 8 bit JIS are the same as halfwidth kana in Shift
       JIS (referred to as the Shift JIS series).  However, the EUC series
       and the Shift JIS series, which are both 8 bit encodings, share large
       parts of the same byte regions.  So the real problem in auto detection
       is telling these two apart.

       Detection of the EUC series versus the Shift JIS series is done line
       by line.  As soon as a line shows that the input is not the Shift JIS
       series, or not the EUC series, the encoding is determined.  When an
       inconsistency is found, the input is treated as "data" and the
       contents of the output are not guaranteed.

       While kcc is still deciding between the EUC series and the Shift JIS
       series after the first 8 bit code is found, conversion is suspended
       and input is held in the buffer.  If the buffer fills up, kcc assumes
       the EUC series and starts converting.  Rationale: documents containing
       kanji usually include JIS non-kanji or JIS level 1 kanji, and in Shift
       JIS these occupy a region not shared with EUC, so Shift JIS is
       detected with certainty.  If the encoding still can’t be determined,
       it is therefore very likely to be EUC.
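
       The line-by-line decision with the buffer fallback can be sketched as
       follows.  This is not kcc's code: it substitutes trial decoding with
       Python's euc_jp and shift_jis codecs for kcc's own byte-range checks,
       ignores 7 bit JIS escape sequences, and the names fits and detect are
       hypothetical.  Returning "euc" when the input ends while still
       undecided is an assumption that extends the rationale above.

```python
import io


def fits(data: bytes, codec: str) -> bool:
    """True if data decodes cleanly with the given Python codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


def detect(stream, bufsize: int = 8192) -> str:
    """Return "ascii", "euc", "sjis" or "data" for a binary stream.

    Simplified sketch: trial decoding stands in for kcc's range checks.
    """
    pending = 0        # bytes held back while the encoding is undecided
    seen_8bit = False
    for line in stream:
        if not seen_8bit and max(line, default=0) < 0x80:
            continue   # pure ASCII so far, nothing to decide yet
        seen_8bit = True
        euc_ok = fits(line, "euc_jp")
        sjis_ok = fits(line, "shift_jis")
        if euc_ok and not sjis_ok:
            return "euc"
        if sjis_ok and not euc_ok:
            return "sjis"
        if not euc_ok and not sjis_ok:
            return "data"          # inconsistent input
        pending += len(line)       # still ambiguous: keep buffering
        if pending >= bufsize:
            return "euc"           # buffer full: assume the EUC series
    # Assumption: an undecided 8 bit input at end of file is taken as EUC.
    return "euc" if seen_8bit else "ascii"


print(detect(io.BytesIO("漢字\n".encode("shift_jis"))))
```

       A small bufsize forces the fallback early, which mirrors the effect
       of the -b option on how long detection may stay undecided.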

       If the input is 8 bit JIS and its halfwidth kana always occur in runs
       of even length, it will be wrongly detected as EUC kanji.  Be careful.
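
       The reason is that each pair of halfwidth kana bytes also forms a
       valid two-byte EUC character.  The sketch below shows this with
       Python's codecs; shift_jis is used only because it shares the 8 bit
       JIS halfwidth kana bytes, and the sample bytes are an arbitrary
       choice.

```python
# Two halfwidth kana, one byte each (halfwidth kana occupy 0xA1-0xDF).
raw = b"\xb6\xc5"

as_kana = raw.decode("shift_jis")  # read as two halfwidth kana
as_euc = raw.decode("euc_jp")      # the same bytes read as one EUC kanji

# A run of odd length leaves a dangling byte, so it is not valid EUC.
try:
    (raw + b"\xb6").decode("euc_jp")
    odd_is_euc = True
except UnicodeDecodeError:
    odd_is_euc = False

print(len(as_kana), len(as_euc), odd_is_euc)
```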

       If the input doesn’t contain halfwidth kana, use -z and detection
       becomes much more accurate.  This is because the shared region is
       then restricted to the area of JIS level 2 kanji.

       The extended region of Shift JIS, the user-defined area of EUC, the C1
       control characters of EUC, and the undefined region of halfwidth kana
       of EUC are outside the range of auto detection, so detection fails if
       the input contains these characters.  Use the -x option to select
       extension mode, or specify the input encoding.

SEE ALSO

       cat(1)

NOTES

       Usually, user-defined characters, extended characters, and
       supplemental kanji characters are each mapped to their counterparts.
       However, characters outside the range of extended characters become
       FCFC (hexadecimal) when converted to Shift JIS.  The C1 control
       character region of EUC and DEC is preserved when converted to JIS,
       but deleted when converted to Shift JIS.  The undefined area of
       halfwidth kana becomes a halfwidth centered dot when converted to
       Shift JIS.  Halfwidth kana become fullwidth kana when converted to
       DEC.

       When  output  is JIS encoding, control characters such as newline, TAB,
       DEL and white space (halfwidth) will be output in ASCII mode.

       When the input encoding is detected wrongly, or the input contains
       characters undefined in the expected character sets, the output is
       undefined.

       This manual was translated by Fumitoshi UKAI <ukai@debian.or.jp> for
       the Debian system, but you can use it for any purpose.