NAME

       msort - sort records in complex ways

SYNOPSIS

       msort <options> [<input file>]

DESCRIPTION

       msort is a program for sorting text files in sophisticated ways.  It
       was developed initially for alphabetizing dictionaries of languages in
       which the ordering may be quite different from that of English, but it
       has many other uses.

       msort allows you to sort blocks of text delimited in a number of ways,
       rather than just lines, and to specify particular fields of a record
       as sort keys, either by their position, counted from either end, or by
       matching regular expressions to their tags.

       msort  is capable of sorting on multiple keys, so that when two records
       tie on one key, the tie may be broken on another. Any or all  keys  may
       be  optional.   How  absent  optional  keys are ordered with respect to
       present keys may be set separately for each key.
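
       As an illustration of tie-breaking on multiple keys, here is a minimal
       Python sketch.  It is not msort's implementation; the records and the
       choice of keys (surname, then given name) are assumptions made for the
       example.

```python
# Hypothetical records: (surname, given name).
records = [("Smith", "Zoe"), ("Jones", "Al"), ("Smith", "Amy")]


def multi_key(record):
    # The first element is the primary key; the second element is only
    # consulted when two records tie on the first.
    surname, given = record
    return (surname, given)


ordered = sorted(records, key=multi_key)
```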

       msort allows you  to  specify  arbitrary  sort  orders  and  to  define
       virtually  unlimited  numbers  of  multigraphs of effectively unlimited
       length.  The sort order and multigraphs are defined separately for each
       key. If your system has locale support, you can also use locale
       collation rules instead of specifying your own sort order.
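
       The idea of a custom sort order with multigraphs can be sketched in
       Python as follows.  This is only an illustration under assumptions:
       the sort order (lowercase ASCII with "ch" ranked after "c" and "ll"
       after "l"), the function names, and the greedy longest-match rule are
       all hypothetical and are not taken from msort itself.

```python
import string


def tokenize(word, multigraphs):
    # Split a word into collation units, preferring the longest
    # multigraph that matches at each position.
    by_length = sorted(multigraphs, key=len, reverse=True)
    units, i = [], 0
    while i < len(word):
        for m in by_length:
            if word.startswith(m, i):
                units.append(m)
                i += len(m)
                break
        else:
            units.append(word[i])
            i += 1
    return units


# Assumed sort order: a b c ch d ... l ll m ... z
ORDER = []
for letter in string.ascii_lowercase:
    ORDER.append(letter)
    if letter == "c":
        ORDER.append("ch")
    if letter == "l":
        ORDER.append("ll")

RANK = {unit: rank for rank, unit in enumerate(ORDER)}
MULTIGRAPHS = [unit for unit in ORDER if len(unit) > 1]


def sort_key(word):
    # Only handles lowercase ASCII words; anything else raises KeyError.
    return [RANK[unit] for unit in tokenize(word, MULTIGRAPHS)]


words = ["chico", "cuna", "llama", "luz", "cosa"]
ordered = sorted(words, key=sort_key)
```

       With this order, every word beginning with "c" precedes every word
       beginning with the multigraph "ch", which a plain code point sort
       would not do.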

       msort provides twelve types of key comparison: lexicographic,  numeric,
       numeric  string, hybrid, by string length, by angle, by date, by domain
       name, by time, by ISO8601 date/time stamp, by month name, and random.

       Which month names are recognized depends on the -s flag. If the -s
       flag is used on the same key and its argument is the name of a file,
       the month names
       are read from the file, which should be in the same format  as  a  sort
       order  definition  file.  If  the -s flag is used and its argument is a
       locale name, the month names recognized will be  the  month  names  and
       abbreviations  associated  with the specified locale. If the -s flag is
       not used the month  names  recognized  will  be  the  month  names  and
       abbreviations  associated  with the current locale. If your system does
       not have locale support and you do not use the  -s  flag  to  read  the
       month names from a file, the month names recognized will be the English
       month names and abbreviations.

       msort can reverse the characters in a key, allowing it to  be  used  to
       generate reverse dictionaries.
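
       A reverse dictionary groups words by their endings.  The following
       Python sketch shows the principle only; the word list is made up, and
       reversing by code point (as done here) would separate combining
       characters from their bases, which a real implementation would need
       to handle.

```python
words = ["walking", "cat", "talking", "bat", "king"]

# Compare the words by their reversed spelling, but output them unchanged.
reverse_sorted = sorted(words, key=lambda w: w[::-1])
```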

       A choice of sorting algorithms is provided.

       msort   fully  supports  Unicode.  The  text  to  be  sorted,  and  all
       specifications, should be in UTF-8 Unicode. (If you  have  plain  ASCII
       text,  this  is  not  a  problem as ASCII is a subset of Unicode.) Full
       Unicode case-folding is available, in Turkic and  non-Turkic  variants.
       Unicode normalization is performed before sorting.

       For usage information, execute msort with no arguments.

       Full  information about msort is currently to be found in the reference
       manual, which is distributed as a PDF (Portable Document Format)  file.
       If  a  copy  is not available locally, you can download it from msort’s
       home page:
       http://billposer.org/Software/msort.html

OPTIONS

   Informational options
       -h,--help
              Print usage message

       -v,--version
              Print version message

       -D,--defaults
              List defaults

       -F,--general-options
              List general command line options

       -G,--gnu-equivalences
              List equivalents for GNU sort command line options.

       -H,--informational-options
              List informational command line options

       -K,--key-specific-options
              List key-specific command line options

       -L,--limits
              List limits

       -N,--number-systems
              List the supported number systems.

   General options
       -b,--block
              A record is terminated by two or more newlines
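
              A minimal Python sketch of this record definition, assuming
              that leading and trailing newlines of the input are ignored
              (an assumption of the example, not a documented msort rule):

```python
import re


def split_blocks(text):
    # Two or more consecutive newlines terminate a record.
    return [b for b in re.split(r"\n{2,}", text.strip("\n")) if b]


text = "pear\nfruit\n\napple\nfruit\n\n\nkale\nleaf\n"
blocks = sorted(split_blocks(text))
```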

       -l,--line
              A record consists of a single line

       -r,--record-separator <separator>
              A record is terminated by the specified separator character

       -O,--fixed-size-record <bytes>
              A record consists of the specified number of bytes.

       -d,--field-separators <character>+
              Fields are delimited by the named character(s)

       -w,--whole
              Sort on the entire text of the record

       -a,--algorithm <algorithm>
              Use   the   specified   sort   algorithm.   The   choices   are:
              I(nsertionSort),   M(ergeSort),  Q(uickSort),  and  S(hellSort).
              Note  that  InsertionSort  and  MergeSort  are   stable,   while
              QuickSort  and ShellSort are unstable. The default is QuickSort.
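
              What stability means can be shown with Python's built-in
              sorted(), which is documented to be stable.  The records are
              hypothetical; the sketch does not use msort's own algorithms.

```python
# Records are (letter, sequence number); only the letter is the sort key.
records = [("b", 1), ("a", 2), ("b", 0), ("a", 1)]

# A stable sort keeps records with equal keys in their input order.
by_letter = sorted(records, key=lambda r: r[0])
```

              An unstable algorithm would be free to emit the two "a"
              records, or the two "b" records, in either order.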

       -M,--initial-maximum-records <records>
              Set initial maximum number of records

       -m,--line-end-carriage-return
              End-of-line in the input  data  is  marked  by  Carriage  Return
              (0x0D) as on the Macintosh rather than by Line Feed (0x0A) as on
              Unix systems.

       -I,--invert-globally
              Invert sense of comparisons globally

       -B,--BMP
              No characters fall outside the Basic Multilingual Plane (that
              is, no character has a value greater than 0xFFFF).

       -p,--reserve-private-use-area
              Do  not  make internal use of the Private Use areas. By default,
              multigraphs  are  assigned  internally  to  codepoints  in   the
              Supplementary  Private Use areas if full Unicode is in use or to
              codepoints in the Private Use area if input is restricted to the
              Basic  Multilingual  Plane  by  means  of the -B option. If your
              input makes use of the Private Use areas, this  option  prevents
              interference  with your input. In this case, multigraphs will be
              assigned to the Low and High  Surrogate  areas  (0xD800-0xDFFF).
              Note that this limits the number of multigraphs to 2,048.
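
              The limit follows from the size of the surrogate range, as
              this small Python calculation shows:

```python
# The surrogate areas span 0xD800 through 0xDFFF inclusive.
FIRST_SURROGATE, LAST_SURROGATE = 0xD800, 0xDFFF

available_codepoints = LAST_SURROGATE - FIRST_SURROGATE + 1
```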

       -P,--random-seed <seed>
              Set  the  seed for the random number generator. If not set here,
              it is set to a value determined by the time. The  seed  used  is
              reported in the log. This option allows runs to be replicated.

       -Q,--check-only
              Check  whether  the input is already sorted. Do not generate any
              output.  Exit status is 0 if input is already sorted, 11 if  not
              sorted.

       -1,--in <input file name>

       -2,--out <output file name>
              If the output file is the same as the input file, the input file
              will be overwritten. The input file will not be  overwritten  if
              the run is unsuccessful.

       -j,--suppress-log
              Suppress  output  to the log. If this flag is given before there
              is any output to the log from a command line flag, nothing  will
              be written to the log and the log file will not be created. If a
              command line flag generates a log message before  this  flag  is
              processed, the log file will be created but no log messages will
              be written to it once this flag is processed. To guarantee  that
              no  attempt  will  be  made  to  open a log file, give this flag
              first.

       -q,--quiet
              Be quiet - do not chat while working

       -u,--unicode-normalization <mode>
              Select Unicode normalization mode. The choices of  mode  are:  c
              for  normalization  form  C  (NFC),  d  for normalization form D
              (NFD), and n for no normalization. The default is NFC.
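
              Why normalization matters for sorting can be seen in the
              following Python sketch, which uses the standard unicodedata
              module.  The function name normalize_key is hypothetical; the
              mode letters follow the list above.

```python
import unicodedata

FORMS = {"c": "NFC", "d": "NFD"}


def normalize_key(text, mode="c"):
    # Mode "n" leaves the text untouched.
    if mode == "n":
        return text
    return unicodedata.normalize(FORMS[mode], text)


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
```

              Without normalization the two spellings are different code
              point sequences and would sort apart; after normalization to
              the same form they are identical.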

   Key specific options
       -e,--character-range <m,n>
              Sort on characters m through n. Positive indices start from one.
              Negative  indices  indicate  position with respect to the end of
              the record.  For example, the range 3,-2 consists of  the  third
              character through the next-to-last character.
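
              The index arithmetic can be sketched in Python as follows.
              The function name is hypothetical, and the sketch does not
              check for invalid indices such as zero.

```python
def character_range(record, m, n):
    def to_offset(index):
        # Positive indices count from one at the start of the record;
        # negative indices count back from the end.
        return index - 1 if index > 0 else len(record) + index

    # The range is inclusive at both ends.
    return record[to_offset(m): to_offset(n) + 1]
```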

       -n,--position <POS>(,<POS>)
              Sort  on  the specified POS or contiguous range of POSs, where a
              POS is of the form  <field  number>(.<character  number>).  Both
              counts  begin  at  one.  Field numbers but not character numbers
              may be negative, in which case they are counted from the  right.
              Thus,  1.2  is  the second character of the first field; -2.1 is
              the first character of the next to last field.
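
              A Python sketch of how a single POS might be resolved.  The
              function name and the whitespace-separated sample record are
              assumptions of the example.

```python
def resolve_pos(fields, field_number, char_number=1):
    # Field numbers start at one; negative field numbers count from the
    # right, so -1 is the last field. Character numbers start at one.
    index = field_number - 1 if field_number > 0 else field_number
    return fields[index][char_number - 1]


fields = "alpha beta gamma".split()
```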

       -t,--tag <tag regexp>
              Sort on the field with the specified tag

       -o,--optional <comparison>
              The key is optional. The argument (<, =, or >) specifies how
              a record in which the key is absent compares to a record in
              which it is present.
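
              One way to picture this is a comparison function that treats
              a missing key specially.  This Python sketch is an assumption
              about the general technique, not msort's code; None stands
              for an absent key.

```python
from functools import cmp_to_key

ABSENT_RESULT = {"<": -1, "=": 0, ">": 1}


def make_comparison(absent):
    # absent is "<", "=" or ">": how an absent key compares to a present one.
    def compare(a, b):
        if a is None and b is None:
            return 0
        if a is None:
            return ABSENT_RESULT[absent]
        if b is None:
            return -ABSENT_RESULT[absent]
        return (a > b) - (a < b)

    return compare


keys = ["b", None, "a"]
absent_first = sorted(keys, key=cmp_to_key(make_comparison("<")))
absent_last = sorted(keys, key=cmp_to_key(make_comparison(">")))
```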

       -C,--fold-case
              Fold case

       -z,--fold-case-turkic
              Fold case with additional Turkic conversions.
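
              The difference between the two variants concerns the letters
              I and i.  In this Python sketch, str.casefold() supplies the
              non-Turkic folding, and the hypothetical fold_turkic applies
              the two Turkic mappings first.

```python
def fold_turkic(text):
    # Turkic mappings: dotted capital I (U+0130) folds to "i", and plain
    # capital "I" folds to dotless i (U+0131).
    text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.casefold()
```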

       -c,--comparison-type <comparison type>
              a(ngle), l(exicographic), i(so8601 date/time), t(ime), D(omain
              name/email address), d(ate), m(onth name), n(umeric), N(umeric
              string), s(ize), h(ybrid), r(andom)

       -y,--number-system <number system>
              Specifies the number system expected for this key. This  affects
              only  numeric  and  numeric  string  keys. There are two special
              values. If the number system is "all", records may  contain  any
              number  system  that  msort can interpret. Different records may
              contain different number systems. If the number system is
              "any", records may contain any number system that msort can
              interpret, but all records must make use of the same number
              system.   msort sets the number system on the basis of the first
              record.

       -f,--date-format <date format>
              Permutation of ymd with separators, e.g. y-m-d for international
              date format, m/d/y for American date format, or a permutation of
              yd with separators, e.g. y-d, for day-of-year dates.  All  three
              components  may  be  numbers in any available number system. The
              month field may also be a month name,  determined  by  the  same
              devices as independent month name fields.
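
              A Python sketch of how such a format string can drive key
              extraction.  It assumes purely numeric components and a
              hypothetical function name; month names and msort's other
              number systems are not handled.

```python
import re


def date_key(text, date_format):
    # The format names the components in order, e.g. "m/d/y" or "y-d".
    names = re.split(r"[^ymd]", date_format)
    parts = re.split(r"\D+", text.strip())
    values = dict(zip(names, map(int, parts)))
    # Day-of-year formats have no month; use 0 so the tuple still compares.
    return (values["y"], values.get("m", 0), values["d"])


dates = ["12/25/1999", "01/02/2001", "07/04/1999"]
ordered = sorted(dates, key=lambda d: date_key(d, "m/d/y"))
```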

       -W,--sort-order-file-separators <file name>
              Read from the named file the list of characters to be treated
              as separators in the sort order definition file.

       -S,--substitutions <file name>
              Read substitutions from named file

       -s,--sort-order <file name>|<locale name>|"locale"
              If the argument is a file name, it is taken to be a  sort  order
              file  and  the  sort order for the key is read from the file. If
              the argument is a locale name,  the  collation  rules  for  that
              locale  are  used.  If  the  argument is "locale", the collation
              rules for the current locale are used.

       -T,--transformations <(d)(e)(s)>
              Apply  the  specified   transformations.    d   specifies   that
              diacritics  are  to  be  stripped.  Separately encoded combining
              diacritics are removed. Characters with  diacritics  represented
              by single  codepoints  are replaced with the corresponding ASCII
              character without the diacritics, if there is one.  e  specifies
              that  enclosed characters, that is, characters within circles or
              parentheses, are to be replaced  with  the  corresponding  plain
              ASCII character if there is one.  s specifies that characters in
              special styles are to be replaced with the  corresponding  plain
              ASCII  character if there is one. Stylistic equivalents include:
              small capitals (e.g. U+1D04), script forms (e.g. U+212C),  black
              letter  forms  (e.g.  U+212D),  Arabic  presentation forms (e.g.
              U+FE81), Hebrew  presentation  forms  (e.g.  U+FB1D),  fullwidth
              forms  (e.g.  U+FF01),  halfwidth  forms  (e.g. U+FF7B), and the
              mathematical alphanumeric symbols (e.g. U+1D400).
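
              Comparable effects can be approximated with Python's
              unicodedata module.  This is only a sketch with hypothetical
              function names: compatibility normalization (NFKC) covers
              many, but not all, of the stylistic forms listed above, and
              msort's own mapping tables may differ.

```python
import unicodedata


def strip_diacritics(text):
    # Decompose, then drop the combining marks. Letters that have no
    # decomposition are left unchanged.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def plain_style(text):
    # Compatibility normalization maps fullwidth forms and mathematical
    # alphanumeric symbols to their plain equivalents.
    return unicodedata.normalize("NFKC", text)
```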

       -x,--exclusion-file <file name>
              Read exclusions from named file

       -X,--exclude-characters <exclusions>
              Exclude specified characters

       -i,--invert-locally
              Invert sense of comparisons

       -R,--reverse-key
              Reverse characters of key

       -A,--first-character-only
              Ignore  all  but  the  first  character  of  the  field,   after
              substitutions, exclusions, etc.

       Note: long options may not be available on your system.

SEE ALSO

       sort(1), uninum(3)

AUTHOR

       Bill Poser (billposer@alum.mit.edu)

LICENSE

       GNU   General  Public  License  (http://www.gnu.org/licenses/gpl.html),
       version 3.