NAME

       sim  - find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda or
       text files

SYNOPSIS

       sim_c [ -[defFnpsS] -r N -w N -o F ] file ... [ / [ file ... ] ]
       sim_c ...
       sim_java ...
       sim_pasc ...
       sim_m2 ...
       sim_lisp ...
       sim_mira ...
       sim_text ...

DESCRIPTION

       Sim_c reads the C files file ...  and looks for pieces of text that are
       similar;  two pieces of program text are similar if they only differ in
       layout, comment, identifiers and the contents of numbers,  strings  and
       characters.   If  any  runs  of  sufficient  length are found, they are
       reported on standard output; the number of significant  tokens  in  the
       run is given between square brackets.
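
       The notion of similarity above can be illustrated with a small sketch.
       It is not sim's actual lexer: the regular expression, the keyword list
       and the names normalize, TOKEN_RE and KEYWORDS are assumptions made
       for this example only.  It drops layout and comments and replaces
       identifiers, numbers, strings and characters by placeholders, so that
       two fragments differing only in those respects yield the same tokens.

```python
import re

# Hypothetical token classes for a C-like language; sim's real lexers differ.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)      # block or line comment
  | (?P<string>"(?:\\.|[^"\\])*")        # string literal
  | (?P<char>'(?:\\.|[^'\\])*')          # character literal
  | (?P<number>\d+(?:\.\d+)?)            # integer or simple decimal
  | (?P<word>[A-Za-z_]\w*)               # identifier or keyword
  | (?P<space>\s+)                       # layout
  | (?P<punct>.)                         # any other single character
""", re.VERBOSE | re.DOTALL)

# A deliberately tiny keyword list, enough for the example.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def normalize(source):
    """Return the token list with layout, comments and names abstracted away."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind in ("comment", "space"):
            continue                      # layout and comments are ignored
        if kind == "word":
            tokens.append(text if text in KEYWORDS else "ID")
        elif kind in ("string", "char", "number"):
            tokens.append(kind.upper())   # contents of literals are ignored
        else:
            tokens.append(text)
    return tokens


a = "int sum(int a, int b) { return a + b; /* add */ }"
b = "int  total(int x,int y)\n{\n  return x+y; // plus\n}"
print(normalize(a) == normalize(b))
```

       The two fragments compare equal after normalization, although they
       differ in layout, comments and identifier names.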

       Sim_java  does  the  same  for  Java,  sim_pasc  for Pascal, sim_m2 for
       Modula-2, sim_lisp for Lisp, and sim_mira for Miranda.  Sim_text  works
       on arbitrary text; it is occasionally useful on shell scripts.

       The  program  can  be  used  for  finding  copied  pieces  of  code  in
       purportedly  unrelated  programs  (with  -s  or  -S),  or  for  finding
       accidentally duplicated code in larger projects (with -f).

       If  a / is present between the input files, the latter are divided into
       a group of "new" files (before the /) and a group of  "old"  files;  if
       there  is  no  /, all files are "new".  Old files are never compared to
       each other.  Since the similarity tester reads the files several times,
       it cannot read from standard input.
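
       The grouping rule can be sketched as follows.  This is an illustration
       only, not sim's code: the function names split_groups and
       comparison_pairs and the parameters not_self and new_vs_old_only are
       hypothetical, standing in for the -s and -S options described in the
       option list.  The sketch assumes that without -s a new file is also
       compared to itself, and that -S leaves only new-versus-old pairs.

```python
from itertools import combinations


def split_groups(args):
    """Split file arguments at the first "/" into (new, old) lists."""
    if "/" in args:
        i = args.index("/")
        return args[:i], args[i + 1:]
    return list(args), []          # no "/": every file is "new"


def comparison_pairs(args, not_self=False, new_vs_old_only=False):
    """List the file pairs that would be compared under the stated assumptions."""
    new, old = split_groups(args)
    pairs = []
    if not new_vs_old_only:
        if not not_self:
            pairs += [(f, f) for f in new]      # a file against itself
        pairs += list(combinations(new, 2))     # new against new
    pairs += [(n, o) for n in new for o in old]  # new against old
    return pairs                                 # old files are never paired


args = ["a.c", "b.c", "/", "c.c", "d.c"]
for pair in comparison_pairs(args, not_self=True):
    print(pair)
```

       Note that the pair of old files c.c and d.c never appears, whatever
       the option settings.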

       There are the following options:

       -d     The  output  is  in a diff(1)-like format instead of the default
              2-column format.

       -e     Each file is compared to each file in isolation; this will  find
              all  similarities  between  all  texts  involved,  regardless of
              duplicates.

       -f     Runs are restricted to pieces  with  balancing  parentheses,  to
              isolate  potential functions (C, Java, Pascal, Modula-2 and Lisp
              only).

       -F     The names of functions in calls are required  to  match  exactly
              (C, Java, Pascal, Modula-2 and Lisp only).

       -n     Similarities found are only summarized, not displayed.

       -o F   The output is written to the file named F.

       -p     The output is given in similarity percentages; see below.

       -r N   The minimum run length is set to N (default is N = 24).

       -s     The  contents  of  a  file  are not compared to itself (-s = not
              self).

       -S     The contents of the new files are compared to the old files only
              - not between themselves.

       -w N   The page width used is set to N columns (default is N = 80).

       The -p option results in lines of the form

              F consists for x % of G material

       meaning that x % of F's text can also be found in G.  Note that this
       relation is not symmetric; it is in fact quite possible for one file
       to consist for 100 % of text from another file, while the other file
       consists for only 1 % of text of the first file, if their lengths
       differ enough.  Note also that the granularity of the recognized text
       is still governed by the -r option or its default.
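
       The asymmetry is plain arithmetic, as the following sketch shows.  The
       token counts are invented for the example, and the function name
       percentage is hypothetical; how sim counts text internally is not
       specified here.

```python
def percentage(matched_tokens, total_tokens):
    """Share of a file's tokens that also occur in the other file."""
    return 100.0 * matched_tokens / total_tokens


# Assumed sizes: a small file copied wholesale into a much larger one.
small_len = 100        # tokens in the small file
large_len = 10_000     # tokens in the large file
shared = 100           # tokens the two files have in common

print(percentage(shared, small_len))   # 100.0
print(percentage(shared, large_len))   # 1.0
```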

       Care has been taken to keep all internal processes linear in the length
       of the input, with the exception  of  the  matching  process  which  is
       almost  linear,  using  a hash table; various other tables are used for
       speed-up.  If, however, there is not enough memory for the tables, they
       are  discarded  in  order  of  unimportance, under which conditions the
       algorithms revert to their quadratic nature.
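
       Why a hash table makes the matching almost linear can be seen from a
       generic sketch.  It is not sim's algorithm; the function name
       find_runs and its parameters are assumptions for this example.  Every
       window of min_run tokens in the first sequence is indexed in a hash
       table, so each window of the second sequence is looked up directly
       instead of being compared against every position.

```python
from collections import defaultdict


def find_runs(a, b, min_run):
    """Return (start_a, start_b, length) for maximal common runs >= min_run."""
    # Index every window of min_run tokens in a by its contents.
    index = defaultdict(list)
    for i in range(len(a) - min_run + 1):
        index[tuple(a[i:i + min_run])].append(i)

    runs = []
    for j in range(len(b) - min_run + 1):
        for i in index.get(tuple(b[j:j + min_run]), ()):
            # Skip positions that merely continue a run starting earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the match to the right as far as the tokens agree.
            length = min_run
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            runs.append((i, j, length))
    return runs


a = "p q A B C D E r".split()
b = "s A B C D E t u".split()
print(find_runs(a, b, 3))   # [(2, 1, 5)]
```

       Without the index, each window of the second sequence would have to be
       compared with every position of the first, which is the quadratic
       behaviour mentioned above.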

AUTHOR

       Dick Grune, Vrije Universiteit, Amsterdam.

BUGS

       Strong periodicity in the input text (like a table of N almost
       identical lines) causes problems.  Sim tries to cope with this but
       cannot avoid giving approximately log N messages about it.  The best
       advice is still to take the offending files out of the game.

       Since  it  uses  lex(1)  on some systems, it may dump core on any weird
       construction that overflows lex’s internal buffers.