NAME

       regcomp, regexec, regsub, regerror - regular expression handler

SYNOPSIS

       #include <regexp.h>

       regexp *regcomp(exp)
       char *exp;

       int regexec(prog, string)
       regexp *prog;
       char *string;

       regsub(prog, source, dest)
       regexp *prog;
       char *source;
       char *dest;

       regerror(msg)
       char *msg;

DESCRIPTION

       These   functions  implement  egrep(1)-style  regular  expressions  and
       supporting facilities.

       Regcomp compiles a regular expression into a structure of type  regexp,
       and  returns  a  pointer  to  it.   The  space has been allocated using
       malloc(3) and may be released by free(3).

       Regexec matches a NUL-terminated string against  the  compiled  regular
       expression  in  prog.   It returns 1 for success and 0 for failure, and
       adjusts the contents of prog’s startp and endp (see below) accordingly.

       The  members  of a regexp structure include at least the following (not
       necessarily in order):

              char *startp[NSUBEXP];
              char *endp[NSUBEXP];

       where NSUBEXP is defined (as 10) in the header file.  Once a successful
       regexec has been done using the regexp, each startp-endp pair describes
       one substring within the string, with the startp pointing to the  first
       character of the substring and the endp pointing to the first character
       following the substring.  The 0th substring is the substring of  string
       that  matched  the  whole  regular  expression.   The  others are those
       substrings that matched parenthesized expressions  within  the  regular
       expression,  with  parenthesized  expressions numbered in left-to-right
       order of their opening parentheses.
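
       As a rough illustration of how one startp-endp pair can be read, the
       sketch below copies substring n into a caller-supplied buffer.  The
       struct match_spans is a hypothetical stand-in for the two documented
       members of regexp, used so that the fragment compiles without
       regexp.h; the name copy_substring is likewise invented.  It assumes
       that pairs which took no part in the match are left NULL, which this
       page does not promise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in for the two documented members of regexp. */
struct match_spans {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy substring n into buf (capacity cap, including the NUL).
 * Returns the substring length, or -1 if n is out of range, the
 * pair is unset (assumed NULL), or buf is too small.
 */
long copy_substring(const struct match_spans *m, int n, char *buf, size_t cap)
{
    size_t len;

    assert(m != NULL && buf != NULL);
    if (n < 0 || n >= NSUBEXP || m->startp[n] == NULL || m->endp[n] == NULL)
        return -1;
    len = (size_t)(m->endp[n] - m->startp[n]);  /* endp is one past the end */
    if (len + 1 > cap)
        return -1;
    memcpy(buf, m->startp[n], len);
    buf[len] = '\0';
    return (long)len;
}
```

       With a real regexp, the same pointer arithmetic would apply to
       prog->startp[n] and prog->endp[n] after a successful regexec.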

       Regsub copies source to dest, making  substitutions  according  to  the
       most  recent  regexec  performed  using  prog.  Each instance of ‘&’ in
       source is replaced by the substring indicated by startp[0] and endp[0].
       Each instance of ‘\n’, where n is a digit, is replaced by the substring
       indicated by startp[n] and endp[n].  To get a literal ‘&’ or ‘\n’  into
       dest,  prefix  it with ‘\’; to get a literal ‘\’ preceding ‘&’ or ‘\n’,
       prefix it with another ‘\’.
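
       The substitution rules above can be sketched as follows.  This is an
       illustration of the described behaviour, not the library's own code:
       struct match_spans is a hypothetical stand-in for the documented
       members of regexp, substitute_sketch is an invented name, unset pairs
       are assumed to be NULL and to expand to nothing, and no bounds
       checking is done on dest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in for the two documented members of regexp. */
struct match_spans {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Expand source into dest.  dest must be large enough for the
 * result; this sketch performs no bounds checking.
 */
void substitute_sketch(const struct match_spans *m, const char *source,
                       char *dest)
{
    const char *src = source;
    char *dst = dest;

    assert(m != NULL && source != NULL && dest != NULL);
    while (*src != '\0') {
        int n = -1;                  /* substring to insert, if any */
        char c = *src++;

        if (c == '&') {
            n = 0;                   /* '&' stands for the whole match */
        } else if (c == '\\' && *src >= '0' && *src <= '9') {
            n = *src++ - '0';        /* backslash-digit: substring n */
        } else if (c == '\\' && (*src == '\\' || *src == '&')) {
            c = *src++;              /* escaped '\' or '&' is literal */
        }

        if (n < 0) {
            *dst++ = c;              /* ordinary character */
        } else if (m->startp[n] != NULL && m->endp[n] != NULL) {
            size_t len = (size_t)(m->endp[n] - m->startp[n]);

            memcpy(dst, m->startp[n], len);
            dst += len;
        }
    }
    *dst = '\0';
}
```

       For instance, if substring 1 is ‘key’ and substring 2 is ‘value’, the
       source ‘\2=\1’ expands to ‘value=key’.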

       Regerror is called whenever an error is detected in  regcomp,  regexec,
       or regsub.  The default regerror writes the string msg, with a suitable
       indicator of origin, on the standard error output and invokes  exit(3).
       Regerror can be replaced by the user if other actions are desirable.
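
       One possible replacement is sketched below: it records the message
       and returns instead of exiting, so that the caller can notice the
       failure from the return value of regcomp.  The void return type is an
       assumption (the synopsis gives none), and the buffer
       last_regexp_error is a hypothetical name introduced here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical buffer holding the most recent message; empty if none. */
static char last_regexp_error[128];

/*
 * Replacement for the default regerror: remember the message
 * (truncated if necessary) and return rather than exit.
 */
void regerror(char *msg)
{
    assert(msg != NULL);
    strncpy(last_regexp_error, msg, sizeof last_regexp_error - 1);
    last_regexp_error[sizeof last_regexp_error - 1] = '\0';
}
```

       A program linked with this definition could inspect
       last_regexp_error after regcomp returns NULL.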

REGULAR EXPRESSION SYNTAX

       A  regular  expression  is zero or more branches, separated by ‘|’.  It
       matches anything that matches one of the branches.

       A branch is zero or more pieces, concatenated.  It matches a match  for
       the first, followed by a match for the second, etc.

       A  piece  is  an  atom  possibly followed by ‘*’, ‘+’, or ‘?’.  An atom
       followed by ‘*’ matches a sequence of 0 or more matches  of  the  atom.
       An  atom followed by ‘+’ matches a sequence of 1 or more matches of the
       atom.  An atom followed by ‘?’ matches a match of the atom, or the null
       string.

       An  atom  is  a regular expression in parentheses (matching a match for
       the regular expression), a range (see below), ‘.’  (matching any single
       character), ‘^’ (matching the null string at the beginning of the input
       string), ‘$’ (matching the null string at the end of the input string),
       a  ‘\’  followed  by a single character (matching that character), or a
       single character with no other significance (matching that  character).

       A  range  is  a  sequence  of characters enclosed in ‘[]’.  It normally
       matches any single character from the sequence.  If the sequence begins
       with  ‘^’,  it  matches  any  single character not from the rest of the
       sequence.  If two characters in the sequence are separated by ‘-’, this
       is  shorthand  for the full list of ASCII characters between them (e.g.
       ‘[0-9]’ matches any decimal digit).  To include a literal  ‘]’  in  the
       sequence,  make  it the first character (following a possible ‘^’).  To
       include a literal ‘-’, make it the first or last character.
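
       The range rules can be illustrated with a small membership test.
       This is a sketch of the semantics described above, not the library's
       code; range_matches is an invented name, and the caller is assumed to
       have already located the closing ‘]’ and to pass only the characters
       between the brackets.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if c is matched by the range whose body (the text
 * between '[' and the closing ']') is body[0..len-1], else 0.
 * A reversed span such as z-a simply matches nothing here.
 */
int range_matches(const char *body, size_t len, char c)
{
    size_t i = 0;
    int negate = 0;
    int found = 0;
    unsigned char uc = (unsigned char)c;

    assert(body != NULL);
    if (len > 0 && body[0] == '^') {    /* leading '^' inverts the set */
        negate = 1;
        i = 1;
    }
    while (i < len) {
        unsigned char lo = (unsigned char)body[i];

        if (i + 2 < len && body[i + 1] == '-') {
            /* '-' with a character on each side: a span lo..hi */
            unsigned char hi = (unsigned char)body[i + 2];

            if (lo <= uc && uc <= hi)
                found = 1;
            i += 3;
        } else {
            /* single character, including '-' when first or last */
            if (uc == lo)
                found = 1;
            i += 1;
        }
    }
    return found != negate;
}
```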

AMBIGUITY

       If a regular expression could match two different parts  of  the  input
       string,  it will match the one which begins earliest.  If both begin in
       the same place but match different lengths, or match the  same  length
       in different ways, life gets messier, as follows.

       In  general,  the possibilities in a list of branches are considered in
       left-to-right order, the  possibilities  for  ‘*’,  ‘+’,  and  ‘?’  are
       considered  longest-first,  nested  constructs  are considered from the
       outermost in, and  concatenated  constructs  are  considered  leftmost-
       first.  The match that will be chosen is the one that uses the earliest
       possibility in the first choice that has to be made.  If there is  more
       than  one  choice,  the  next will be made in the same manner (earliest
       possibility) subject to the decision  on  the  first  choice.   And  so
       forth.

       For  example,  ‘(ab|a)b*c’  could  match ‘abc’ in one of two ways.  The
       first choice is between ‘ab’ and ‘a’; since ‘ab’ is earlier,  and  does
       lead  to  a  successful  overall match, it is chosen.  Since the ‘b’ is
       already spoken for, the ‘b*’ must match its last possibility—the  empty
       string—since it must respect the earlier choice.

       In  the particular case where no ‘|’s are present and there is only one
       ‘*’, ‘+’, or ‘?’, the net effect is that  the  longest  possible  match
       will  be  chosen.   So  ‘ab*’,  presented  with  ‘xabbbby’,  will match
       ‘abbbb’.  Note that if ‘ab*’ is  tried  against  ‘xabyabbbz’,  it  will
       match  ‘ab’  just  after  ‘x’,  due  to  the begins-earliest rule.  (In
       effect, the decision on where to start the match is the first choice to
       be  made,  hence  subsequent choices must respect it even if this leads
       them to less-preferred alternatives.)
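
       The two rules in that last example can be seen in a hand-written
       matcher for the single pattern ‘ab*’.  This is only an illustration
       of the ordering of choices, not how the library works internally;
       match_a_bstar is an invented name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the match of 'ab*' in s.  On success store its start in
 * *start and its length in *len and return 1; return 0 if s
 * contains no 'a'.
 */
int match_a_bstar(const char *s, const char **start, size_t *len)
{
    const char *p;
    const char *q;

    assert(s != NULL && start != NULL && len != NULL);
    for (p = s; *p != '\0'; p++) {   /* earliest starting point wins */
        if (*p != 'a')
            continue;
        q = p + 1;
        while (*q == 'b')            /* then '*' takes the longest run */
            q++;
        *start = p;
        *len = (size_t)(q - p);
        return 1;
    }
    return 0;
}
```

       Because the starting point is fixed before the run of ‘b’s is
       measured, the longer ‘abbb’ later in ‘xabyabbbz’ is never considered.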

SEE ALSO

       egrep(1), expr(1)

DIAGNOSTICS

       Regcomp  returns  NULL  for  a  failure  (regerror  permitting),  where
       failures   are  syntax  errors,  exceeding  implementation  limits,  or
       applying ‘+’ or ‘*’ to a possibly-null operand.

HISTORY

       Both code and manual page were written at the University of Toronto.
       They are intended to be compatible with the Bell V8 regexp(3), but are
       not derived from Bell code.

BUGS

       Empty branches and empty regular expressions are not portable to V8.

       The restriction against applying ‘*’ or ‘+’ to a possibly-null  operand
       is an artifact of the simplistic implementation.

       Does  not  support egrep’s newline-separated branches; neither does the
       V8 regexp(3), though.

       Due to emphasis on compactness  and  simplicity,  it’s  not  strikingly
       fast.  It does give special attention to handling simple cases quickly.

                                 2 April 1986