PCRE - Perl-compatible regular expressions

NAME

       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE


       In  normal  use  of  PCRE,  if  the  subject  string  that is passed to
       pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
       short  to  match  the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
       There are circumstances where it might be helpful to  distinguish  this
       case from other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user’s keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
       reflecting the  character  that  has  been  typed,  for  example.  This
       immediate feedback is likely to be a better user interface than a check
       that is delayed until the  entire  string  has  been  entered.  Partial
       matching  can  also sometimes be useful when the subject string is very
       long and is not all available at once.

       PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
       PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
       pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
       for PCRE_PARTIAL_SOFT. The essential difference between the two options
       is whether or not a  partial  match  is  preferred  to  an  alternative
       complete  match,  though  the  details  differ between the two matching
       functions. If both options are set, PCRE_PARTIAL_HARD takes precedence.

       Setting a partial matching option disables two of PCRE’s optimizations.
       PCRE remembers the  last  literal  byte  in  a  pattern,  and  abandons
       matching  immediately  if  such  a  byte  is not present in the subject
       string. This optimization cannot be used  for  a  subject  string  that
       might  match only partially. If the pattern was studied, PCRE knows the
       minimum length of a matching string, and does not  bother  to  run  the
       matching  function  on  shorter  strings.  This  optimization  is  also
       disabled for partial matching.

PARTIAL MATCHING USING pcre_exec()


       A partial match occurs during a call to pcre_exec() whenever the end of
       the  subject  string  is  reached  successfully,  but  matching  cannot
       continue because more characters are  needed.  However,  at  least  one
       character  must have been matched. (In other words, a partial match can
       never be an empty string.)

       If PCRE_PARTIAL_SOFT is set,  the  partial  match  is  remembered,  but
       matching continues as normal, and other alternatives in the pattern are
       tried.  If  no  complete  match  can  be  found,  pcre_exec()   returns
       PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
       two slots in the offsets vector, the first of them is set to the offset
       of the earliest character that was inspected when the partial match was
       found. For convenience, the second offset points  to  the  end  of  the
       string so that a substring can easily be identified.

       For  the majority of patterns, the first offset identifies the start of
       the partially  matched  string.  However,  for  patterns  that  contain
       lookbehind  assertions,  or  \K,  or  begin  with  \b  or  \B,  earlier
       characters have been  inspected  while  carrying  out  the  match.  For
       example:

         /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the offsets after a partial match are for
       the  substring  "abc12",  because  all  these  characters are needed if
       another match is tried with extra characters added.

       If there is more than one partial match, the first one that  was  found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If  this  is  matched  against  the  subject  string  "abc123dog", both
       alternatives fail to match, but the  end  of  the  subject  is  reached
       during   matching,   so   PCRE_ERROR_PARTIAL  is  returned  instead  of
       PCRE_ERROR_NOMATCH. The  offsets  are  set  to  3  and  9,  identifying
       "123dog"  as  the first partial match that was found. (In this example,
       there are two partial matches,  because  "dog"  on  its  own  partially
       matches the second alternative.)
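
       The  two  offsets  can be turned into a substring with a few lines
       of C. The following is a minimal sketch, not part of PCRE; the
       function name copy_partial_span() is hypothetical, and it assumes
       the offsets vector has at least two slots that were filled in
       after a PCRE_ERROR_PARTIAL return.

```c
#include <stddef.h>
#include <string.h>

/* Copy subject[ovector[0] .. ovector[1]) into out as a C string.
   Returns the number of bytes copied, or -1 if the offsets are
   inconsistent or the output buffer is too small.
   Hypothetical helper: PCRE itself only fills in the offsets. */
int copy_partial_span(const char *subject, const int *ovector,
                      char *out, size_t outsize)
{
  int start = ovector[0];   /* earliest character inspected */
  int end = ovector[1];     /* end of the subject string */
  size_t len;

  if (start < 0 || end < start) return -1;
  len = (size_t)(end - start);
  if (len + 1 > outsize) return -1;   /* need room for the NUL */

  memcpy(out, subject + start, len);
  out[len] = '\0';
  return (int)len;
}
```

       With the subject "abc123dog" and offsets 3 and 9, this copies
       "123dog"; with "xyzabc12" and offsets 3 and 8, it copies "abc12",
       which includes the lookbehind text.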

       If    PCRE_PARTIAL_HARD    is   set   for   pcre_exec(),   it   returns
       PCRE_ERROR_PARTIAL as  soon  as  a  partial  match  is  found,  without
       continuing  to  search  for  possible  complete matches. The difference
       between the two options can be illustrated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if  possible). If it is matched against the string
       "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On the other hand, if the  pattern  is  made  ungreedy  the  result  is
       different:

         /dog(sbody)??/

       In  this case the result is always a complete match because pcre_exec()
       finds that first, and it never continues  after  finding  a  match.  It
       might  be  easier  to  follow  this  explanation by thinking of the two
       patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never  match  "dogsbody"  when  pcre_exec()  is
       used, because it will always find the shorter match first.

PARTIAL MATCHING USING pcre_dfa_exec()


       The  pcre_dfa_exec()  function moves along the subject string character
       by character, without backtracking, searching for all possible  matches
       simultaneously.  If the end of the subject is reached before the end of
       the pattern, there  is  the  possibility  of  a  partial  match,  again
       provided that at least one character has matched.

       When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise,  the  complete  matches
       are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of  the  string
       that  was  inspected when the longest partial match was found is set as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because  pcre_dfa_exec()  always searches for all possible matches, and
       there is no difference between  greedy  and  ungreedy  repetition,  its
       behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is  set.
       Consider the string "dog" matched against the  ungreedy  pattern  shown
       above:

         /dog(sbody)??/

       Whereas  pcre_exec()  stops  as soon as it finds the complete match for
       "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
       so returns that when PCRE_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES


       If  a  pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following  character  cannot  take  place, so a partial match is found.
       However, pcre_exec() carries on with normal matching, which matches  \b
       at  the  end  of  the subject when the last character is a letter, thus
       finding   a   complete   match.   The   result,   therefore,   is   not
       PCRE_ERROR_PARTIAL.   The  same  thing  happens  with  pcre_dfa_exec(),
       because it also finds the complete match.

       Using PCRE_PARTIAL_HARD in this  case  does  yield  PCRE_ERROR_PARTIAL,
       because then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS


       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations  were  implemented  in  the  pcre_exec()  function,   the
       PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be
       used with all patterns. From release 8.00 onwards, the restrictions  no
       longer  apply,  and  partial matching with pcre_exec() can be requested
       for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated  metasequences. If PCRE_PARTIAL was set for a pattern that did
       not conform to the restrictions, pcre_exec() returned  the  error  code
       PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The
       PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled
       pattern can be used for partial matching now always returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST


       If  the  escape  sequence  \P  is  present in a pcretest data line, the
       PCRE_PARTIAL_SOFT option is used for  the  match.  Here  is  a  run  of
       pcretest that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match

       The  first  data  string  is  matched completely, so pcretest shows the
       matched substrings.  The  remaining  four  strings  do  not  match  the
       complete pattern, but the first two are partial matches. Similar output
       is obtained when pcre_dfa_exec() is used.

       If the escape sequence \P is present more than once in a pcretest  data
       line, the PCRE_PARTIAL_HARD option is set for the match.
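
       In an application such as the keystroke checker described at the
       start of this document, the three outcomes shown above (complete
       match, partial match, no match) would typically drive three
       different actions. The following is a minimal sketch under stated
       assumptions: the input_verdict type and classify_result() are
       hypothetical application code, and the two numeric constants are
       local copies of the values that pcre.h gives to PCRE_ERROR_NOMATCH
       and PCRE_ERROR_PARTIAL, used here only so that the fragment
       compiles on its own.

```c
/* Local copies of the values of PCRE_ERROR_NOMATCH and
   PCRE_ERROR_PARTIAL from pcre.h. Real code should include <pcre.h>
   and use the symbolic names directly. */
enum { RC_NOMATCH = -1, RC_PARTIAL = -12 };

/* Hypothetical application-level verdicts (not part of PCRE). */
typedef enum {
  INPUT_COMPLETE,    /* the whole pattern matched */
  INPUT_INCOMPLETE,  /* valid so far; wait for more keystrokes */
  INPUT_REJECTED,    /* cannot match; for example, beep and drop the key */
  INPUT_FAILED       /* some other error from the matching function */
} input_verdict;

/* Map the return code of pcre_exec() or pcre_dfa_exec() to a verdict. */
input_verdict classify_result(int rc)
{
  /* Zero means the offsets vector was too small, but it is still a match. */
  if (rc >= 0) return INPUT_COMPLETE;
  if (rc == RC_PARTIAL) return INPUT_INCOMPLETE;
  if (rc == RC_NOMATCH) return INPUT_REJECTED;
  return INPUT_FAILED;
}
```

       For the run above, "25jun04" would be classified as complete,
       "25dec3" and "3ju" as incomplete, and "3juj" and "j" as rejected.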

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()


       When  a  partial  match  has  been  found  using pcre_dfa_exec(), it is
       possible to continue the match by providing additional subject data and
       calling   pcre_dfa_exec()   again   with   the  same  compiled  regular
       expression, this time setting the  PCRE_DFA_RESTART  option.  You  must
       pass the same working space as before, because this is where details of
       the previous partial match are stored. Here is a pcretest example  that
       uses  the  \R  escape  sequence  to set the PCRE_DFA_RESTART option (\D
       specifies the use of pcre_dfa_exec()):

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       The first  call  has  "23ja"  as  the  subject,  and  requests  partial
       matching;  the  second  call has "n05" as the subject for the continued
       (restarted) match.  Notice that when the match is  complete,  only  the
       last  part  is  shown;  PCRE  does not retain the previously partially-
       matched string. It is up to the calling program to do that if it  needs
       to.

       You  can  set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
       PCRE_DFA_RESTART to continue partial matching over  multiple  segments.
       This  facility  can  be  used  to  pass  very  long  subject strings to
       pcre_dfa_exec().
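
       Because PCRE does not retain earlier segments, a caller that wants
       the full matched text has to collect it. The following is a
       minimal sketch of one way to do that; the match_accumulator type
       and its two functions are hypothetical and not part of PCRE, and
       the fixed 256-byte buffer is an arbitrary choice for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of PCRE: collects the text of a
   match that is spread over several subject segments. */
typedef struct {
  char   text[256];
  size_t length;
} match_accumulator;

void accumulator_reset(match_accumulator *acc)
{
  acc->length = 0;
  acc->text[0] = '\0';
}

/* Append segment[start .. end), as given by the first two offsets
   returned for that segment. Returns 0 on success, or -1 if the
   offsets are inconsistent or the text would not fit. */
int accumulator_append(match_accumulator *acc, const char *segment,
                       int start, int end)
{
  size_t len;

  if (start < 0 || end < start) return -1;
  len = (size_t)(end - start);
  if (acc->length + len + 1 > sizeof acc->text) return -1;

  memcpy(acc->text + acc->length, segment + start, len);
  acc->length += len;
  acc->text[acc->length] = '\0';
  return 0;
}
```

       In the example above, appending "23ja" (offsets 0 and 4) after the
       partial match and "n05" (offsets 0 and 3) after the restarted
       match leaves "23jan05" in the accumulator.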

MULTI-SEGMENT MATCHING WITH pcre_exec()


       From release 8.00, pcre_exec() can also be  used  to  do  multi-segment
       matching.  Unlike  pcre_dfa_exec(),  it  is not possible to restart the
       previous match with a new segment of data. Instead, new  data  must  be
       added  to  the  previous  subject  string, and the entire match re-run,
       starting from the point where the partial match occurred. Earlier  data
       can be discarded.  Consider an unanchored pattern that matches dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\P
         Partial match: 23ja

       At  this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call pcre_exec()  again.  Unlike
       pcre_dfa_exec(),  the  entire matching string must always be available,
       and the complete matching process occurs for each call, so more  memory
       and more processing time are needed.

       Note:  If  the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for  a  partial  match  will
       include  characters  that  precede the partially matched string itself,
       because these must be retained when adding on  more  characters  for  a
       subsequent matching attempt.
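
       The buffer handling between two calls to pcre_exec() might look
       like the following. This is a minimal sketch, not part of PCRE:
       the function name carry_partial() is hypothetical, and keep_from
       is assumed to be the first offset returned with
       PCRE_ERROR_PARTIAL, so that any lookbehind text is kept as
       described in the note above.

```c
#include <stddef.h>
#include <string.h>

/* Keep buf[keep_from .. len), discard everything before it, then
   append the next segment. cap is the total size of buf, including
   room for a terminating NUL. Returns the new length, or (size_t)-1
   if keep_from is out of range or the result would not fit.
   Hypothetical helper for illustration only. */
size_t carry_partial(char *buf, size_t len, size_t cap, size_t keep_from,
                     const char *segment, size_t seglen)
{
  size_t kept;

  if (keep_from > len) return (size_t)-1;
  kept = len - keep_from;
  if (kept + seglen + 1 > cap) return (size_t)-1;

  memmove(buf, buf + keep_from, kept);   /* regions may overlap */
  memcpy(buf + kept, segment, seglen);
  buf[kept + seglen] = '\0';
  return kept + seglen;
}
```

       For the subject "The date is 23ja", the partial match starts at
       offset 12. Calling carry_partial() with that offset and a next
       segment of "n05 today" leaves "23jan05 today" in the buffer, which
       can then be passed to pcre_exec() from offset 0.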

ISSUES WITH MULTI-SEGMENT MATCHING


       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains tests for the beginning or end  of  a  line,
       you   need   to   pass  the  PCRE_NOTBOL  or  PCRE_NOTEOL  options,  as
       appropriate, when the subject string for any call does not contain  the
       beginning or end of a line.

       2.  Lookbehind  assertions at the start of a pattern are catered for in
       the offsets that are returned for a partial match. However, in  theory,
       a  lookbehind assertion later in the pattern could require even earlier
       characters to be inspected, and it might not have been reached  when  a
       partial  match occurs. This is probably an extremely unlikely case; you
       could guard against it to a certain extent by  always  including  extra
       characters at the start.

       3.  Matching  a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one  single
       long  string,  especially  when  PCRE_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes  an  issue  that
       arises  if  the  pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple  matching  possibilities,  because  a
       partial match result is given only when there are no completed matches.
       This means  that  as  soon  as  the  shortest  match  has  been  found,
       continuation  to a new subject segment is no longer possible.  Consider
       again this pcretest example:

           re> /dog(sbody)?/
         data> dogsb\P
          0: dog
         data> do\P\D
         Partial match: do
         data> gsb\R\P\D
          0: g
         data> dogsbody\D
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to  pcre_exec(),  setting
       the  PCRE_PARTIAL_SOFT  option.  Although the string is a partial match
       for "dogsbody", the  result  is  not  PCRE_ERROR_PARTIAL,  because  the
       shorter  string  "dog" is a complete match. Similarly, when the subject
       is presented to pcre_dfa_exec() in several parts ("do" and "gsb"  being
       the first two) the match stops when "dog" has been found, and it is not
       possible to continue. On the other hand, if "dogsbody" is presented  as
       a single string, pcre_dfa_exec() finds both matches.

       Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
       when matching  multi-segment  data.  The  example  above  then  behaves
       differently:

           re> /dog(sbody)?/
         data> dogsb\P\P
         Partial match: dogsb
         data> do\P\D
         Partial match: do
         data> gsb\R\P\P\D
         Partial match: gsb

       4. Patterns that contain alternatives at the top level which do not all
       start with the  same  pattern  item  may  not  work  as  expected  when
       PCRE_DFA_RESTART  is  used  with pcre_dfa_exec(). For example, consider
       this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial  match  of  the
       first  alternative  is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point  in  the  subject  string. Attempting to continue with the string
       "7890" does not yield a match  because  only  those  alternatives  that
       match  at  one  point in the subject are remembered. The problem arises
       because the start of the second alternative matches  within  the  first
       alternative.  There  is  no  problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives.  This  is
       not  a  problem if pcre_exec() is used, because the entire match has to
       be rerun each time:

           re> /1234|3789/
         data> ABC123\P
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE_DFA_RESTART,  the  same  technique  of
       re-running  the  entire match can also be used with pcre_dfa_exec().
       Another
       possibility is to work with two buffers. If a partial match at offset n
       in  the first buffer is followed by "no match" when PCRE_DFA_RESTART is
       used on the second buffer, you can then try a  new  match  starting  at
       offset n+1 in the first buffer.

AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION


       Last updated: 19 October 2009
       Copyright (c) 1997-2009 University of Cambridge.