PCRE - Perl-compatible regular expressions

NAME

       PCRE - Perl-compatible regular expressions

PCRE BUILD-TIME OPTIONS


       This  document  describes  the  optional  features  of PCRE that can be
       selected when the library is compiled. It assumes use of the  configure
       script,  where  the  optional  features  are  selected or deselected by
       providing  options  to  configure  before  running  the  make  command.
       However,  the  same  options can be selected in both Unix-like and non-
       Unix-like environments using the GUI facility of cmake-gui if  you  are
       using CMake instead of configure to build PCRE.

       There  is  a  lot more information about building PCRE in non-Unix-like
       environments in the file called NON_UNIX_USE, which is part of the PCRE
       distribution.  You  should consult this file as well as the README file
       if you are building in a non-Unix-like environment.

       The complete list of options for configure (which includes the standard
       ones  such  as  the  selection  of  the  installation directory) can be
       obtained by running

         ./configure --help

       The following sections include  descriptions  of  options  whose  names
       begin with --enable or --disable. These settings specify changes to the
       defaults for the configure command. Because of the way  that  configure
       works,   --enable   and   --disable   always  come  in  pairs,  so  the
       complementary option always exists as well, but  as  it  specifies  the
       default, it is not described.

C++ SUPPORT


       By default, the configure script will search for a C++ compiler and C++
       header files. If it finds them, it automatically builds the C++ wrapper
       library for PCRE. You can disable this by adding

         --disable-cpp

       to the configure command.

UTF-8 SUPPORT


       To build PCRE with support for UTF-8 Unicode character strings, add

         --enable-utf8

       to the configure command. By itself, this does not make PCRE treat
       strings as UTF-8. As well as compiling PCRE with this option, you also
       have to set the PCRE_UTF8 option when you call the pcre_compile() or
       pcre_compile2() functions.

       If you set --enable-utf8 when compiling in an EBCDIC environment,  PCRE
       expects its input to be either ASCII or UTF-8 (depending on the runtime
       option). It is not possible to support both EBCDIC and UTF-8  codes  in
       the  same  version  of  the  library.  Consequently,  --enable-utf8 and
       --enable-ebcdic are mutually exclusive.

UNICODE CHARACTER PROPERTY SUPPORT


       UTF-8 support allows PCRE to process character values greater than  255
       in  the  strings  that  it  handles.  On  its own, however, it does not
       provide any facilities for accessing the properties of such characters.
       If you want to be able to use the pattern escapes \P, \p, and \X, which
       refer to Unicode character properties, you must add

         --enable-unicode-properties

       to the configure command. This implies UTF-8 support, even if you  have
       not explicitly requested it.

       Including  Unicode  property  support  adds around 30K of tables to the
       PCRE library. Only the general category properties such as  Lu  and  Nd
       are supported. Details are given in the pcrepattern documentation.

CODE VALUE OF NEWLINE


       By  default,  PCRE interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline  character  on  Unix-like
       systems.  You  can compile PCRE to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the  configure  command.  There  is  also  a  --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two character sequence CRLF. If you want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which causes PCRE to recognize any of the three sequences  CR,  LF,  or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE to recognize any Unicode newline sequence.

       Whatever line ending convention is selected when PCRE is built  can  be
       overridden  when  the library functions are called. At build time it is
       conventional to use the standard for your operating system.
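
       The five conventions can be compared with a small model. The sketch
       below is not PCRE code: it uses Python's re module, and the names
       NEWLINE_PATTERNS and split_lines are invented for illustration. It
       assumes that "any Unicode newline sequence" means CR, LF, CRLF, VT,
       FF, NEL, LS, and PS.

```python
import re

# Line-ending patterns for each build-time convention (illustrative model only).
NEWLINE_PATTERNS = {
    "cr": r"\r",
    "lf": r"\n",
    "crlf": r"\r\n",
    "anycrlf": r"\r\n|\r|\n",
    # CRLF first, then the single characters LF, VT, FF, CR, NEL, LS, PS.
    "any": "\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]",
}


def split_lines(text, convention):
    """Split text into lines under the given newline convention."""
    return re.split(NEWLINE_PATTERNS[convention], text)


sample = "one\r\ntwo\nthree\rfour"
for name in NEWLINE_PATTERNS:
    print(name, split_lines(sample, name))
```

       Only the anycrlf and any conventions break the sample into four
       lines; each of the single-sequence conventions leaves some of the
       other line endings embedded in the text.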

WHAT \R MATCHES


       By default, the sequence \R in a pattern matches  any  Unicode  newline
       sequence,  whatever  has  been selected as the line ending sequence. If
       you specify

         --enable-bsr-anycrlf

       the default is changed so  that  \R  matches  only  CR,  LF,  or  CRLF.
       Whatever  is  selected  when  PCRE  is built can be overridden when the
       library functions are called.
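
       The difference between the two defaults can be illustrated as
       follows. This is a sketch only: Python's re module has no \R escape,
       so explicit alternations stand in for it, and the names BSR_UNICODE,
       BSR_ANYCRLF, and r_matches are hypothetical.

```python
import re

# Stand-ins for what \R accepts under each build-time default.
BSR_UNICODE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")
BSR_ANYCRLF = re.compile(r"\r\n|\r|\n")


def r_matches(sequence, anycrlf=False):
    """Return True if the whole sequence is something \\R would match."""
    pattern = BSR_ANYCRLF if anycrlf else BSR_UNICODE
    return pattern.fullmatch(sequence) is not None


print(r_matches("\r\n"))                  # CRLF, default setting
print(r_matches("\u2028"))                # line separator, default setting
print(r_matches("\u2028", anycrlf=True))  # line separator, anycrlf setting
```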

BUILDING SHARED AND STATIC LIBRARIES


       The PCRE building process uses libtool to build both shared and  static
       Unix  libraries by default. You can suppress one of these by adding one
       of

         --disable-shared
         --disable-static

       to the configure command, as required.

POSIX MALLOC USAGE


       When PCRE is called through the  POSIX  interface  (see  the  pcreposix
       documentation),  additional working storage is required for holding the
       pointers to capturing substrings, because PCRE requires three  integers
       per  substring,  whereas  the POSIX interface provides only two. If the
       number of expected substrings is small, the wrapper function uses space
       on the stack, because this is faster than using malloc() for each call.
       The default threshold above which the stack is no longer used is 10; it
       can be changed by adding a setting such as

         --with-posix-malloc-threshold=20

       to the configure command.
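
       The decision described above can be modelled in a few lines. The
       sketch assumes that the threshold is compared with the number of
       substrings the caller asks for; the function name and the constant
       name are chosen for illustration, and the real wrapper is written
       in C.

```python
POSIX_MALLOC_THRESHOLD = 10  # default value stated in the text above


def plan_offset_storage(nmatch, threshold=POSIX_MALLOC_THRESHOLD):
    """Return (where, number_of_ints) for a call expecting nmatch substrings."""
    ints_needed = 3 * nmatch  # PCRE needs three integers per substring
    # At or below the threshold the stack is used; above it, the heap.
    where = "stack" if nmatch <= threshold else "heap"
    return where, ints_needed


for n in (2, 10, 11):
    print(n, plan_offset_storage(n))
```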

HANDLING VERY LARGE PATTERNS


       Within  a  compiled  pattern,  offset values are used to point from one
       part to another  (for  example,  from  an  opening  parenthesis  to  an
       alternation  metacharacter).  By  default, two-byte values are used for
       these offsets, leading to a maximum size  for  a  compiled  pattern  of
       around 64K. This is sufficient to handle all but the most gigantic
       patterns. Nevertheless, some people do want to process truly enormous
       patterns, so it is possible to compile PCRE to use three-byte or
       four-byte offsets by adding a setting such as

         --with-link-size=3

       to the configure command. The value given must be 2,  3,  or  4.  Using
       longer  offsets slows down the operation of PCRE because it has to load
       additional bytes when handling them.
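
       The arithmetic behind the 64K figure is simply the number of values
       that an offset of the given width can hold. The sketch below computes
       that bound for each permitted link size; the function name is
       hypothetical, and the library may enforce a lower practical limit
       than the raw bound.

```python
def offset_range(link_size):
    """Number of distinct values that an offset of link_size bytes can hold."""
    if link_size not in (2, 3, 4):
        raise ValueError("link size must be 2, 3, or 4")
    return 256 ** link_size


for size in (2, 3, 4):
    print(size, offset_range(size))
```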

AVOIDING EXCESSIVE STACK USAGE


       When  matching  with  the   pcre_exec()   function,   PCRE   implements
       backtracking  by  making recursive calls to an internal function called
       match(). In environments where the size of the stack is  limited,  this
       can  severely  limit  PCRE’s  operation. (The Unix environment does not
       usually suffer from this problem, but it may sometimes be necessary  to
       increase  the  maximum  stack  size.   There  is  a  discussion  in the
       pcrestack documentation.) An alternative  approach  to  recursion  that
       uses  memory from the heap to remember data, instead of using recursive
       function calls, has been implemented  to  work  round  the  problem  of
       limited  stack  size. If you want to build a version of PCRE that works
       this way, add

         --disable-stack-for-recursion

       to the configure command. With this configuration, PCRE  will  use  the
       pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory
       management functions. By default these point to  malloc()  and  free(),
       but  you  can  replace the pointers so that your own functions are used
       instead.

       Separate functions are  provided  rather  than  using  pcre_malloc  and
       pcre_free  because  the  usage  is  very  predictable:  the block sizes
       requested are always the same, and  the  blocks  are  always  freed  in
       reverse  order.  A calling program might be able to implement optimized
       functions that perform better  than  malloc()  and  free().  PCRE  runs
       noticeably more slowly when built in this way. This option affects only
       the pcre_exec() function; it is not relevant for pcre_dfa_exec().
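
       The predictable usage pattern is what makes a specialized allocator
       possible. The following sketch models one such strategy: blocks of
       a single size are kept and reused, and releases are required to
       arrive in reverse order. The class FramePool and its methods are
       hypothetical; a real replacement would be a pair of C functions
       assigned to pcre_stack_malloc and pcre_stack_free.

```python
class FramePool:
    """Reuses fixed-size blocks that are always released in reverse order."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._blocks = []   # every block ever created, in allocation order
        self._in_use = 0    # number of blocks currently handed out

    def alloc(self):
        # Create a new block only when all existing ones are handed out.
        if self._in_use == len(self._blocks):
            self._blocks.append(bytearray(self.block_size))
        block = self._blocks[self._in_use]
        self._in_use += 1
        return block

    def free(self, block):
        # Reverse-order release: the block must be the most recent one.
        if self._in_use == 0 or block is not self._blocks[self._in_use - 1]:
            raise ValueError("blocks must be freed in reverse order")
        self._in_use -= 1

    @property
    def created(self):
        return len(self._blocks)


pool = FramePool(64)
first = pool.alloc()
second = pool.alloc()
pool.free(second)
pool.free(first)
again = pool.alloc()           # reuses the first block instead of creating one
print(again is first, pool.created)
```

       Because no block is ever returned to the system between calls, the
       cost of each request after the first few is a list lookup rather
       than a general-purpose allocation.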

LIMITING PCRE RESOURCE USAGE


       Internally,  PCRE  has  a  function  called  match(),  which  it  calls
       repeatedly  (sometimes  recursively)  when  matching a pattern with the
       pcre_exec() function. By controlling the maximum number of  times  this
       function  may be called during a single matching operation, a limit can
       be placed on the resources used by a single call  to  pcre_exec().  The
       limit  can  be  changed  at  run  time,  as  described  in  the pcreapi
       documentation. The default is 10 million, but this can  be  changed  by
       adding a setting such as

         --with-match-limit=500000

       to   the   configure  command.  This  setting  has  no  effect  on  the
       pcre_dfa_exec() matching function.

       In some environments it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to restrict the maximum amount of stack (or heap, if
       --disable-stack-for-recursion is specified) that is used. A second limit
       controls this; it defaults to the value that is set for
       --with-match-limit, which imposes no additional constraints. However,
       you can set a lower limit by adding, for example,

         --with-match-limit-recursion=10000

       to the configure command. This value can  also  be  overridden  at  run
       time.
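
       The two limits can be illustrated with a toy backtracking matcher.
       This is a sketch, not PCRE's match() function: it handles only
       literals, ".", and a postfix "*", and the names limited_match and
       MatchLimitError are invented for the example. One counter tracks
       the total number of calls and a depth argument tracks the nesting.

```python
class MatchLimitError(Exception):
    """Raised when either limit is exceeded (hypothetical name)."""


def limited_match(pattern, text, match_limit=10_000_000, recursion_limit=None):
    """Anchored match of a toy pattern; returns (matched, total_calls)."""
    if recursion_limit is None:
        recursion_limit = match_limit  # same default relationship as above
    calls = 0

    def match(p, t, depth):
        nonlocal calls
        calls += 1
        if calls > match_limit:
            raise MatchLimitError("match limit exceeded")
        if depth > recursion_limit:
            raise MatchLimitError("recursion limit exceeded")
        if p == len(pattern):
            return t == len(text)
        first = t < len(text) and pattern[p] in (text[t], ".")
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            # Consume one more character, or give up on this 'x*' item.
            return (first and match(p, t + 1, depth + 1)) or match(p + 2, t, depth + 1)
        return first and match(p + 1, t + 1, depth + 1)

    return match(0, 0, 1), calls


print(limited_match("a*b", "aaab"))
try:
    limited_match("a*a*a*a*b", "a" * 16, match_limit=1000)
except MatchLimitError as err:
    print("stopped:", err)
```

       The second call shows why the total-call limit matters: a short
       pattern with several nested choices can require thousands of calls
       on a subject that cannot match, even though the recursion never
       becomes deep.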

CREATING CHARACTER TABLES AT BUILD TIME


       PCRE  uses fixed tables for processing characters whose code values are
       less than 256. By default, PCRE is built with a set of tables that  are
       distributed  in  the  file pcre_chartables.c.dist. These tables are for
       ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are  no  longer  used.
       Instead,  a  program  called dftables is compiled and run. This outputs
       the source for a new set of tables, created in the default locale of your
       C runtime system. (This method of replacing the tables does not work if
       you are cross compiling, because dftables is run on the local host.  If
       you  need  to  create alternative tables when cross compiling, you will
       have to do so "by hand".)
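
       To give an idea of what such tables look like, the sketch below
       builds two 256-entry tables for ASCII: one that maps each code to
       its lower-case form and one that flips case. It is an assumption of
       this example that tables of this kind are among those generated;
       the function name is hypothetical, and the real dftables program is
       written in C and consults the C locale rather than hard-coding
       ASCII.

```python
def build_ascii_case_tables():
    """Return (lower, flip): two 256-entry tables for ASCII case handling."""
    lower = bytearray(range(256))  # start with every code mapping to itself
    flip = bytearray(range(256))
    for code in range(ord("A"), ord("Z") + 1):
        lower[code] = code + 32    # 'A' maps to 'a'
        flip[code] = code + 32     # upper case flips to lower case
        flip[code + 32] = code     # lower case flips to upper case
    return bytes(lower), bytes(flip)


lower, flip = build_ascii_case_tables()
print(chr(lower[ord("Q")]), chr(flip[ord("q")]))
```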

USING EBCDIC CODE


       PCRE assumes by default that it will run in an  environment  where  the
       character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
       This is the  case  for  most  computer  operating  systems.  PCRE  can,
       however, be compiled to run in an EBCDIC environment by adding

         --enable-ebcdic

       to the configure command. This setting implies
       --enable-rebuild-chartables. You should only use it if you know that you
       are in an EBCDIC environment (for example, an IBM mainframe operating
       system). The --enable-ebcdic option is incompatible with --enable-utf8.
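
       The relationships between options stated in this document (Unicode
       property support implies UTF-8 support, EBCDIC implies rebuilding
       the character tables, and UTF-8 and EBCDIC exclude each other) can
       be restated as a small check. This is a sketch with a hypothetical
       function name; it is not how the configure script itself is
       implemented.

```python
def resolve_options(requested):
    """Add implied options and reject the UTF-8/EBCDIC combination."""
    options = set(requested)
    if "--enable-unicode-properties" in options:
        options.add("--enable-utf8")
    if "--enable-ebcdic" in options:
        options.add("--enable-rebuild-chartables")
    if {"--enable-utf8", "--enable-ebcdic"} <= options:
        raise ValueError("--enable-utf8 and --enable-ebcdic are mutually exclusive")
    return sorted(options)


print(resolve_options(["--enable-unicode-properties"]))
print(resolve_options(["--enable-ebcdic"]))
```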

PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT


       By default, pcregrep reads all files as plain text. You can build it so
       that it recognizes files whose names end in .gz or .bz2, and reads them
       with libz or libbz2, respectively, by adding one or both of

         --enable-pcregrep-libz
         --enable-pcregrep-libbz2

       to the configure command. These  options  naturally  require  that  the
       relevant  libraries  are  installed  on your system. Configuration will
       fail if they are not.
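
       The suffix-based selection can be sketched as follows. The example
       uses Python's gzip and bz2 modules as stand-ins for libz and libbz2,
       and the function name open_by_suffix is hypothetical; pcregrep's own
       implementation is in C.

```python
import bz2
import gzip


def open_by_suffix(path):
    """Open path for text reading, choosing a decompressor from its suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")  # anything else is read as plain text
```

       A caller can then read lines from the returned object in the same
       way whether or not the file is compressed.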

PCRETEST OPTION FOR LIBREADLINE SUPPORT


       If you add

         --enable-pcretest-libreadline

       to the configure command,  pcretest  is  linked  with  the  libreadline
       library,  and  when its input is from a terminal, it reads it using the
       readline() function. This provides line-editing and history facilities.
       Note that libreadline is GPL-licensed, so if you distribute a binary of
       pcretest linked in this way, there may be licensing issues.

       Setting this option causes the -lreadline option to  be  added  to  the
       pcretest build. In many operating environments with a system-installed
       libreadline this is sufficient. However, in some environments (e.g.  if
       an  unmodified  distribution version of readline is in use), some extra
       configuration may be necessary. The INSTALL file for  libreadline  says
       this:

         "Readline uses the termcap functions, but does not link with the
         termcap or curses library itself, allowing applications which link
         with readline the to choose an appropriate library."

       If  your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.

AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION


       Last updated: 29 September 2009
       Copyright (c) 1997-2009 University of Cambridge.
