description: POSIX BREs & EREs via enhanced PCRE's pcreposix library
repository URL: https://github.com/mackyle/pcreposix-compat.git
owner: mackyle@gmail.com
last change: Thu, 15 Oct 2020 08:34:58 +0000 (15 01:34 -0700)
last refresh: Fri, 17 May 2024 15:57:51 +0000 (17 17:57 +0200)
README.md

PCRE POSIX Compat

What Is It?

POSIX compatibility updates for the "pcreposix" functionality provided with the PCRE library so that "pcreposix.h" and the corresponding library may be used as a standard POSIX "regex.h" substitute.

In addition, pcre_jit_exec stubs are included when building with --disable-jit, and when building with --enable-jit, pcre_jit_exec automatically falls back to pcre_exec when no JIT compilation is available. This lets the library work seamlessly whether or not JIT is in use (perhaps because JIT cannot handle the specific pattern, or because the platform in question does not allow creation of writable plus executable memory for the JIT compilation), even when callers are not using the JIT API entirely correctly.

See the "How to Use" section at the bottom if you're having a tl;dr moment.

Wherefore Art Thou?

Some poorly implemented regex engines go bananas over this:

w++++++++++++++++++++++

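Technically that's not a valid POSIX ERE, because POSIX leaves the result of multiple adjacent duplication symbols undefined (and using + at all requires an ERE, not a BRE).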

But this is:

(((((((((((((((((((((w+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+

You need only compile it with one of those engines like so:

grep -E '(((((((((((((((((((((w+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+)+'

to watch the bananas hit the fan.

PCRE does not suffer bananas easily and, with a few relatively simple enhancements, can be used as a substitute for the POSIX "regex.h" regcomp and regexec functions to avoid going bananas over nothing.

Good Timing Results for pcregrep

The PCRE distribution contains a pcregrep utility that works very much like grep except that all patterns are PCRE patterns (as long as -F is not used).

Here are some timing results for the problem regex shown above when passed to pcregrep. The "count" value along the "x" axis represents how many "+" quantifiers are being used in the pattern.
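The harness that produced the charts is not shown here. As a rough sketch only, the pattern for a given "count" could be generated and timed in-process through the regcomp and regexec interface as below. The function names build_nested_pattern and time_nested_pattern are hypothetical, and this measures a single compile plus match rather than a whole pcregrep or grep run, so the numbers are not directly comparable to the charts.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Build the nested pattern containing `count` "+" quantifiers:
 * count 1 -> "w+", count 2 -> "(w+)+", count 3 -> "((w+)+)+", ...
 * Returns a malloc'd string (caller frees) or NULL on failure. */
static char *build_nested_pattern(int count)
{
    if (count < 1)
        return NULL;
    size_t opens = (size_t)count - 1;   /* one "(" per extra level */
    size_t len = opens + 2 + 2 * opens; /* "(" * n, then "w+", then ")+" * n */
    char *pat = malloc(len + 1);
    if (pat == NULL)
        return NULL;
    char *p = pat;
    memset(p, '(', opens);
    p += opens;
    *p++ = 'w';
    *p++ = '+';
    for (size_t i = 0; i < opens; i++) {
        *p++ = ')';
        *p++ = '+';
    }
    *p = '\0';
    return pat;
}

/* Compile the pattern as an ERE, run it once against `subject`, and return
 * the elapsed CPU seconds, or -1.0 if the pattern could not be built or
 * compiled. */
static double time_nested_pattern(int count, const char *subject)
{
    assert(subject != NULL);
    char *pat = build_nested_pattern(count);
    if (pat == NULL)
        return -1.0;
    regex_t re;
    clock_t start = clock();
    int rc = regcomp(&re, pat, REG_EXTENDED | REG_NOSUB);
    if (rc == 0) {
        (void)regexec(&re, subject, 0, NULL, 0);
        regfree(&re);
    }
    clock_t stop = clock();
    free(pat);
    return rc == 0 ? (double)(stop - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_nested_pattern for counts 1 through 30 and printing the results would give one column per count. Keep the count small when trying this against an engine that is known to misbehave.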

Some of these are running in VMs which is why these "good" timings are in the tens of milliseconds rather than much closer to 0.

These timing results confirm that using PCRE does not go bananas on these kinds of patterns. These are all "good" results.

Testing "pcregrep" on Linux 3.16.0-4-amd64 x86_64
|0.030| *     *
|     | =     =
|0.027| =     =
|     | =     =
|0.024| =     =
|     | =     =
|0.021| =     =
|     | =     =               *     *       *
|0.018| =     =               =     =       =
|     | =     =               =     =       =
|0.015| =     =               =     =       =
|     | =     =               =     =       =
|0.012| =     =               =     =       =
|     | =   * = * * * *   * * = * * = * * * = *   * * * *   * * * *
|0.009| =   = = = = = =   = = = = = = = = = = =   = = = =   = = = =
|     | =   = = = = = =   = = = = = = = = = = =   = = = =   = = = =
|0.006| =   = = = = = =   = = = = = = = = = = =   = = = =   = = = =
|     | =   = = = = = =   = = = = = = = = = = =   = = = =   = = = =
|0.003| =   = = = = = =   = = = = = = = = = = =   = = = =   = = = =
|     | =   = = = = = =   = = = = = = = = = = =   = = = =   = = = =
|0.000| = * = = = = = = * = = = = = = = = = = = * = = = = * = = = =
| sec +------------------------------------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
|:::::|                   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Testing "pcregrep" on FreeBSD 10.3-RELEASE-p4 amd64
|0.031|       *
|     |       =
|0.029|       =
|     |       =
|0.027|       =
|     |       =
|0.024|       =
|     |       =   *   *         *             *     *           *
|0.022|       =   =   =         =             =     =           =
|     |       =   =   =         =             =     =           =
|0.020|       =   =   =         =             =     =           =
|     |       =   =   =         =             =     =           =
|0.017|       =   =   =         =             =     =           =
|     | * * * = * = * = * * *   =   * * * * * = *   = * * * * * = *
|0.015| = = = = = = = = = = =   =   = = = = = = =   = = = = = = = =
|     | = = = = = = = = = = =   =   = = = = = = =   = = = = = = = =
|0.013| = = = = = = = = = = =   =   = = = = = = =   = = = = = = = =
|     | = = = = = = = = = = =   =   = = = = = = =   = = = = = = = =
|0.010| = = = = = = = = = = =   =   = = = = = = =   = = = = = = = =
|     | = = = = = = = = = = =   =   = = = = = = =   = = = = = = = =
|0.008| = = = = = = = = = = = * = * = = = = = = = * = = = = = = = =
| sec +------------------------------------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
|:::::|                   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Testing "pcregrep" on Darwin 9.8.0 i386
|0.010|         *           *           *         *   *           *
|     |         =           =           =         =   =           =
|0.009|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.008|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.007|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.006|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.005|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.004|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.003|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.002|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.001|         =           =           =         =   =           =
|     |         =           =           =         =   =           =
|0.000| * * * * = * * * * * = * * * * * = * * * * = * = * * * * * =
| sec +------------------------------------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
|:::::|                   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Good and Bad grep Timing Results

There's only one "good" result here, but it should be obvious which one it is. That there's only one "good" result here instead of two will probably come as somewhat of a surprise to some folks, especially once the next section's results are viewed as well.

Testing "grep -E" on Linux 3.16.0-4-amd64 x86_64
|2.370|                               *
|     |                               =
|2.133|                               =
|     |                               =
|1.896|                               =
|     |                               =
|1.659|                               =
|     |                               =
|1.422|                               =
|     |                               =
|1.185|                             * =
|     |                             = =
|0.948|                             = =
|     |                             = =
|0.711|                             = =
|     |                             = =
|0.474|                           * = =
|     |                           = = =
|0.237|                         * = = =
|     |                     * * = = = =
|0.000| * * * * * * * * * * = = = = = =
| sec +--------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1
|:::::|                   0 1 2 3 4 5 6

Testing "grep -E" on FreeBSD 10.3-RELEASE-p4 amd64
|8.820|                               *
|     |                               =
|7.939|                               =
|     |                               =
|7.058|                               =
|     |                               =
|6.177|                               =
|     |                               =
|5.295|                               =
|     |                               =
|4.414|                               =
|     |                               =
|3.533|                               =
|     |                               =
|2.652|                               =
|     |                               =
|1.770|                               =
|     |                             * =
|0.889|                           * = =
|     |                         * = = =
|0.008| * * * * * * * * * * * * = = = =
| sec +--------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1
|:::::|                   0 1 2 3 4 5 6

Testing "grep -E" on Darwin 9.8.0 i386
|0.020|                                       *
|     |                                       =
|0.018|                                       =
|     |                                       =
|0.016|                                       =
|     |                                       =
|0.014|                                       =
|     |                                       =
|0.012|                                       =
|     |                                       =
|0.010|           *             *             =             *
|     |           =             =             =             =
|0.008|           =             =             =             =
|     |           =             =             =             =
|0.006|           =             =             =             =
|     |           =             =             =             =
|0.004|           =             =             =             =
|     |           =             =             =             =
|0.002|           =             =             =             =
|     |           =             =             =             =
|0.000| * * * * * = * * * * * * = * * * * * * = * * * * * * = * * *
| sec +------------------------------------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
|:::::|                   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Good and Bad git Timing Results

There's only one "good" result here, but it should be obvious which one it is (and, no, it's not the same one as in the previous set). Again, it's a bit surprising there's only one "good" result here.

Testing "git log --grep" on Linux 3.16.0-4-amd64 x86_64
|2.330|                               *
|     |                               =
|2.097|                               =
|     |                               =
|1.864|                               =
|     |                               =
|1.631|                               =
|     |                               =
|1.398|                               =
|     |                               =
|1.165|                             * =
|     |                             = =
|0.932|                             = =
|     |                             = =
|0.699|                             = =
|     |                           * = =
|0.466|                           = = =
|     |                           = = =
|0.233|                         * = = =
|     |                       * = = = =
|0.000| * * * * * * * * * * * = = = = =
| sec +--------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1
|:::::|                   0 1 2 3 4 5 6

Testing "git log --grep" on FreeBSD 10.3-RELEASE-p4 amd64
|0.047|                                           *
|     |                                           =
|0.044|                                           =
|     |                                           =
|0.041|                                           =
|     |                                           =               *
|0.038|                                           =               =
|     |                                           =               =
|0.034|                                           =               =
|     |                                           =               =
|0.031|     *   * *   * * *   *   *     *         = * *     * *   =
|     |     =   = =   = = =   =   =     =         = = =     = =   =
|0.028|     =   = =   = = =   =   =     =         = = =     = =   =
|     |     =   = =   = = =   =   =     =         = = =     = =   =
|0.025|     =   = =   = = =   =   =     =         = = =     = =   =
|     | * * = * = =   = = = * = * =   * = * * * * = = = * * = = * =
|0.022| = = = = = =   = = = = = = =   = = = = = = = = = = = = = = =
|     | = = = = = =   = = = = = = =   = = = = = = = = = = = = = = =
|0.019| = = = = = =   = = = = = = =   = = = = = = = = = = = = = = =
|     | = = = = = =   = = = = = = =   = = = = = = = = = = = = = = =
|0.016| = = = = = = * = = = = = = = * = = = = = = = = = = = = = = =
| sec +------------------------------------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
|:::::|                   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Testing "git log --grep" on Darwin 9.8.0 i386
|3.150|                                 *
|     |                                 =
|2.835|                                 =
|     |                                 =
|2.520|                                 =
|     |                                 =
|2.205|                                 =
|     |                                 =
|1.890|                                 =
|     |                                 =
|1.575|                                 =
|     |                               * =
|1.260|                               = =
|     |                               = =
|0.945|                               = =
|     |                               = =
|0.630|                             * = =
|     |                             = = =
|0.315|                           * = = =
|     |                         * = = = =
|0.000| * * * * * * * * * * * * = = = = =
| sec +----------------------------------
|:::::| 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1
|:::::|                   0 1 2 3 4 5 6 7

PCRE POSIX Means What?

With a name like "pcreposix" it could be taken to mean either of these:

  1. PCRE used as an implementation backend to provide POSIX semantics
  2. PCRE semantics provided via a POSIX-similar API

If you guessed (1), you're wrong. Historically, "pcreposix" has always meant (2). That can be a bit confusing: the name contains "posix", and POSIX regcomp and regexec provide Basic Regular Expression (BRE) and Extended Regular Expression (ERE) support, yet the regcomp and regexec functions declared in the "pcreposix.h" header provided neither.

What it did provide was Perl Compatible Regular Expression (PCRE) support via that header.

PCRE POSIX Compat

These compatibility updates for "pcreposix" update it to provide both (1) and (2) by introducing full support for POSIX BREs and EREs and providing a new REG_PCRE option for the regcomp function to activate PCRE support.

As a bonus, a REG_JAVASCPT option is also provided to activate PCRE's JavaScript Regular Expression (JRE) support.

All in one place with one API you get your BREs, EREs, PCREs and JREs.
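As a small sketch of selecting a mode through the one API, the function below asks for PCRE syntax when the REG_PCRE option is visible and otherwise falls back to an equivalent ERE. It assumes REG_PCRE is a preprocessor macro in the enhanced header (so that #ifdef can see it); contains_digits is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` contains a run of digits, 0 if it does not,
 * and -1 if the pattern fails to compile. */
static int contains_digits(const char *text)
{
    assert(text != NULL);
#ifdef REG_PCRE
    /* Enhanced pcreposix header in use: request Perl-compatible syntax.
     * (Assumes REG_PCRE is defined as a macro.) */
    const char *pattern = "\\d+";
    int cflags = REG_PCRE | REG_NOSUB;
#else
    /* Plain POSIX regex.h: use an equivalent ERE instead. */
    const char *pattern = "[0-9]+";
    int cflags = REG_EXTENDED | REG_NOSUB;
#endif
    regex_t re;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

The same source builds against either header; only the include path decides which branch is compiled.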

In addition, if the library is built with --enable-jit and the platform allows creation of writable plus executable memory for JIT compilation, the JIT pattern compiler will automatically be used by regcomp when available no matter what options are passed to the regcomp function.

Backwards Compatibility

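Not with older "pcreposix" libraries. The original version of "pcreposix.h" defines the REG_EXTENDED constant to be 0. A flag whose value is 0 cannot be distinguished from the flag being absent, so regcomp would have no way to tell a BRE request from an ERE request. This makes backwards compatibility for the "pcreposix" library impossible.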

The version for the "pcreposix" library has been changed from a previous -version-info 0:4:0 to the current -version-info 1:0:0. What this means is that both the old and new "pcreposix" libraries can coexist on the same system. The old, original one will have a name like "libpcreposix.0.<ext>" while the new one will be named like "libpcreposix.1.<ext>".

The "pcreposix.h" header has also seen substantial changes and while an attempt has been made to make sure same-named constants keep the same value between the original version of "pcreposix.h" and the new one, in some cases that's just not possible. For example, REG_EXTENDED must not be 0.

To address this issue and also facilitate use as a "regex.h" substitute, the new "pcreposix.h" header and a new "regex.h" header (it's the same header) are now installed to "$includedir/pcreposix/pcreposix.h" and "$includedir/pcreposix/regex.h" whereas the original header was installed simply to "$includedir/pcreposix.h".

So again, the old and new versions of the "pcreposix.h" header can coexist on the same system. But now, to use the new "pcreposix" as a drop-in "regex.h" substitute all that's required is adding a "-I $includedir/pcreposix" option to the compile line and "-lpcreposix -lpcre" options to the link line.

However, if "libpcreposix.<ext>" is pointing at the old "libpcreposix.0.<ext>" library, the link option will have to be "-lpcreposix.1" instead.

The version of PCRE remains unchanged because:

  1. The only client of the new options is the new pcreposix
  2. Versions of PCRE not supporting the new options will complain about them causing the new pcreposix regcomp implementation to return an error
  3. Even if non-pcreposix clients make use of the newly provided PCRE options, regexes compiled using them can be executed by any version of PCRE

Sleight of Hand

So how does PCRE now, all of a sudden, support POSIX-semantics BREs and EREs?

Magic. A sleight of hand. "Syntactic Sugar".

When the pattern is compiled in "BRE" mode, automagical virtual backslash escaping and unescaping takes place to transform the incoming BRE into an equivalent PCRE.
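To make "equivalent" concrete, here is a minimal sketch that uses only the standard regcomp and regexec calls. It does not show how the library performs the transformation internally; it only illustrates that a BRE and its unescaped counterpart describe the same match. The helper name match_span is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile `pattern` with `cflags`, match it against `text`, and store the
 * offsets of the overall match. Returns 1 on a match, 0 on no match and
 * -1 if the pattern fails to compile. */
static int match_span(const char *pattern, int cflags, const char *text,
                      int *start, int *end)
{
    assert(pattern != NULL && text != NULL && start != NULL && end != NULL);
    regex_t re;
    regmatch_t m[1];
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo;
    return 1;
}
```

Passing the BRE "\\(ab\\)\\{2\\}" with cflags 0 and the ERE "(ab){2}" with REG_EXTENDED against the same text should report the same span, since the two patterns differ only in which characters need a backslash to be special.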

To a lesser extent the same thing happens in "ERE" mode but really only in the one case ([^foo] does not match \n when one of the new options is enabled). But again, more sleight of hand (an extra \n is virtually inserted into the negated character class to get the desired semantics).
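The POSIX behavior being reproduced here can be observed from the caller's side with the standard REG_NEWLINE option. The following is a sketch under that assumption, not a view of the library internals, and re_matches is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `pattern` (compiled with `cflags`) matches anywhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile. */
static int re_matches(const char *pattern, int cflags, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With REG_EXTENDED alone, "[^a]" matches a text consisting of a single newline. With REG_EXTENDED | REG_NEWLINE it should not, because POSIX says a non-matching list never matches a newline in that mode.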

There are five new PCRE options in all. (The other three also perform more sleight of hand to provide POSIX REG_EXTENDED, BSD REG_PEND and BSD REG_NOSPEC compatibility.)

The bottom line is that saving off a compiled PCRE regular expression that used one (or more) of the new options, loading it on a system without support for the new options and "exec"ing it works and matches what it's supposed to.

The rest of the POSIX semantics compatibility is provided by having the new "pcreposix" library manipulate other PCRE options (PCRE_DOTALL, PCRE_DOLLAR_ENDONLY and PCRE_MULTILINE) based on the options supplied to the regcomp function.

Full details are in the "pcreposix.h" header, but basically, sticking to only POSIX-defined options results in POSIX semantics. Using either the new REG_PCRE or REG_JAVASCPT non-POSIX options gets you either PCREs or JREs respectively.

The "*BSD" compatibility options (REG_PEND, REG_NOSPEC and REG_STARTEND) are supported in all regex modes (but REG_NOSPEC acts like its own mode).

New pcreposix Old pcre

So what happens if a new "pcreposix" library is linked against an old "pcre" library?

It depends.

If the "pcreposix" client has set REG_PCRE and/or REG_JAVASCPT and avoids using REG_NOSPEC and REG_PEND then everything else will work.

Alternatively, if the "pcreposix" client sets REG_EXTENDED and does NOT set any of REG_NEWLINE, REG_NOSPEC or REG_PEND it will also still work.

If none of REG_EXTENDED, REG_PCRE or REG_JAVASCPT are set or either of REG_PEND or REG_NOSPEC is set it will fail at regcomp time with a REG_INVARG error.

The fact that "pcreposix" non-BRE and non-REG_NEWLINE functionality works fully and correctly with an "old" "-lpcre" library and returns an error rather than giving incorrect results for the rest is the reason that the PCRE library version remains unchanged by these compatibility updates.

An application that links against a shared "pcre" library and needs to use one or more of the options that require the enhanced "pcre" library should simply attempt to compile a simple pattern using the needed options at startup and bail out immediately if it gets a REG_INVARG result.
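A minimal sketch of such a startup check, using only the standard regcomp and regfree calls. The function name probe_regcomp_flags is hypothetical, and the trivial pattern "a" is just one reasonable choice.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Try to compile a trivial pattern with the option bits the program needs.
 * Returns 0 when they are usable, otherwise the regcomp error code
 * (REG_INVARG is what the text above says to expect when the underlying
 * "pcre" library lacks the required support). */
static int probe_regcomp_flags(int cflags)
{
    assert(cflags >= 0);
    regex_t re;
    int rc = regcomp(&re, "a", cflags);
    if (rc == 0)
        regfree(&re);
    return rc;
}
```

An application that needs, say, BRE mode with REG_NEWLINE could call probe_regcomp_flags(REG_NEWLINE) once at startup and exit with a clear message on any nonzero result, instead of failing later on a user-supplied pattern.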

What's this JIT?

The PCRE library has the ability (if built with --enable-jit) to compile patterns directly into machine code on some platforms. This "Just In Time" pattern compilation requires:

  1. The PCRE library must be built with the --enable-jit configuration option
  2. The platform must permit creation of PROT_WRITE | PROT_EXEC memory
  3. The specific pattern and options must be supported by the JIT compiler
  4. Extra JIT API calls must be used

The regcomp and regexec implementations included in this repository automatically take care of 2-4 with automatic, silent fallback to non-JIT when JIT is not supported for whatever reason (including disallowed write + exec).

Simply add the --enable-jit option when running ./configure and the pcreposix library will automatically use JIT pattern matching when available.

If the platform in question simply does not allow JIT at all (see here and here for some discussion about this), it's more efficient to just build the library with the default --disable-jit configuration option.

If building for Mac OS X against an older version of the Mac OS X SDK with --enable-jit and the intent is to potentially use the resulting code on newer versions of OS X, the following defines should be added at configure time:

CPPFLAGS="-DTARGET_OS_OSX=1 -DMAP_JIT=0x0800" ./configure --enable-jit ...

Newer SDKs would provide those exact values and the PCRE JIT runtime is smart enough to only use MAP_JIT when the OS version it's running on supports it.

How to Use

Build and install like normal for PCRE (e.g. run "./configure" then "make" and then "make install").

Remember to run "./configure" with --enable-jit if you want the JIT pattern compiler support to be present (see the previous section).

This repository already contains the necessary pre-generated "configure" and patches pre-applied to the PCRE 8.44 tarball release. Clone it, download a tarball of it or see the "pcreposix-compat-patches" branch for individual patches to apply to a PCRE 8.44 tarball yourself.

Clients that explicitly need to use non-POSIX options should include the header as #include <pcreposix/pcreposix.h> and make sure they set the REG_PCRE and/or REG_JAVASCPT option bit(s) for regcomp to get PCRE or JRE support.

For use as a drop-in substitute for the POSIX "regex.h", clients should just continue to #include <regex.h> and then compile with the proper include "-I $includedir/pcreposix" option.

In any case, clients of "pcreposix" must link against BOTH "-lpcreposix" and "-lpcre" (wherever they happen to be).
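As an illustration of the drop-in case, the sketch below uses nothing beyond the standard POSIX interface, so the same source should build against either the system header or the "pcreposix" one. The helper name first_match and the file name in the build comment are hypothetical.

```c
/* Possible build line when substituting pcreposix (adjust $includedir):
 *   cc -I "$includedir/pcreposix" demo.c -lpcreposix -lpcre
 * Without the -I option the same source picks up the system <regex.h>. */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Copy the first match of the ERE `pattern` in `text` into `out`.
 * Returns 1 on a match, 0 on no match (out is emptied), and -1 on a
 * compile error (out then holds the regerror message). */
static int first_match(const char *pattern, const char *text,
                       char *out, size_t outsize)
{
    assert(pattern != NULL && text != NULL && out != NULL && outsize > 0);
    regex_t re;
    regmatch_t m[1];
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        regerror(rc, &re, out, outsize);
        return -1;
    }
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0) {
        out[0] = '\0';
        return 0;
    }
    size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
    if (len >= outsize)
        len = outsize - 1; /* truncate rather than overflow */
    memcpy(out, text + m[0].rm_so, len);
    out[len] = '\0';
    return 1;
}
```

Because only POSIX-defined options are used, the result should follow POSIX semantics under either library.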

If you wish to use this as a substitute "regex.h" when building Git, see the accompanying "README-GIT-REGEX" and "config.mak" files for assistance with doing that.

See the "New pcreposix Old pcre" section above for a discussion of what happens when you link against the new "pcreposix" library and an older "pcre" library. In some cases that may be a desirable scenario where a static version of the new "pcreposix" library is linked into an application which is then linked against an older shared "pcre" library. Read the "New pcreposix Old pcre" section for full details.

License

These updates and enhancements are licensed under the same terms as PCRE itself.

Project Home Page

https://github.com/mackyle/pcreposix-compat
