Win32 Port of Henry Spencer's Reg Library

History

"The old one, known as regexp, is simple and clean, but a bit slow and not POSIX-compliant.
"The one that shipped with 4.4BSD, regex, is POSIX compliant, but big and ugly and also slow.
"The newest one, reg, currently is found only in the latest Tcl distribution (version 8.2x)."

The latest version of Henry's code is a complete rewrite to support Unicode and Advanced Regular Expressions (AREs) as defined in Perl. AREs support things like character classes (both \s and [[:space:]] match whitespace) and non-greedy matching (applied to "Hello, World, World, World!", the pattern "Hello.*?World" matches "Hello, World", not "Hello, World, World, World"). AREs are described in the Regular Expression Syntax man page provided with this distribution, as well as in Mastering Regular Expressions from O'Reilly.

For simplicity, my current port does not support Unicode. Ultimately I'd like A and W versions of all of the regex entry points in the style of Win32, but that hasn't happened yet.

Contact

Chris Sells, csells@sellsbrothers.com, http://www.sellsbrothers.com.

Advanced Regular Expression Syntax

See the Regular Expression Syntax man page for a description of Henry's implementation of AREs.

Porting Notes

See the porting notes for how reg was ported to Win32.

Performance

AREs, with their expanded functionality, are about 10x slower than BREs. If you do not need ARE functionality, specify the REG_BASIC flag (described below) to use BREs instead.
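
As a rough illustration of opting into BREs, here is a minimal sketch. MatchBasic is a hypothetical helper, not part of this distribution. The #ifndef fallback is an assumption made so the sketch also builds against a stock POSIX regex.h, where BREs are the default and REG_BASIC may not be defined. BREs have no non-greedy quantifier, so the pattern is the greedy "Hello.*World".

```cpp
#include <assert.h>
#include <regex.h>
#include <string>

// Assumption: a stock POSIX <regex.h> may lack REG_BASIC. BREs are the
// default there when REG_EXTENDED is absent, so 0 stands in for the flag.
#ifndef REG_BASIC
#define REG_BASIC 0
#endif

// Hypothetical helper: returns the text matched by a BRE pattern, or an
// empty string if the pattern fails to compile or does not match.
std::string MatchBasic(const char* pszRE, const char* pszToMatch)
{
    assert(pszRE && pszToMatch);

    regex_t re;
    if( regcomp(&re, pszRE, REG_BASIC) != 0 ) return std::string();

    std::string result;
    regmatch_t  match;
    if( regexec(&re, pszToMatch, 1, &match, 0) == 0 )
        result.assign(pszToMatch + match.rm_so,
                      static_cast<size_t>(match.rm_eo - match.rm_so));

    regfree(&re);   // always release the compiled expression
    return result;
}
```

Called as MatchBasic("Hello.*World", "Hello, World, World, World!"), the greedy BRE takes the longest span, "Hello, World, World, World", which is the trade-off for skipping ARE features.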

Build

To build, open the reg.dsw project and build either the Release or Debug build, which will produce a regex.lib or regexd.lib file respectively. Upon successfully building the library, the post-build step will copy the following files into a peer directory (not a sub-directory) called common:

The file regex_class.h is also found in the common directory. The CRegex class it defines is discussed below.

Usage

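    // stdafx.h
    #include <stdio.h>
    #include <string.h>
    #include <regex.h>
    #ifdef _DEBUG
    #pragma comment(lib, "regexd.lib")
    #else
    #pragma comment(lib, "regex.lib")
    #endif

    // recli.h
    #include "stdafx.h"

    // Element count of a fixed-size array, in case the headers do not provide it
    #ifndef lengthof
    #define lengthof(rg) (sizeof(rg) / sizeof(*(rg)))
    #endif

    int main()
    {
        const char* pszRE = "Hello.*?World";
        const char* pszToMatch = "Hello, World, World, World!";

        regex_t re = { 0 };
        if( regcomp(&re, pszRE, REG_ADVANCED) ) return -1;

        // rgMatches[0] receives the whole match; the rest receive sub-expressions
        regmatch_t  rgMatches[11];
        int nErr = regexec(&re, pszToMatch, lengthof(rgMatches), rgMatches, 0);
        regfree(&re);   // free on success and failure; the offsets stay valid
        if( nErr ) return -1;

        // Copy the matched span, truncating if it would overflow sz
        char    sz[256];
        size_t  cch = rgMatches[0].rm_eo - rgMatches[0].rm_so;
        if( cch >= sizeof(sz) ) cch = sizeof(sz) - 1;
        strncpy(sz, pszToMatch + rgMatches[0].rm_so, cch);
        sz[cch] = 0;
        printf("match: '%s'\n", sz);
        return 0;
    }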
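
The regmatch_t array holds more than the whole match: entry 0 is the full match and entries 1 and up are the parenthesized sub-expressions. Here is a minimal sketch of collecting all of them. Captures is a hypothetical helper, not part of this distribution. It compiles with REG_EXTENDED rather than REG_ADVANCED so that it also builds against a stock POSIX regex.h; the assumption is that this port reports sub-expressions the same way.

```cpp
#include <assert.h>
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match at index 0 followed by one
// entry per parenthesized sub-expression. The result is empty when the
// pattern fails to compile or does not match.
std::vector<std::string> Captures(const char* pszRE, const char* pszToMatch)
{
    assert(pszRE && pszToMatch);

    std::vector<std::string> results;
    regex_t re;
    if( regcomp(&re, pszRE, REG_EXTENDED) != 0 ) return results;

    // re_nsub counts the sub-expressions; add one slot for the whole match
    std::vector<regmatch_t> rgMatches(re.re_nsub + 1);
    if( regexec(&re, pszToMatch, rgMatches.size(), &rgMatches[0], 0) == 0 )
    {
        for( size_t i = 0; i != rgMatches.size(); ++i )
        {
            // rm_so is -1 for a sub-expression that did not participate
            if( rgMatches[i].rm_so < 0 )
            {
                results.push_back(std::string());
                continue;
            }
            results.push_back(std::string(
                pszToMatch + rgMatches[i].rm_so,
                static_cast<size_t>(rgMatches[i].rm_eo - rgMatches[i].rm_so)));
        }
    }

    regfree(&re);
    return results;
}
```

For example, Captures("(Hello), (W[a-z]+)", "Hello, World!") yields three strings: "Hello, World", "Hello" and "World". Sizing the array from re_nsub avoids guessing a fixed capacity.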

C++ Class Wrapper

A C++ class called CRegex, which wraps Henry's regex_t and regmatch_t structures, is included with this distribution in the regex_class.h file. It's meant to be used like so:

    // stdafx.h
    #include "regex_class.h" // regex.lib automatically added to linker line in VC6
    // recli.h
    #include "stdafx.h"
    const char* pszRE = "Hello.*?World";
    const char* pszToMatch = "Hello, World, World, World!";

    CRegex re;
    if( !re.Compile(pszRE) ) return -1;
    if( !re.Match(pszToMatch) ) return -1;
    printf("match: '%s'\n", re[0].c_str());

Compile and Execute Flags

As mentioned in Regular Expression Syntax, there are a number of flags you can embed into the regex string itself. If you prefer, you can instead pass the flags defined in regex.h separately from the regex string. They are listed below:

CRegex::CRegex/CRegex::Compile/regcomp Flags:

Flag Name     Meaning
REG_BASIC     Basic Regular Expressions (BREs)
REG_EXTENDED  Extended Regular Expressions (EREs)
REG_ADVF      Advanced features in EREs
REG_ADVANCED  Advanced Regular Expressions (AREs)
REG_QUOTE     No special characters
REG_ICASE     Ignore case
REG_NOSUB     Don't care about sub-expressions
REG_EXPANDED  Expanded format, white space & comments
REG_NLSTOP    \n doesn't match . or [^ ]
REG_NLANCH    ^ matches after \n, $ before
REG_NEWLINE   Newlines are line terminators
REG_EXPECT    Report details on partial/limited matches
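
Compile flags are bit values, so several can be combined with bitwise OR in one regcomp call. Here is a minimal sketch. Matches is a hypothetical helper, not part of this distribution, and it uses only flags that a stock POSIX regex.h also defines (REG_EXTENDED, REG_ICASE, REG_NEWLINE, REG_NOSUB). The assumption is that this port's flags combine the same way.

```cpp
#include <assert.h>
#include <stddef.h>
#include <regex.h>

// Hypothetical helper: true when pszRE, compiled with cflags, matches
// anywhere in pszToMatch.
bool Matches(const char* pszRE, const char* pszToMatch, int cflags)
{
    assert(pszRE && pszToMatch);

    regex_t re;
    // Only a yes/no answer is needed, so REG_NOSUB is added to skip
    // sub-expression bookkeeping.
    if( regcomp(&re, pszRE, cflags | REG_NOSUB) != 0 ) return false;

    // With REG_NOSUB, no match array is passed (nmatch 0, pmatch NULL)
    bool matched = regexec(&re, pszToMatch, 0, NULL, 0) == 0;

    regfree(&re);
    return matched;
}
```

With REG_EXTENDED | REG_ICASE | REG_NEWLINE, the pattern "^world$" matches the second line of "Hello\nWORLD\n". Dropping either REG_ICASE or REG_NEWLINE makes the same call fail.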

CRegex::Match/regexec Flags:

Flag Name   Meaning
REG_NOTBOL  Beginning of string (BOS) is not beginning of line (BOL)
REG_NOTEOL  End of string (EOS) is not end of line (EOL)
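
REG_NOTBOL is typically needed when scanning a string for successive matches: each later call to regexec starts in the middle of the original string, so ^ should no longer match at the start of the buffer it is given. Here is a minimal sketch. FindAll is a hypothetical helper, not part of this distribution. It compiles with REG_EXTENDED so that it also builds against a stock POSIX regex.h; the assumption is that this port honors REG_NOTBOL the same way.

```cpp
#include <assert.h>
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: collects every non-overlapping match of pszRE in
// pszToMatch, scanning left to right.
std::vector<std::string> FindAll(const char* pszRE, const char* pszToMatch)
{
    assert(pszRE && pszToMatch);

    std::vector<std::string> results;
    regex_t re;
    if( regcomp(&re, pszRE, REG_EXTENDED) != 0 ) return results;

    const char* psz = pszToMatch;
    int         eflags = 0;     // the first call really is at the start
    regmatch_t  match;
    while( regexec(&re, psz, 1, &match, eflags) == 0 )
    {
        results.push_back(std::string(
            psz + match.rm_so,
            static_cast<size_t>(match.rm_eo - match.rm_so)));

        if( match.rm_eo == match.rm_so )
        {
            // Empty match: step one character to guarantee progress
            if( psz[match.rm_eo] == 0 ) break;
            psz += match.rm_eo + 1;
        }
        else
        {
            psz += match.rm_eo;
        }

        // Later calls begin mid-string, so ^ must not match at psz
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return results;
}
```

With this helper, FindAll("^a", "aaa") returns a single match, because the restarts are flagged as not being the beginning of a line, while FindAll("a", "aaa") returns three.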

License

All modifications to the "reg" library, including extras provided for Win32 and C++, are provided under the same license as the "reg" library itself.