Repeating a Capturing Group vs. Capturing a Repeated Group

When creating a regular expression that needs a capturing group to grab part of the text matched, a common mistake is to repeat the capturing group instead of capturing a repeated group. The difference is that a repeated capturing group captures only its last iteration, while a group that wraps a repeated group captures the text matched by all iterations. An example will make this clear. Let’s say you want to match a tag like „!abc!” or „!123!”. Only these two are possible, and you want to capture the „abc” or „123” to figure out which tag you got. That’s easy enough: «!(abc|123)!» will do the trick.

Now let’s say that the tag can contain multiple sequences of “abc” and “123”, like „!abc123!” or „!123abcabc!”. The quick and easy solution is «!(abc|123)+!». This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group.

When this regex matches „!abc123!”, the capturing group stores only „123”. When it matches „!123abcabc!”, it stores only „abc”.
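The behavior described above can be reproduced with a short script. This is only a sketch, and it assumes Python's standard `re` module as the regex engine; the variable name `repeated_group` is made up for the illustration.

```python
import re

# Repeating the capturing group: group 1 is overwritten on every iteration.
repeated_group = re.compile(r"!(abc|123)+!")

for tag in ["!abc!", "!abc123!", "!123abcabc!"]:
    match = repeated_group.fullmatch(tag)
    # group(0) is the overall match; group(1) holds only the last iteration.
    print(tag, "->", match.group(0), "/", match.group(1))
```

The overall match always covers the whole tag, while group 1 only ever reports the final „abc” or „123” that the group matched.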

This is easy to understand if we look at how the regex engine applies «!(abc|123)+!» to “!abc123!”. First, «!» matches „!”. The engine then enters the capturing group. It makes note that capturing group #1 was entered when the engine reached the position between the first and second characters in the subject string.

The first token in the group is «abc», which matches „abc”. A match is found, so the second alternative isn’t tried. (The engine does store a backtracking position, but this won’t be used in this example.) The engine now leaves the capturing group. It makes note that capturing group #1 was exited when the engine reached the position between the 4th and 5th characters in the string.

After having exited from the group, the engine notices the plus. The plus is greedy, so the group is tried again. The engine enters the group again, and takes note that capturing group #1 was entered between the 4th and 5th characters in the string. It also makes note that since the plus is not possessive, it may be backtracked. That is, if the group cannot be matched a second time, that’s fine. In this backtracking note, the regex engine also saves the entrance and exit positions of the group during the previous iteration of the group. «abc» fails to match “123”, but «123» succeeds. The group is exited again. The exit position between characters 7 and 8 is stored.

The plus allows for another iteration, so the engine tries again. Backtracking info is stored, and the new entrance position for the group is saved. But now, both «abc» and «123» fail to match “!”. The group fails, and the engine backtracks. While backtracking, the engine restores the capturing positions for the group. Namely, the group was entered between characters 4 and 5, and exited between characters 7 and 8.

The engine proceeds with «!», which matches „!”. An overall match is found. The overall match spans the whole subject string. The capturing group spans characters 5, 6 and 7, or „123”. Backtracking information is discarded when a match is found, so there’s no way to tell after the fact that the group had a previous iteration that matched „abc”. (The only exception to this is the .NET regex engine, which does preserve backtracking information for capturing groups after the match attempt.)
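The final capture positions from this walkthrough can be checked by asking the engine for the group's offsets. The sketch below again assumes Python's `re` module, which reports zero-based offsets between characters: offset 4 is the position between the 4th and 5th characters, and offset 7 is the position between the 7th and 8th.

```python
import re

match = re.search(r"!(abc|123)+!", "!abc123!")

overall_span = match.span(0)  # start and end offsets of the overall match
group_span = match.span(1)    # offsets recorded for the group's final iteration
captured = match.group(1)     # text between those two offsets

print(overall_span, group_span, captured)
```

Only the positions of the last iteration survive; nothing in the match object records that the group also matched „abc” on its first pass.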

The solution to capturing „abc123” in this example should be obvious now: the regex engine should enter and leave the group only once. This means that the plus should be inside the capturing group rather than outside. Since we do need to group the two alternatives, we’ll need to place a second capturing group around the repeated group: «!((abc|123)+)!». When this regex matches „!abc123!”, capturing group #1 will store „abc123”, and group #2 will store „123”. Since we’re not interested in the inner group’s match, we can optimize this regular expression by making the inner group non-capturing: «!((?:abc|123)+)!».
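Both corrected patterns can be compared side by side. As before, this is a minimal sketch that assumes Python's `re` module; the names `nested` and `non_capturing` are chosen only for the example.

```python
import re

subject = "!abc123!"

# Outer group captures the whole repeated sequence; inner group still
# holds only its last iteration.
nested = re.search(r"!((abc|123)+)!", subject)
print(nested.groups())

# With a non-capturing inner group, only the full label is captured.
non_capturing = re.search(r"!((?:abc|123)+)!", subject)
print(non_capturing.groups())
```

The non-capturing version leaves a single group to deal with, so code that reads group #1 does not need to know about the alternation inside it.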

Part 3

Tools & Languages

1. Specialized Tools and Utilities for Working with Regular Expressions

These tools and utilities have regular expressions as the core of their functionality.

grep - The utility from the UNIX world that first made regular expressions popular.

PowerGREP - Next generation grep for Microsoft Windows.

RegexBuddy - Learn, create, understand, test, use and save regular expressions. RegexBuddy makes working with regular expressions easier than ever before.

General Applications with Notable Support for Regular Expressions

There are a lot of applications these days that support regular expressions in one way or another, enhancing certain parts of their functionality. But certain applications stand out from the crowd by implementing a full-featured Perl-style regular expression flavor and allowing regular expressions to be used instead of literal search terms throughout the application.

EditPad Pro - Convenient text editor with a powerful regex-based search and replace feature, as well as regex-based customizable syntax coloring.

Programming Languages and Libraries

If you are a programmer, you can save a lot of coding time by using regular expressions. With a regular expression, you can do powerful string parsing in only a handful of lines of code, or maybe even just a single line. A regex is often faster to write and easier to debug and maintain than the dozens or hundreds of lines of code needed to achieve the same by hand.

Delphi - Delphi does not have built-in regex support. Delphi for .NET can use the .NET framework regex support. For Win32, there are several PCRE-based VCL components available.

Java - Java 1.4 (J2SE 1.4) and later include an excellent regular expressions library in the java.util.regex package.

JavaScript - If you use JavaScript to validate user input on a web page at the client side, using JavaScript’s built-in regular expression support will greatly reduce the amount of code you need to write.

.NET (dot net) - Microsoft’s new development framework includes a poorly documented but very powerful regular expression package that you can use in any .NET-based programming language such as C# (C sharp) or VB.NET.

PCRE - Popular open source regular expression library written in ANSI C that you can link directly into your C and C++ applications, or use through an .so (UNIX/Linux) or a .dll (Windows).

Perl - The text-processing language that gave regular expressions a second life, and introduced many new features. Regular expressions are an essential part of Perl.

PHP - Popular language for creating dynamic web pages, with three sets of regex functions. Two implement POSIX ERE, while the third is based on PCRE.

POSIX - The POSIX standard defines two regular expression flavors that are implemented in many applications, programming languages and systems.

Python - Popular high-level scripting language with a comprehensive built-in regular expression library.

REALbasic - Cross-platform development tool similar to Visual Basic, with a built-in RegEx class based on PCRE.

Ruby - Another popular high-level scripting language with comprehensive regular expression support as a language feature.

Tcl - Tcl, a popular “glue” language, offers three regex flavors: two POSIX-compatible flavors, and an “advanced” Perl-style flavor.

VBScript - Microsoft scripting language used in ASP (Active Server Pages) and Windows scripting, with a built-in RegExp object implementing the regex flavor defined in the JavaScript standard.

Visual Basic 6 - Last version of Visual Basic for Win32 development. You can use the VBScript RegExp object in your VB6 applications.

XML Schema - The W3C XML Schema standard defines its own regular expression flavor for validating simple types using pattern facets.

Databases

Modern databases often offer built-in regular expression features that can be used in SQL statements to filter columns using a regular expression. With some databases you can also use regular expressions to extract the useful part of a column, or to modify columns using a search-and-replace.

MySQL - MySQL’s REGEXP operator works just like the LIKE operator, except that it uses a POSIX Extended Regular Expression.

Oracle - Oracle Database 10g adds 4 regular expression functions that can be used in SQL and PL/SQL statements to filter rows and to extract and replace regex matches. Oracle implements POSIX Extended Regular Expressions.

PostgreSQL - PostgreSQL provides matching operators and extraction and substitution functions using the “Advanced Regular Expression” engine also used by Tcl.

2. Using Regular Expressions with Delphi for .NET and Win32

Use System.Text.RegularExpressions with Delphi for .NET

When developing Borland Delphi WinForms and VCL.NET applications, you can access all classes that are part of the Common Language Runtime (CLR), including System.Text.RegularExpressions. Simply add this namespace to the uses clause, and you can access the .NET regex classes such as Regex, Match and Group. You can use them with Delphi just as they can be used by C# and VB developers.

PCRE-based Components for Delphi for Windows/Win32

If your application is a good old Windows application using the Win32 API, you obviously cannot use the regex support from the .NET framework. Delphi itself does not provide a regular expression library, so you will need to use a third-party VCL component. I recommend that you use a component that is based on the open source PCRE library. This is a very fast library, written in C, and the regex syntax it supports is very complete. There are a few Delphi components that implement regular expressions purely in Delphi. Though that may sound like an advantage, the pure Delphi libraries I have seen do not support a full-featured modern regex syntax.

There are many PCRE-based VCL components available. Most are free, some are not. Some compile PCRE into a DLL that you need to ship along with your application, others link the PCRE OBJ files directly into your Delphi EXE.

One such component is TPerlRegEx, which I developed myself. You can download TPerlRegEx for free at http://www.regular-expressions.info/delphi.html. The TPerlRegEx Delphi source, the PCRE C sources, and the PCRE OBJ files and DLL are included. You can choose to link the OBJ files directly into your application, or to use the DLL. TPerlRegEx has full support for regex search-and-replace and regex splitting, which PCRE itself does not provide. Full documentation is included with the download as a help file.

RegexBuddy’s Win32 Delphi code snippets are based on the TPerlRegEx component.
