Tuesday, December 09, 2014

Catastrophic Regular Expression

Today I was reading about catastrophic regular expression in java. Succinctly put, most programming language that includes regular expression support with greedy matching fail miserably against the following example:

Matcher: (a+a+) b
Sample string: aaaaa

The primary reason for this is matching logic tries different permutation as part of its backtracking when attempting a match. So for the aforementioned example, something to this effect takes place:

aaaaa, X => Fail
aaaa, a  => Fail backtrack
aaaa, aa => Fail backtrack
aaaa, a, aa => Fail
aaaa, aaa => Fail
aaaa, aa, a => Fail
aaaa, a, aa => Fail 

And so on.

You can see the various intermediate steps as part of this regex debugger.

One way to overcome this is by using possessive quantifier that does not do any backtracking. To explain further, Java supports three different types of quantifiers, viz., Greedy, Reluctant and Possessive quantifiers.

Greedy quantifier [syntax *]

Greedy quantifier forces the matcher to read in the entire input string before attempting the first match. In case if the first match attempts fails the matcher backs off by one character as seen in the example above and repeats the whole process.

Matcher: .*aaa
Sample String: aaaaaa

By definition greedy quantifier matches the entire string.


Reluctant Quantifier [syntax *?]

Reluctant quantifier takes an orthogonal strategy. It tries to match every character while searching for a match.

Matcher: .*?aaa
Sample String: aaaaaa

By definition reluctant quantifier would produce two matches, viz., aaa in {1, 3} and {4, 6}

Possessive Quantifier [syntax *+]

Possessive quantifier also takes in the entire input string but it tries only once for a match. Unlike greedy quantifiers, possessive quantifiers doesn't have a backtracking strategy.

Matcher: .*+aaa
Sample String: aaaaaa

No match is produced.