Regular Expressions: The Rules

The following are the rules that a non-POSIX regular expression engine (such as those in Perl, Java, etc.) adheres to while attempting to match a pattern against a string.

Notation: each example lists the given regex (pattern), the string tested against (string), and the actual match in the string, shown between ‘<<<’ and ‘>>>’.

1. The match that begins earliest/leftmost wins.

The intention is to match the ‘cat’ at the end, but the ‘cat’ in ‘catalogue’ wins the match because it appears leftmost in the string.

pattern : cat

string :This catalogue has the names of different species of cat.

Matched: This <<< cat >>> alogue has the names of different species of cat.
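The behaviour above can be reproduced with a short sketch. It assumes Python's `re` module as a stand-in for the Perl and Java engines mentioned earlier (all are backtracking, non-POSIX engines); the variable names are purely illustrative.

```python
import re

text = "This catalogue has the names of different species of cat."

# re.search scans from the left and reports the first position
# at which the pattern can match.
m = re.search(r"cat", text)

# Mark the match the same way the examples in this post do.
marked = text[:m.start()] + "<<<" + m.group() + ">>>" + text[m.end():]
print(m.start())  # 5, i.e. inside "catalogue", not the final "cat"
print(marked)
```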

1a. The leftmost match in the string wins, irrespective of the order in which a pattern appears in the alternation.

Though last in the alternation, ‘catalogue’ gets the match because, of all the alternatives, it is the one that appears leftmost in the string.

pattern :species|names|catalogue

string :This catalogue has the names of different species of cat.

Matched: This <<< catalogue >>>  has the names of different species of cat.
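As a quick check of this rule, here is a minimal sketch, again assuming Python's `re` module behaves like the engines discussed here for this pattern:

```python
import re

text = "This catalogue has the names of different species of cat."

# 'catalogue' is listed last, but it occurs earliest in the string,
# so the engine reaches it before 'names' or 'species'.
m = re.search(r"species|names|catalogue", text)
print(m.group(), m.start())  # catalogue 5
```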

1b. If more than one alternative can match at the same position, the order of the alternatives in the alternation decides.

All three alternatives have a possible match at the same position, but ‘over’ succeeds because it appears first in the alternation.

pattern : over|o|overnight

string :Actually, I’m an overnight success. But it took twenty years.

Matched: Actually, I’m an <<< over >>> night success. But it took twenty years.
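The effect of ordering can be seen by trying the same three alternatives in different orders. This is a sketch that assumes Python's `re` module (which, like Perl and Java, tries alternatives left to right); the `first_match` helper is hypothetical and only exists for this illustration.

```python
import re

text = "Actually, I'm an overnight success. But it took twenty years."

def first_match(pattern):
    """Return the text of the first match of pattern in text."""
    return re.search(pattern, text).group()

# All three alternatives can match at the 'o' of "overnight";
# whichever is listed first wins.
print(first_match(r"over|o|overnight"))   # over
print(first_match(r"o|over|overnight"))   # o
print(first_match(r"overnight|over|o"))   # overnight
```

Listing the longest alternative first is therefore one way to get the longest match from an ordered alternation.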


2. The standard quantifiers (*, +, ? and {m,n}) are greedy.

A greedy quantifier (*, +, ?) always tries to match as much as possible first, and only then falls back toward the minimum number of characters needed for the match to succeed (0 for * and ?; 1 for +).

The intention is to match only "Joy is prayer", but .* runs past all the intermediate double quotes, grabbing everything, so that the closing quote in the pattern ends up matching the last double quote (") in the string.

pattern : ".*"

string : "Joy is prayer"."Joy is strength"."Joy is Love".

Matched: <<< "Joy is prayer"."Joy is strength"."Joy is Love" >>> .
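A small sketch of the greedy case, assuming Python's `re` module and plain ASCII double quotes in both the pattern and the string:

```python
import re

text = '"Joy is prayer"."Joy is strength"."Joy is Love".'

# Greedy .* first runs to the end of the string, then backs off
# only far enough for the closing quote in the pattern to match.
greedy = re.search(r'".*"', text)
print(greedy.group())
# "Joy is prayer"."Joy is strength"."Joy is Love"
```

Only the final full stop is left out of the match, because it comes after the last double quote.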

2a. Lazy quantifiers favor the minimum match.

A lazy quantifier (*?, +?, ??) always tries to settle for the minimum number of characters needed for the match to succeed before it tries to match more.

With the lazy quantifier, the match ends at the first closing double quote (") encountered.

pattern : ".*?"

string : "Joy is prayer"."Joy is strength"."Joy is Love".

Matched: <<< "Joy is prayer" >>> ."Joy is strength"."Joy is Love".
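The lazy variant can be tried the same way. This sketch assumes Python's `re` module; `re.findall` is used only to show that repeating the search picks up each quoted phrase separately.

```python
import re

text = '"Joy is prayer"."Joy is strength"."Joy is Love".'

# Lazy .*? grows one character at a time and stops at the
# first closing quote that lets the pattern succeed.
lazy = re.search(r'".*?"', text)
print(lazy.group())  # "Joy is prayer"

# Repeating the search yields each quoted phrase on its own.
phrases = re.findall(r'".*?"', text)
print(phrases)
```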

2b. The only time greedy quantifiers give up what they have matched earlier and settle for less is when matching too much would cause some later part of the regex to fail.

The \w* initially matches the whole word ‘regular_expressions’. That leaves the ‘s’ in the pattern with no character to match, so the attempt is about to fail; this forces \w* to backtrack and match one character less. The final ‘s’ then matches the ‘s’ just released by \w*, and the whole match succeeds.

Note: Though the pattern would work the same way without parentheses, I’ve used them to show the individual matches in $1, $2, etc.

pattern : (\w*)(s)

string :regular_expressions

Matched: <<< regular_expressions >>>

$1 = regular_expression

$2 = s

Similarly, the ‘x’ initially matched by ‘x*’ is later given up in favor of the final ‘x’ in the pattern.

pattern : (x*)(x)

string : ox

Matched: o<<< x >>>

$1 = (empty)

$2 = x
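Both backtracking examples can be checked with capture groups. This is a sketch assuming Python's `re` module, where `groups()` plays the role of Perl's $1 and $2:

```python
import re

# \w* first takes the whole word, then gives back one character
# so that the literal 's' has something to match.
word = re.search(r"(\w*)(s)", "regular_expressions")
print(word.groups())  # ('regular_expression', 's')

# x* first takes the only 'x', then gives it back entirely,
# leaving group 1 empty so that the literal 'x' can match.
ox = re.search(r"(x*)(x)", "ox")
print(ox.groups(), ox.start())  # ('', 'x') 1
```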


2c. When more than one greedy quantifier appears in a pattern, the first one gets preference.

Though the .* initially matches the whole string, [0-9]+ is able to take back just one digit, ‘5’, from the .*, and [0-9]+ settles for it since that satisfies its minimum requirement. Note that ‘+’ is also a greedy quantifier, yet here it cannot grab more than its minimum, because the earlier greedy quantifier has already claimed the rest of the string and gives back only what the overall match requires.

pattern : (.*)([0-9]+)

string : Bangalore-560025

Matched: <<< Bangalore-560025 >>>

$1 = Bangalore-56002

$2 = 5
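The split between the two groups can be confirmed with a short sketch, assuming Python's `re` module:

```python
import re

# The first greedy quantifier keeps everything it can; the second
# gets only the single digit it needs for the match to succeed.
pin = re.search(r"(.*)([0-9]+)", "Bangalore-560025")
print(pin.groups())  # ('Bangalore-56002', '5')
```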


3. Overall match takes precedence.

The ability to report a successful match takes precedence. As shown in the previous example, when it is necessary for a successful match, the quantifiers (greedy or lazy) work in harmony with the rest of the pattern.
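The same holds for lazy quantifiers: they take more than their minimum when the overall match needs it. The sketch below is an illustrative variation that is not in the original examples. It makes the first quantifier lazy and assumes Python's `re` module.

```python
import re

# Left alone, .*? would match nothing at all. It is forced to
# grow up to the hyphen, because only then can [0-9]+ match.
lazy_pin = re.search(r"(.*?)([0-9]+)", "Bangalore-560025")
print(lazy_pin.groups())  # ('Bangalore-', '560025')
```

Here the lazy group expands only as far as the overall match requires, and the greedy [0-9]+ then takes all six digits.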
