Introducing patternchecker

January 24, 2010

Often I need a small utility script/program to check the matched results of regex pattern against known text. So I dont have to execute my lengthy main script/program multiple times to get the correct pattern.

I’ve used the following patternchecker script (PERL) & program (JAVA) while coming up with patterns and for demonstration in my previous posts.

Usage:
# java patterncheck
Enter pattern : (x*)(x)(x+)
Enter string:foxxxxx
Matched : fo<<< xxxxx >>>
group 1 = xxx
group 2 = x
group 3 = x

Perl and Java uses the same regex engine and thus would behave similarly while matching patterns, I’ve listed here the sourcecode in perl and java out of my curiosity although you just need any one of them,

i) perl

#!/usr/bin/perl
print "Enter pattern :";
### Read the regex pattern
chomp($pattern=<STDIN>);

### count the number of paranthesis
$cnt++ while( $pattern =~ /\(/g );
print "Enter string :";
chomp($input=<STDIN>);

### read the string to be tested against
if ( $input =~ /$pattern/ ){

      ### print prematch, match, postmatch
      print "Matched: $`<<< $& >>> $'\n";

      ### print $1, $2, etc
      foreach $i (1..${cnt}) { print "\$",$i," = ",${$i},"\n";}
} else {
      print "No Match\n";
}
<div>

ii) java

import java.lang.*;
import java.io.*;
import java.util.regex.*;

public class patterncheck{
	public static void main(String args[]){
		String pattern, str;
		int groups;

		//Read pattern and string from the console
		Console c = System.console();
		pattern=c.readLine("%s","Enter pattern :");
		str=c.readLine("%s","Enter string:");

		//compile and associate the pattern with the text
		Pattern p = Pattern.compile(pattern);
		Matcher m = p.matcher(str);

		//if there is a match
		if (m.find()){
			// print prematch, match & postmatch
			System.out.println("Matched : "+str.substring(0,m.start())+"<<< "+ m.group() +" >>>"+str.substring(m.end()));

			// print what captured in paranthesis
			if (m.groupCount() != 0) {
				groups=m.groupCount();
				for (int i=1;i<=groups;i++){
					System.out.println("group "+i+" = "+m.group(i));
				}
			}
		}
		else {
			System.out.println("No Match");
		}
	}
}
Advertisements

Regular Expressions: Icebreakers

January 24, 2010

Few days back, there was a knowledge session on regular expressions within my team. After discussing the usual topics like greedy & lazy quantifiers, backreferences, etc, we started analyzing match results for few expressions. I ‘ve familiarity with regex and used them in majority of my throw-away scripts and I thought I knew regex unless been baffled with the simple questions from the team. I list down few of those simplest of the simple patterns and what they match and why ( which actually led to me learn the rules of the game),

Before even starting to look at them, did I mentioned earlier that regex engine would start its search just before the first character of the string ? If not, let me tell you now, it need to start before the first character, if and all the patterns  contains anchors ( ^, \b, etc ), it needs to check them too. And the search would go beyond the last character in the string and now you know why ( to match $, \b, etc ).

(i) x*

pattern : x*

string :foxxx

Matched: <<<  >>> foxxx

Explanation: As mentioned in the rule 2 here,  the greediness would always try to match more, hence read the pattern ‘x*’ as  ‘match more occurrence of x or nothing’. And the engine going to do its search character by character in the string. Since it could not find any ‘x’ to match at the starting position, it tries with its other choice ‘ match nothing’ and it succeeds.

(ii) .*

pattern : .*

string :foxxx

Matched: <<< foxxx >>>

Explanation: ‘.’ matches anything other than ‘\n’. Though the pattern ‘.*’ can be read as ‘match more of any characters other than ‘\n’ or nothing’, the rule of greediness gives the preference to match more characters.

(iii) x*

pattern : x*

string : xxxfoxxx

Matched: <<< xxx >>> foxxx

Explanation: Same greediness favors the match more criteria.


Regular Expressions: The Rules

January 24, 2010

The following are the rules, a non-POSIX regular expression engine(such as in PERL, JAVA, etc ) would adhere to while attempting to match with the string,

Notation: the examples would list the given regex(pattern) , the string tested against (string) and the actual match happened in the string in between ‘<<<‘ and ‘>>>’.

1. The match that begins earliest/leftmost wins.

The intention is to match the cat at the end but the ‘cat’ in the catalogue won the match as it appears leftmost in the string.

pattern : cat

string :This catalogue has the names of different species of cat.

Matched: This <<< cat >>> alogue has the names of different species of cat.

1a.The leftmost match in the string wins, irrespective of the order a pattern appears in alternation

Though last in the alternation, ‘catalogue’ got the match as it appeared leftmost among the patterns in the alternation.

pattern :species|names|catalogue

string :This catalogue has the names of different species of cat.

Matched: This <<< catalogue >>>  has the names of different species of cat.

1b. If there are more than one plausible match occurs in the same position, then the order of the plausible matching patterns in the alternation counts.

All three patterns have a possible match at the same position, but ‘over’ is successful as it appeared first in the alternation.

pattern : over|o|overnight

string :Actually, I’m an overnight success. But it took twenty years.

Matched: Actually, I’m an <<< over >>> night success. But it took twenty years.


2. The standard quantifiers (* +, ? and {m,n}) are greedy

Greediness (*,+,?) would always try to match more before it tries to match minimum characters needed for the match to be successful ( ‘0’ for *,? ; ‘1’ for + )

The intention is to match the “Joy is prayer”, though .* went pass across all the double quotes and grabbing all the strings only to match the last double quote (“).

pattern :”.*”

string :”Joy is prayer”.”Joy is strength”.”Joy is Love”.

Matched: <<< “Joy is prayer”.”Joy is strength”.”Joy is Love” >>> .

2a. Lazy quantifiers would  favor the minimum match

Laziness (*?,+?,??) would always try to settle with minimum characters needed for the match to be successful before it tries to match the maximum.

The first double quote (‘) appeared was matched using lazy quantifier.

pattern :”.*?”

string :”Joy is prayer”.”Joy is strength”.”Joy i
s Love”.

Matched: <<< “Joy is prayer” >>> .”Joy is strength”.”Joy is Love”.

2b. The only time the greedy quantifiers would give up what they’ve matched earlier and settle for less is ‘when matching too much ends up causing some later part of the regex to fail’.

The \w* would match the whole word ‘regular_expressions’ initially. Later, since ‘s’ didn’t have a character to match and tend to fail would trigger the \w* to backtrack and match one character less. Thus the final ‘s’ matches the ‘s’ just released by \w* and whole match succeeds.

Note: Though the pattern would work the same way without paranthesis, I’d used them to show the individual matches in $1, $2, etc.

pattern : (\w*)(s)

string :regular_expressions

Matched: <<< regular_expressions >>>

$1 = regular_expression

$2 = s

Similarly, the initial match ‘x’ by ‘x*’ was given by later for the favor of the last ‘x’ in the pattern.

pattern : (x*)(x)

string : ox

Matched: o<<< x >>>

$1 =

$2 = x


2c. When more than one greedy quantifiers appears in a pattern, the first greedy would get the preference.

Though the .* initially matched the whole string, the [0-9]+ would able to grab just one digit ‘5’ from the .*, and the 0-9]+ settles with it since that satisfies its minimum match criteriat. Note that the ‘+’ is also a greedy quantifier and here it cant grab beyond its minimum requirement, since already there is an another greedy quantifier shares the same match.

Enter pattern : (.*)([0-9]+)

Enter string : Bangalore-560025

Matched: <<< Bangalore-560025 >>>

$1 = Bangalore-56002

$2 = 5


3. Overall match takes precedence.

Ability to report a successful match takes precedence. As its shown in previous example, if its necessary for a successful match the quantifiers ( greedy or lazy ) would work in harmony with the rest of the pattern.