Monday, September 7, 2009

Beginning Regex with cl-ppcre - Matching

The basic Lisp library for working with regex patterns is cl-ppcre. The biggest difference between cl-ppcre patterns and Perl patterns is that a cl-ppcre pattern is written as a Lisp string, so every backslash has to be doubled.
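
To see why the doubling matters, here is a small sketch of what the Lisp reader does with backslashes inside a string, before cl-ppcre ever sees the pattern. It assumes cl-ppcre is loaded through Quicklisp; the pattern and the sample string are made up for illustration.

```lisp
;; Assumes Quicklisp is available; any other way of loading cl-ppcre works too.
(ql:quickload :cl-ppcre)

;; The reader turns \\ into one backslash, so cl-ppcre receives \d+ (3 characters).
(length "\\d+")                  ; => 3

;; With a single backslash the reader drops it, leaving the pattern d+ instead.
(string= "\d+" "d+")             ; => T

;; The doubled form matches the run of digits: the match starts at 4 and ends at 7.
(cl-ppcre:scan "\\d+" "abc 123")
```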

You can find plenty of regex tutorials on the web. Some places to check:
Pattern Matching
Regex Lib
Regex Buddy
http://www.pcre.org

Now let's try some things with regex. We are going to start really simple with scan and matches. Other posts will address creating a scanner and replacing text. Please, anyone who really understands cl-ppcre and Lisp, correct my mistakes.

Starting really simple with the scan function:

(cl-ppcre:scan "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
5
12
#(6)
#(11)

What is happening here? The pattern looks for one or more whitespace characters, then a run of non-whitespace characters (captured as a group), then one or more whitespace characters. cl-ppcre:scan returns four values: the index where the match starts, the index just past where it ends, and two arrays holding the start and end indices of each captured group. Here the whole match " ipsum " runs from 5 to 12, and the captured group "ipsum" runs from 6 to 11. If all you want is t or nil, you can write a function like this:



(defun matchesp (rep string)
  "Takes a regex pattern and a string; returns t if there is a match and nil if not."
  (not (null (cl-ppcre:scan rep string))))

(matchesp ".+" "abcdef") => T
(matchesp "x+" "abcdef") => NIL
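
When you do need the positions, the four values that scan returns can be captured with multiple-value-bind. The following is a minimal sketch; match-and-first-group is a hypothetical helper name, and it assumes the pattern contains at least one capture group that took part in the match.

```lisp
;; Assumes Quicklisp is available for loading cl-ppcre.
(ql:quickload :cl-ppcre)

(defun match-and-first-group (regex text)
  "Return a list of the whole match and the first group, or NIL if there is no match.
Hypothetical helper; assumes REGEX has a capture group that participated in the match."
  (multiple-value-bind (start end group-starts group-ends)
      (cl-ppcre:scan regex text)
    (when start
      (list (subseq text start end)
            (subseq text (aref group-starts 0) (aref group-ends 0))))))

(match-and-first-group "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
;; => (" ipsum " "ipsum")
```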

cl-ppcre:scan-to-strings returns the matched string itself, followed by an array of the strings captured by each group.

(cl-ppcre:scan-to-strings "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
" ipsum "
#("ipsum")
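
If you only care about the captured groups, cl-ppcre also provides the register-groups-bind macro, which binds each group to a variable. A small sketch, using a two-group pattern made up for illustration:

```lisp
;; Assumes Quicklisp is available for loading cl-ppcre.
(ql:quickload :cl-ppcre)

;; Bind the two captured words to WORD1 and WORD2; the body runs only on a match.
(cl-ppcre:register-groups-bind (word1 word2)
    ("(\\S+)\\s+(\\S+)" "Lorem ipsum dolor sit amet")
  (list word1 word2))
;; => ("Lorem" "ipsum")
```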

Let's change it slightly.

(cl-ppcre:scan-to-strings "\\s+(\\S+_id)\\s+" "id Lorem ipsum dolor_id, sit_id amet_id")
" sit_id "
#("sit_id")

Notice that this doesn't pick up dolor_id, because it is followed by a comma rather than a space. scan-to-strings also returns only the first match, not every match in the string. Notice too that the match itself includes the surrounding spaces.
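
One way to get every match without the surrounding spaces is to combine all-matches-as-strings with lookaround assertions, which test the neighboring characters without consuming them. This is only a sketch of one option, not necessarily the approach this series ends up with:

```lisp
;; Assumes Quicklisp is available for loading cl-ppcre.
(ql:quickload :cl-ppcre)

;; (?<=\s) requires whitespace just before the match; (?=\s|$) requires
;; whitespace or the end of the string just after it. Neither is consumed.
(cl-ppcre:all-matches-as-strings
 "(?<=\\s)\\S+_id(?=\\s|$)"
 "id Lorem ipsum dolor_id, sit_id amet_id")
;; => ("sit_id" "amet_id")
```

dolor_id is still skipped because of the comma, but amet_id is now found, since the space in front of it is no longer swallowed by the previous match and the end of the string counts as a valid boundary.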

Now let's try a more complicated match. This is a Lisp regex pattern for US federal statute cites. When you look at the sample text, you will notice that we need to pick up certain citations while avoiding Treasury Regulation citations, which look somewhat similar. Because we are just running this from the command line, I'll define one global variable for the pattern and another for the sample text.

(defvar code-regex)
(setf code-regex "\\s+(section |sec\\. |sections |§ |§)([0-9]+[^-,.\\s])+(\\(\\w+\\))?(\\([0-9]*\\))?(\\(\\w+\\))?(\\(\\w+\\))?([.,\\s])+")
(defvar sample-text)
(setf sample-text "Section 882(a) imposes US tax on a foreign corporations engaged in a trade or
business within the US on its income which is effectively connected with the conduct of a trade
or business inside the US.
Section 864(c)(3). All income from sources within the United States shall be treated as
effectively connected with the conduct of a trade or business within the United States. Treas.
Reg. 1.867-7(a) states that income from the purchase and sale of personal property shall be
treated as derived entirely from the country in which the property is sold.
Treas. Reg. 1.867-7(c) states that a sale of personal property is consummated at the time when,
and the place where, the rights, title and interest of the seller in the property are
transferred to the buyer.
Section 865(b) In the case of income derived from the sale of inventory property, such income
shall be sourced under the rules of sections 861(a)(6), section 862(a)(6) and section 863.
Section 861(a)(6) treats inventory purchased outside the US and sold in the US as US source
income. Section 862(a)(6) treats inventory purchased inside the US and sold outside the US as
foreign source income. Section 863(b) would allow a split for inventory produced by the taxpayer
inside the US and sold outside the US.
Section 865(e)(2)(A) states that if a nonresident maintains an fixed place of business inside
the US., any sale of inventory attributable to that fixed place of business is sourced in the US
regardless of where the sale occurs. Section 865(e)(2)(B) states that (A) does not apply if an
office of the taxpayer in a foreign country materially participates in the sale.")

(cl-ppcre:all-matches-as-strings code-regex sample-text)
(" sections 861(a)(6), " " section 863.
")

OK. Not exactly what we need. It did not pick up the citations where the word "Section" has an initial capital. In many languages you could make the pattern case insensitive with the trailing /i (ignore case) flag. cl-ppcre patterns are plain strings, so there is no trailing flag, but cl-ppcre does accept the embedded modifier (?i) at the start of a pattern, and cl-ppcre:create-scanner takes a :case-insensitive-mode argument. (cl-interpol addresses a different problem: it provides a string syntax that avoids doubling backslashes.) In our case, however, we only care about the initial capital, so we can simply insert [Ss]. Thus the pattern becomes:

(setf code-regex "\\s+([sS]ection |[Ss]ec\\. |[Ss]ections |§ |§)([0-9]+[^-,.\\s])+(\\(\\w+\\))?(\\([0-9]*\\))?(\\(\\w+\\))?(\\(\\w+\\))?([.,\\s])+")

and the result is:

(cl-ppcre:all-matches-as-strings code-regex sample-text)
("
Section 864(c)(3). "
"
Section 865(b) "
" sections 861(a)(6), " " section 863.
"
" Section 862(a)(6) " " Section 863(b) " "
Section 865(e)(2)(A) "
" Section 865(e)(2)(B) ")

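As an aside, cl-ppcre can also match case-insensitively instead of spelling out [Ss]. The sketch below shows two ways, on a shortened pattern and made-up sample strings; *section-scanner* is a hypothetical variable name.

```lisp
;; Assumes Quicklisp is available for loading cl-ppcre.
(ql:quickload :cl-ppcre)

;; Embedded modifier: (?i) turns on case-insensitive matching for the pattern.
(cl-ppcre:scan-to-strings "(?i)section \\d+" "Section 882(a) imposes US tax")
;; first value => "Section 882"

;; The same idea with a precompiled scanner.
(defparameter *section-scanner*
  (cl-ppcre:create-scanner "section \\d+" :case-insensitive-mode t))

(cl-ppcre:scan-to-strings *section-scanner* "SECTION 864(c)(3)")
;; first value => "SECTION 864"
```

Unlike [Ss], both forms also accept spellings such as "SECTION" or "sEcTiOn", which may or may not be what you want.
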
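But notice that there are still a few problems. We did not pick up the leading Section 882(a), because it sits at the very start of the text and the pattern requires whitespace in front. We also missed section 862(a)(6) and the Section 861(a)(6) that follows "section 863.", because the whitespace in front of each was already consumed as the tail of the previous match. The match for sections 861(a)(6) has a trailing comma, and the match for section 863 has a trailing period. Finally, we are picking up line feeds or carriage returns as well. So we will need to fix those on another day.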
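
As a pointer toward that fix, here is one possible direction, sketched under a few assumptions: a word boundary replaces the leading whitespace, nothing after the cite is consumed, and only the cite itself is captured. The names *cite-regex*, *short-sample* and statute-cites are hypothetical, the sample is a shortened stand-in for the text above, and the § forms are left out to keep the pattern short.

```lisp
;; Assumes Quicklisp is available for loading cl-ppcre.
(ql:quickload :cl-ppcre)

;; \b replaces the leading \s+, (?:...) groups without capturing,
;; and the single capture group holds the number plus any (x) parts.
(defparameter *cite-regex*
  "\\b(?:[Ss]ections?|[Ss]ec\\.) ([0-9]+(?:\\(\\w+\\))*)")

(defparameter *short-sample*
  "Section 882(a) imposes tax. See sections 861(a)(6), section 862(a)(6) and section 863.
Treas. Reg. 1.867-7(a) is a regulation. Section 865(e)(2)(A) applies.")

(defun statute-cites (text)
  "Collect the captured cite from every match of *CITE-REGEX* in TEXT, in order."
  (let ((cites '()))
    (cl-ppcre:do-register-groups (cite) (*cite-regex* text (nreverse cites))
      (push cite cites))))

(statute-cites *short-sample*)
;; => ("882(a)" "861(a)(6)" "862(a)(6)" "863" "865(e)(2)(A)")
```

Because no whitespace or punctuation is consumed around a cite, back-to-back cites no longer hide each other, and the Treasury Regulation cite is skipped because it is not introduced by "section" or "sec.".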
