Thursday, September 10, 2009

cl-ppcre and replacing strings

So, while I am digging into the macro that is (time), here is a quick note on using cl-ppcre in replacing text in a string.

First simple example.
Let's take a string, break it into separate words with cl-ppcre:split
and return two values, the last word and everything except the last word.
This function is definitely not the most efficient or correct, but will
do for the example.


(let* ((str "John and Susan Q. Public-and-Private")
(last-name (first (last (cl-ppcre:split "\\s+" str))))
(first-name (cl-ppcre:regex-replace last-name str "")))
(values first-name last-name)


More complicated example
Assume we have already defined a variable which holds a text string of legal text which has references to law cites. I want to insert a link around the code sections, but the only actual link we want is the section number itself. We'll use the regex that we used in the last entry as our code-regex pattern.


(cl-ppcre:regex-replace-all code-regex sample-text
'("<a href=\"Laws-display?name=>" :match "\">" :match "</a>"))

"<a href=\"Laws-display?name=>Section 882(a) \">Section 882(a) </a>imposes US tax on a foreign
corporation engaged in a trade or business within the US on its income which is effectively
connected with the conduct of a trade or business inside the US.
<a href=\"Laws-display?name=>Section 864(c)(3). \">Section 864(c)(3). </a>All income from
sources within the United States shall be treated as effectively connected with the conduct of a trade
or business within the United States. Treas. Reg. 1.867-7(a) states that income from the
purchase and sale of personal property shall be treated as derived entirely from the country in
which the property is sold. Treas. Reg. 1.867-7(c) states that a sale of personal property is
consummated at the time when, and the place where, the rights, title and interest of the seller in
the property are transferred to the buyer. <a href=\"Laws-display?name=Section 865(b)
\">Section 865(b) </a>In the case of income derived from the sale of inventory property, such
income shall be sourced under the rules of <a href=\"Laws-display?name=>sections 861(a)(6),
\">sections 861(a)(6), </a><a href=\"Laws-display?name=>section 862(a)(6) \">section 862(a)(6)
</a>and <a href=\"Laws-display?name=>section 863.\">section 863.</a>

<a href=\"Laws-display?name=>Section 861(a)(6) \">Section 861(a)(6) </a>treats inventory
purchased outside the US and sold in the US as US source income. <a href=\"Laws-
display?name=>Section 862(a)(6) \">Section 862(a)(6) </a>treats inventory purchased inside the
US and sold outside the US as foreign source income. <a href=\"Laws-display?name=>Section 863(b)
\">Section 863(b) </a>would allow a split for inventory produced by the taxpayer inside the US
and sold outside the US. <a href=\"Laws-display?name=>Section 865(e)(2)(A) \">Section 865(e)
(2)(A) </a>states that if a nonresident maintains an fixed place of business inside the US., any
sale of inventory attributable to that fixed place of business is sourced in the US regardless
of where the sale occurs. <a href=\"Laws-display?name=>Section 865(e)(2)(B) \">Section 865(e)
(2)(B) </a>states that (A) does not apply if an office of the taxpayer in a foreign country
materially participates in the sale."

This gets us part way there in that we now have a link around the match. However, we really want only the first number and not the word "section". We will want the subsections later, but need to see if we can call a function first. How about a different approach?


(cl-ppcre:regex-replace-all code-regex
sample-text
#'(lambda (match &rest registers)
(concatenate 'string (first registers) "<a href=\"Laws-display?name=" (second registers) "\">"
(second registers) "</a>"))
:simple-calls t)
[Error message about match not being used]

"Section <a href=\"Laws-display?name=882\">882</a>imposes US tax on a foreign corporations
engaged in a trade or business within the US on its income which is effectively connected with
the conduct of a trade or business inside the US. Section <a href=\"Laws-display?name=864
\">864</a>All income from sources within the United States shall be treated as effectively
connected with the conduct of a trade or business within the United States. Treas. Reg.
1.867-7(a) states that income from the purchase and sale of personal property shall be treated
as derived entirely from the country in which the property is sold. Treas. Reg. 1.867-7(c)
states that a sale of personal property is consummated at the time when, and the place where,
the rights, title and interest of the seller in the property are transferred to the buyer.
Section <a href=\"Laws-display?name=865\">865</a>In the case of income derived from the sale of
inventory property, such income shall be sourced under the rules of sections <a href=\"Laws-
display?name=861\">861</a>section <a href=\"Laws-display?name=862\">862</a>and section <a
href=\"Laws-display?name=863\">863</a>Section <a href=\"Laws-display?name=861\">861</a>treats
inventory purchased outside the US and sold in the US as US source income. Section <a
href=\"Laws-display?name=862\">862</a>treats inventory purchased inside the US and sold outside
the US as foreign source income. Section <a href=\"Laws-display?name=863\">863</a>would allow a
split for inventory produced by the taxpayer inside the US and sold outside the US. Section <a
href=\"Laws-display?name=865\">865</a>states that if a nonresident maintains an fixed place of
business inside the US., any sale of inventory attributable to that fixed place of business is
sourced in the US regardless of where the sale occurs. Section <a href=\"Laws-display?name=865
\">865</a>states that (A) does not apply if an office of the taxpayer in a foreign country
materially participates in the sale."


Now it looks better.

Tuesday, September 8, 2009

Simple use of regex matching in profiling

I mentioned yesterday that one of the first things I looked at in lisp was how to concatenate strings. There are several ways and I was curious in how they compared from a system time perspective. Enter the time function. Using the sbcl implementation of time, I tried this:


(time (format nil "~a~a" "string1" "string2"))
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
60,264 processor cycles
3,760 bytes consed

"string1string2"


Good information, but when I am comparing many different functions which do the same thing, and compare them when using different string lengths, this printout is going to get long quick. But also notice a problem. The time function prints out some information, then returns the result of the function being timed. We don't want the result, we just want that info, as a string, to run against a regex. Hmmm. Time to look into the documentation for sbcl. Once we can get that information as a string, then we can run a regex function something like the one below, to generate an html row that we can put in a webpage table.


(defun timing-row (timing-string)
"This is to build an html table row with the output of time when used to profile a function"
(let* ((timing (cl-ppcre:all-matches-as-strings "[\.,0-9]+" timing-string))
(total-run (second timing))
(user-run (third timing))
(system-run (fifth timing))
(cycles (seventh timing))
(bytes-consed (eighth timing)))
(concatenate 'string "" bytes-consed "" cycles "" total-run ""
system-run "" user-run "")))

Monday, September 7, 2009

Beginning Regex with cl-ppcre - Matching

The basic lisp library for dealing in regex patterns is found at Cl-ppcre. The biggest difference between cl-ppcre regex patterns and perl patterns is lisp requires that the backslash be doubled.

You can find a lot of regex tutorial locations on the web. Some places to check would be:
Pattern Matching
Regex Lib,
Regex Buddy,
http://www.pcre.org,

Now lets try some things with regex. We are going to start really simple with scan and matches. Other posts will address creating a scanner and replacing text. Please, anyone who really understands cl-ppcre and lisp, correct my mistakes.

Starting really simple with the scan function:

(cl-ppcre:scan "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
5
12
#(6)
#(11)

What is happening here? The regex pattern looks for a space, followed by everything not a space, until it finds another space. cl-ppcre:scan returns the location in the string before the match, the location after the match and the beginning and end of the match. If all you want is a t or nil, you can create a function like: Returns t if there is a match:



(defun matchesp (rep string)
"Takes a regex pattern and a string, return t if there is a match and nil if not."
(not (null (cl-ppcre:scan rep string))))

(matchesp ".+" "abcdef") => T
(matchesp "x+" "abcdef") => NIL

cl-ppcre:scan-to-strings returns the string match itself.

(cl-ppcre:scan-to-strings "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
" ipsum "
#("ipsum")

Let's change it slightly.

(cl-ppcre:scan-to-strings "\\s+(\\S+_id)\\s+" "id Lorem ipsum dolor_id, sit_id amet_id")
" sit_id "
#("sit_id")

Notice that this doesn't pick up dolor_id because it is followed by a comma, rather than a space and only picks up the first match, not both first and second matches. Notice that it is also picking up the spaces in the match. So lets try again.
Now lets try a more complicated match. This is a lisp regex pattern for US Federal Statute Cites. When you look at the sample text, you will notice that we need to pick up certain citations, but we are trying to avoid Treasury Regulation citations, which look somewhat similar. Because we are just running this from the command line, I'll just define a global variable and set it to the pattern, and define another global variable for the sample text to be used.

(defvar code-regex)
(setf code-regex "\\s+(section |sec\\. |sections |§ |§|§|§ )([0-9]+[^-,.\\s])+(\\(\\w+\\))?(\\([0-9]*\\))?(\\(\w+\\))?(\\(\\w+\\))?([\.|,|\\s])+")
(defvar sample-text)
(setf sample-text "Section 882(a) imposes US tax on a foreign corporations engaged in a trade or
business within the US on its income which is effectively connected with the conduct of a trade
or business inside the US.
Section 864(c)(3). All income from sources within the United States shall be treated as
effectively connected with the conduct of a trade or business within the United States. Treas.
Reg. 1.867-7(a) states that income from the purchase and sale of personal property shall be
treated as derived entirely from the country in which the property is sold.
Treas. Reg. 1.867-7(c) states that a sale of personal property is consummated at the time when,
and the place where, the rights, title and interest of the seller in the property are
transferred to the buyer.
Section 865(b) In the case of income derived from the sale of inventory property, such income
shall be sourced under the rules of sections 861(a)(6), section 862(a)(6) and section 863.
Section 861(a)(6) treats inventory purchased outside the US and sold in the US as US source
income. Section 862(a)(6) treats inventory purchased inside the US and sold outside the US as
foreign source income. Section 863(b) would allow a split for inventory produced by the taxpayer
inside the US and sold outside the US.
Section 865(e)(2)(A) states that if a nonresident maintains an fixed place of business inside
the US., any sale of inventory attributable to that fixed place of business is sourced in the US
regardless of where the sale occurs. Section 865(e)(2)(B) states that (A) does not apply if an
office of the taxpayer in a foreign country materially participates in the sale.")

(cl-ppcre:all-matches-as-strings code-regex sample-text)
(" sections 861(a)(6), " " section 863.
")

Ok. Not exactly what we need. It did not pick up where the word "Section" had initial caps. While many languages you could try to make this case insenstive by using the control character /i otherwise known as ignore case, cl-ppcre does not follow this. To use case insensitive patterns, you might look at cl-interpol. In our case, however, we can insert [Ss] because we are really looking just for initial caps. Thus the pattern becomes:

(setf code-regex "\\s+([sS]ection |[Ss]ec\\. |[Ss]ections |§ |§|§|§ )([0-9]+[^-,.\\s])+(\\(\\w+\\))?(\\([0-9]*\\))?(\\(\w+\\))?(\\(\\w+\\))?([\.|,|\\s])+")

and the result is:

(cl-ppcre:all-matches-as-strings code-regex sample-text)
("
Section 864(c)(3). "
"
Section 865(b) "
" sections 861(a)(6), " " section 863.
"
" Section 862(a)(6) " " Section 863(b) " "
Section 865(e)(2)(A) "
" Section 865(e)(2)(B) ")

But you notice that there are still a few problems. We did not pick up the leading Section 882(a) because it does not follow a space. One of the matches for sections 861(a)(6) also has a trailing comma, another has a trailing period. Finally, we are picking up line feeds or carriage returns as well. So we will need to fix those on another day.

First Post

OK. It is going to take me awhile to figure out how to do what I want to do with blogspot. Argh. I want real html table formatting people.

I'm starting to look at common lisp and I do a lot of text data manipulation. So one of the first things I looked at was string concatenation. Hmm. Lots of ways of doing it, some of them faster than others.

After reading these entries: http://ryepup.unwashedmeme.com/blog/2007/12/26/string-building-comparison-with-lisp and Concatenate is slow on string concatenation, I decided to
do some timing on my own. If I am interpreting the numbers
correctly, the arnesi:join-strings function is almost always faster
on short strings. HOWEVER, if you replace the short strings with
long strings, it actually becomes the slowest if *PRINT_PRETTY* is
nil. If *PRINT-PRETTY* is T, it is still faster than format and concatenate. I will put up tables as soon as I can figure out how I can format them to my liking here.
the short string results: