Saturday, February 20, 2010

short cl-dot tutorial

I wanted to automatically generate some diagrams from a database. I've
done it before, in other languages, generating strings to pass on to
graphviz, and decided to look at the lisp alternatives. Obviously I could write something to again just dump strings, but looking at the lisp libraries, there seemed to be two potential other alternatives:
cl-dot and s-dot.
Cl-dot seems to be written from a clos perspective, and s-dot is written from the prespective of generating s-expressions. I don't have any particular allegience in a holy war here, so decided to write out a short tutorial in both for myself and see if I had a preference.
Keeping it simple and avoiding all the database stuff, I decided to use two sets of sample information. The first set has class instances that point to a single other class instance in a list of instances. The second set has a list of class instances and a second list showing connections. You can think of the first set as manager reporting where each person only has one manager. The second set links people to projects and indicates how much time they spend on each project.

(defclass person ()
((id :accessor id :initarg :id )
(name :accessor name :initarg :name )
(manager-id :accessor manager-id :initarg :manager-id)))
(defclass project ()
((id :accessor id :initarg :id )
(name :accessor name :initarg :name)))
(defclass person-project ()
((id :accessor id :initarg :id)
(person-id :accessor person-id :initarg :person-id)
(project-id :accessor project-id :initarg :project-id)
(percent-allocated :accessor percent-allocated :initarg :percent-allocated)))
(defvar *person-list* (list
(make-instance 'person :id 2 :manager-id 2 :name "Mark")
(make-instance 'person :id 1 :manager-id 2 :name "John")
(make-instance 'person :id 3 :manager-id 2 :name "Janet")
(make-instance 'person :id 4 :manager-id 1 :name "Steve")
(make-instance 'person :id 5 :manager-id 3 :name "Paul")
(make-instance 'person :id 6 :manager-id 3 :name "Chris")))
(defvar *person-project-list*)
(setf *person-project-list* (list
(make-instance 'person-project :id 2 :person-id 1 :project-id 1 :percent-allocated 100)
(make-instance 'person-project :id 1 :person-id 2 :project-id 1 :percent-allocated 70)
(make-instance 'person-project :id 3 :person-id 3 :project-id 1 :percent-allocated 10)
(make-instance 'person-project :id 4 :person-id 4 :project-id 2 :percent-allocated 40)
(make-instance 'person-project :id 5 :person-id 5 :project-id 2 :percent-allocated 25)
(make-instance 'person-project :id 6 :person-id 2 :project-id 3 :percent-allocated 30)
(make-instance 'person-project :id 7 :person-id 3 :project-id 3 :percent-allocated 90)
(make-instance 'person-project :id 8 :person-id 4 :project-id 1 :percent-allocated 60)
(make-instance 'person-project :id 9 :person-id 5 :project-id 3 :percent-allocated 75)
(make-instance 'person-project :id 10 :person-id 6 :project-id 2 :percent-allocated 100)))

(defvar *project-list*)
(setf *project-list*
(list (make-instance 'project :id 1 :name
"Project Alpha")
(make-instance 'project :id 2 :name
"Project Beta")
(make-instance 'project :id 3 :name
"Project Gamma")))
(defun find-item-by-id (id item-list)
"Returns the item from the list with the specific id"
(remove-if-not #'(lambda (x) (= id (id x))) item-list))
(defun find-person (id)
"Returns an instance of person"
(car (find-item-by-id id *person-list*)))
(defun find-project-by-id (id)
"Returns the project with the specific id"
(car (remove-if-not #'(lambda (x) (= id (id x))) *project-list*)))
(defun find-manager (person)
"Returns the person's manager"
(car (remove-if-not #'(lambda (x) (= (manager-id person) (id x))) *person-list*)))

(defun find-projects-by-person (person)
"Returns a list of project instances connected to the person"
(mapcar #'(lambda (x) (find-project-by-id x))
(mapcar #'(lambda (x) (project-id x))
(remove-if-not #'(lambda (x) (= (id person) (person-id x)))
*person-project-list*))))

(defun find-project-percent (person project)
(percent-allocated (car (remove-if-not #'(lambda (x)
(and (= (person-id x) (id person))
(= (project-id x) (id project))))
*person-project-list*))))

If you are wondering, yes, it was deliberate to not put the persons in id order, just to see how it would affect the diagrams. Anyway, this just sets the stage for the comparison of the two dot-diagram libraries.

The first thing we need to do in the cl-dot version is create methods for a few generic methods. Notice that these methods will be tied to creating a manager-graph. You will have to redefine the methods to create other types of graphs.

(defmethod cl-dot:graph-object-node ((graph (eql 'manager-graph)) (object person))
(let ((name (name object)))
(make-instance 'cl-dot:node
:attributes `(:label ,name
:shape :box))))

(defmethod cl-dot:graph-object-points-to ((graph (eql 'manager-graph)) (object person))
(unless (= (id object) (manager-id object))
(list
(make-instance 'cl-dot:attributed
:object (find-manager object)
:attributes '(:weight 1)))))

Now to try it out. Replace the path name for the file with your requirements. You can also change the format to jpg, png, ps, pdf etc.

(let ((dgraph (cl-dot:generate-graph-from-roots 'manager-graph *person-list* '(:rankdir "RL"))))

(cl-dot:dot-graph dgraph "/home/sabra/test/manager-test.png" :format :png))

The first thing I notice is that the arrows are pointing from the person to the manager. If I wanted the arrows pointing the other way, I need to change things a bit. There is a generic method for "graph-object-pointed-to-by". This is described as "The edges will be directed from the other objects to this one." So let's try that.
(defmethod cl-dot:graph-object-pointed-to-by ((graph (eql 'manager-graph)) (object person))

;;Don't want people pointing at themselves
(unless (= (id object) (manager-id object))
(list
(make-instance 'cl-dot:attributed
:object (find-manager object)
:attributes '(:weight 1)))))

Ok. That results in a complete mess. I would have expected the same chart, but with double sets of arrows. Instead, I got something rather different. You might try it and see what you get. If we null out the "graph-object-points-to" method. We do get what we expected.

Now let's see if we can come up with a sensible graph for pointing people at projects and indicating, on the edges, the percentage of time each person spends on the project.

First, we need another node method for persons, this time tied to a project-graph. Then we need a node method for projects and a "points to" method showing persons pointing at projects, again all tied to a project graph.
(defmethod cl-dot:graph-object-node ((graph (eql 'project-graph)) (object person))
(let ((name (name object)))
(make-instance 'cl-dot:node
:attributes `(:label ,name
:shape :box))))

(defmethod cl-dot:graph-object-node ((graph (eql 'project-graph)) (object project))
(let ((name (name object)))
(make-instance 'cl-dot:node
:attributes `(:label ,name
:shape :box))))

(defmethod cl-dot:graph-object-points-to ((graph (eql 'project-graph)) (object person))
(mapcar #'(lambda (x)
(let ((edge-name (format nil "~a%" (find-project-percent object x))))
(make-instance 'cl-dot:attributed
:object x
:attributes `(:weight 1 :label ,edge-name))))
(find-projects-by-person object)))

At this point, calling something like:
(let ((dgraph (cl-dot:generate-graph-from-roots 'project-graph *person-list* '(:rankdir "LR"))))

(cl-dot:dot-graph dgraph "/home/sabra/test/project-test.png" :format :png))
results in a graph, but not a particularly attractive one. In addition, you will note that the persons are arranged in reverse order to the person list. That is something you are going to have to work out on your own.

Thursday, September 10, 2009

cl-ppcre and replacing strings

So, while I am digging into the macro that is (time), here is a quick note on using cl-ppcre in replacing text in a string.

First simple example.
Let's take a string, break it into separate words with cl-ppcre:split
and return two values, the last word and everything except the last word.
This function is definitely not the most efficient or correct, but will
do for the example.


(let* ((str "John and Susan Q. Public-and-Private")
(last-name (first (last (cl-ppcre:split "\\s+" str))))
(first-name (cl-ppcre:regex-replace last-name str "")))
(values first-name last-name)


More complicated example
Assume we have already defined a variable which holds a text string of legal text which has references to law cites. I want to insert a link around the code sections, but the only actual link we want is the section number itself. We'll use the regex that we used in the last entry as our code-regex pattern.


(cl-ppcre:regex-replace-all code-regex sample-text
'("<a href=\"Laws-display?name=>" :match "\">" :match "</a>"))

"<a href=\"Laws-display?name=>Section 882(a) \">Section 882(a) </a>imposes US tax on a foreign
corporation engaged in a trade or business within the US on its income which is effectively
connected with the conduct of a trade or business inside the US.
<a href=\"Laws-display?name=>Section 864(c)(3). \">Section 864(c)(3). </a>All income from
sources within the United States shall be treated as effectively connected with the conduct of a trade
or business within the United States. Treas. Reg. 1.867-7(a) states that income from the
purchase and sale of personal property shall be treated as derived entirely from the country in
which the property is sold. Treas. Reg. 1.867-7(c) states that a sale of personal property is
consummated at the time when, and the place where, the rights, title and interest of the seller in
the property are transferred to the buyer. <a href=\"Laws-display?name=Section 865(b)
\">Section 865(b) </a>In the case of income derived from the sale of inventory property, such
income shall be sourced under the rules of <a href=\"Laws-display?name=>sections 861(a)(6),
\">sections 861(a)(6), </a><a href=\"Laws-display?name=>section 862(a)(6) \">section 862(a)(6)
</a>and <a href=\"Laws-display?name=>section 863.\">section 863.</a>

<a href=\"Laws-display?name=>Section 861(a)(6) \">Section 861(a)(6) </a>treats inventory
purchased outside the US and sold in the US as US source income. <a href=\"Laws-
display?name=>Section 862(a)(6) \">Section 862(a)(6) </a>treats inventory purchased inside the
US and sold outside the US as foreign source income. <a href=\"Laws-display?name=>Section 863(b)
\">Section 863(b) </a>would allow a split for inventory produced by the taxpayer inside the US
and sold outside the US. <a href=\"Laws-display?name=>Section 865(e)(2)(A) \">Section 865(e)
(2)(A) </a>states that if a nonresident maintains an fixed place of business inside the US., any
sale of inventory attributable to that fixed place of business is sourced in the US regardless
of where the sale occurs. <a href=\"Laws-display?name=>Section 865(e)(2)(B) \">Section 865(e)
(2)(B) </a>states that (A) does not apply if an office of the taxpayer in a foreign country
materially participates in the sale."

This gets us part way there in that we now have a link around the match. However, we really want only the first number and not the word "section". We will want the subsections later, but need to see if we can call a function first. How about a different approach?


(cl-ppcre:regex-replace-all code-regex
sample-text
#'(lambda (match &rest registers)
(concatenate 'string (first registers) "<a href=\"Laws-display?name=" (second registers) "\">"
(second registers) "</a>"))
:simple-calls t)
[Error message about match not being used]

"Section <a href=\"Laws-display?name=882\">882</a>imposes US tax on a foreign corporations
engaged in a trade or business within the US on its income which is effectively connected with
the conduct of a trade or business inside the US. Section <a href=\"Laws-display?name=864
\">864</a>All income from sources within the United States shall be treated as effectively
connected with the conduct of a trade or business within the United States. Treas. Reg.
1.867-7(a) states that income from the purchase and sale of personal property shall be treated
as derived entirely from the country in which the property is sold. Treas. Reg. 1.867-7(c)
states that a sale of personal property is consummated at the time when, and the place where,
the rights, title and interest of the seller in the property are transferred to the buyer.
Section <a href=\"Laws-display?name=865\">865</a>In the case of income derived from the sale of
inventory property, such income shall be sourced under the rules of sections <a href=\"Laws-
display?name=861\">861</a>section <a href=\"Laws-display?name=862\">862</a>and section <a
href=\"Laws-display?name=863\">863</a>Section <a href=\"Laws-display?name=861\">861</a>treats
inventory purchased outside the US and sold in the US as US source income. Section <a
href=\"Laws-display?name=862\">862</a>treats inventory purchased inside the US and sold outside
the US as foreign source income. Section <a href=\"Laws-display?name=863\">863</a>would allow a
split for inventory produced by the taxpayer inside the US and sold outside the US. Section <a
href=\"Laws-display?name=865\">865</a>states that if a nonresident maintains an fixed place of
business inside the US., any sale of inventory attributable to that fixed place of business is
sourced in the US regardless of where the sale occurs. Section <a href=\"Laws-display?name=865
\">865</a>states that (A) does not apply if an office of the taxpayer in a foreign country
materially participates in the sale."


Now it looks better.

Tuesday, September 8, 2009

Simple use of regex matching in profiling

I mentioned yesterday that one of the first things I looked at in lisp was how to concatenate strings. There are several ways and I was curious in how they compared from a system time perspective. Enter the time function. Using the sbcl implementation of time, I tried this:


(time (format nil "~a~a" "string1" "string2"))
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
60,264 processor cycles
3,760 bytes consed

"string1string2"


Good information, but when I am comparing many different functions which do the same thing, and compare them when using different string lengths, this printout is going to get long quick. But also notice a problem. The time function prints out some information, then returns the result of the function being timed. We don't want the result, we just want that info, as a string, to run against a regex. Hmmm. Time to look into the documentation for sbcl. Once we can get that information as a string, then we can run a regex function something like the one below, to generate an html row that we can put in a webpage table.


(defun timing-row (timing-string)
"This is to build an html table row with the output of time when used to profile a function"
(let* ((timing (cl-ppcre:all-matches-as-strings "[\.,0-9]+" timing-string))
(total-run (second timing))
(user-run (third timing))
(system-run (fifth timing))
(cycles (seventh timing))
(bytes-consed (eighth timing)))
(concatenate 'string "" bytes-consed "" cycles "" total-run ""
system-run "" user-run "")))

Monday, September 7, 2009

Beginning Regex with cl-ppcre - Matching

The basic lisp library for dealing in regex patterns is found at Cl-ppcre. The biggest difference between cl-ppcre regex patterns and perl patterns is lisp requires that the backslash be doubled.

You can find a lot of regex tutorial locations on the web. Some places to check would be:
Pattern Matching
Regex Lib,
Regex Buddy,
http://www.pcre.org,

Now lets try some things with regex. We are going to start really simple with scan and matches. Other posts will address creating a scanner and replacing text. Please, anyone who really understands cl-ppcre and lisp, correct my mistakes.

Starting really simple with the scan function:

(cl-ppcre:scan "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
5
12
#(6)
#(11)

What is happening here? The regex pattern looks for a space, followed by everything not a space, until it finds another space. cl-ppcre:scan returns the location in the string before the match, the location after the match and the beginning and end of the match. If all you want is a t or nil, you can create a function like: Returns t if there is a match:



(defun matchesp (rep string)
"Takes a regex pattern and a string, return t if there is a match and nil if not."
(not (null (cl-ppcre:scan rep string))))

(matchesp ".+" "abcdef") => T
(matchesp "x+" "abcdef") => NIL

cl-ppcre:scan-to-strings returns the string match itself.

(cl-ppcre:scan-to-strings "\\s+(\\S+)\\s+" "Lorem ipsum dolor sit amet")
" ipsum "
#("ipsum")

Let's change it slightly.

(cl-ppcre:scan-to-strings "\\s+(\\S+_id)\\s+" "id Lorem ipsum dolor_id, sit_id amet_id")
" sit_id "
#("sit_id")

Notice that this doesn't pick up dolor_id because it is followed by a comma, rather than a space and only picks up the first match, not both first and second matches. Notice that it is also picking up the spaces in the match. So lets try again.
Now lets try a more complicated match. This is a lisp regex pattern for US Federal Statute Cites. When you look at the sample text, you will notice that we need to pick up certain citations, but we are trying to avoid Treasury Regulation citations, which look somewhat similar. Because we are just running this from the command line, I'll just define a global variable and set it to the pattern, and define another global variable for the sample text to be used.

(defvar code-regex)
(setf code-regex "\\s+(section |sec\\. |sections |§ |§|§|§ )([0-9]+[^-,.\\s])+(\\(\\w+\\))?(\\([0-9]*\\))?(\\(\w+\\))?(\\(\\w+\\))?([\.|,|\\s])+")
(defvar sample-text)
(setf sample-text "Section 882(a) imposes US tax on a foreign corporations engaged in a trade or
business within the US on its income which is effectively connected with the conduct of a trade
or business inside the US.
Section 864(c)(3). All income from sources within the United States shall be treated as
effectively connected with the conduct of a trade or business within the United States. Treas.
Reg. 1.867-7(a) states that income from the purchase and sale of personal property shall be
treated as derived entirely from the country in which the property is sold.
Treas. Reg. 1.867-7(c) states that a sale of personal property is consummated at the time when,
and the place where, the rights, title and interest of the seller in the property are
transferred to the buyer.
Section 865(b) In the case of income derived from the sale of inventory property, such income
shall be sourced under the rules of sections 861(a)(6), section 862(a)(6) and section 863.
Section 861(a)(6) treats inventory purchased outside the US and sold in the US as US source
income. Section 862(a)(6) treats inventory purchased inside the US and sold outside the US as
foreign source income. Section 863(b) would allow a split for inventory produced by the taxpayer
inside the US and sold outside the US.
Section 865(e)(2)(A) states that if a nonresident maintains an fixed place of business inside
the US., any sale of inventory attributable to that fixed place of business is sourced in the US
regardless of where the sale occurs. Section 865(e)(2)(B) states that (A) does not apply if an
office of the taxpayer in a foreign country materially participates in the sale.")

(cl-ppcre:all-matches-as-strings code-regex sample-text)
(" sections 861(a)(6), " " section 863.
")

Ok. Not exactly what we need. It did not pick up where the word "Section" had initial caps. While many languages you could try to make this case insenstive by using the control character /i otherwise known as ignore case, cl-ppcre does not follow this. To use case insensitive patterns, you might look at cl-interpol. In our case, however, we can insert [Ss] because we are really looking just for initial caps. Thus the pattern becomes:

(setf code-regex "\\s+([sS]ection |[Ss]ec\\. |[Ss]ections |§ |§|§|§ )([0-9]+[^-,.\\s])+(\\(\\w+\\))?(\\([0-9]*\\))?(\\(\w+\\))?(\\(\\w+\\))?([\.|,|\\s])+")

and the result is:

(cl-ppcre:all-matches-as-strings code-regex sample-text)
("
Section 864(c)(3). "
"
Section 865(b) "
" sections 861(a)(6), " " section 863.
"
" Section 862(a)(6) " " Section 863(b) " "
Section 865(e)(2)(A) "
" Section 865(e)(2)(B) ")

But you notice that there are still a few problems. We did not pick up the leading Section 882(a) because it does not follow a space. One of the matches for sections 861(a)(6) also has a trailing comma, another has a trailing period. Finally, we are picking up line feeds or carriage returns as well. So we will need to fix those on another day.

First Post

OK. It is going to take me awhile to figure out how to do what I want to do with blogspot. Argh. I want real html table formatting people.

I'm starting to look at common lisp and I do a lot of text data manipulation. So one of the first things I looked at was string concatenation. Hmm. Lots of ways of doing it, some of them faster than others.

After reading these entries: http://ryepup.unwashedmeme.com/blog/2007/12/26/string-building-comparison-with-lisp and Concatenate is slow on string concatenation, I decided to
do some timing on my own. If I am interpreting the numbers
correctly, the arnesi:join-strings function is almost always faster
on short strings. HOWEVER, if you replace the short strings with
long strings, it actually becomes the slowest if *PRINT_PRETTY* is
nil. If *PRINT-PRETTY* is T, it is still faster than format and concatenate. I will put up tables as soon as I can figure out how I can format them to my liking here.
the short string results: