Mittwoch, 21. September 2011

Extremely fast regular expressions for Common Lisp

CL-PPCRE is a perl compatible regular expressions module for Common Lisp. Dough Hoyte wrote in his Book Let over Lambda following about it (http://tinyurl.com/6ylpt8y):

CL-PPCRE is fast. Really fast. When compiled with a good native code compiler, benchmarks suggest that for most regular expressions CL-PPCRE is roughly twice as fast as Perl, often much faster. And Perl has one of the fastest non-lisp regular expression engines around: a highly optimised engine written in C.

And cl-ppcre gives you much more, for example, look to following python expression:
>>> re.findall(r'[(].*[)]', "abc(de(fgi)jk(lm)no)pq(rst)x")
['(de(fgi)jk(lm)no)pq(rst)']

You could achieve this with cl-ppcre as follows:
CL-USER> (ppcre:all-matches-as-strings
           "[(].*[)]"
           "abc(de(fgi)jk(lm)no)pq(rst)x")
("(de(fgi)jk(lm)no)pq(rst)")

But you can also use the much more powerfull and better readable lisp syntax:
CL-USER> (ppcre:all-matches-as-strings
           `(:sequence #\(
                       (:greedy-repetition 0 nil :everything)
                       #\))
           "abc(de(fgi)jk(lm)no)pq(rst)x")
("(de(fgi)jk(lm)no)pq(rst)")

It's very simple, but assume we have following function in Lisp:
(defun make-balscan (begin end &optional (deep 0))
  (lambda (pos &optional (chr (aref ppcre::*string* pos)))
    (unless (and (eql chr end) (= deep 0))
      (cond ((eql chr begin) (incf deep))
            ((eql chr end) (decf deep)))
      (1+ pos))))

Then you could write following regular expression:
CL-USER> (ppcre:all-matches-as-strings
           `(:sequence #\(
                       (:greedy-repetition 0 nil
                         (:filter ,(make-balscan #\( #\))))
                       #\))
           "abc(de(fgi)jk(lm)no)pq(rst)x")
("(de(fgi)jk(lm)no)" "(rst)")

Congratulations! You have got a regular expression which takes care about balanced parenthesis!
Try to do that in Python, Perl, AWK, JAVA, C/C++, or what ever language you want :-)