Write a compiler next time

I was recently tempted to write a hand optimized version of the boolean expression in the following Java method:


  public static boolean endsWithReal(String t) {
     return t.endsWith( ".ram")  || t.endsWith( ".rpm") ||
            t.endsWith( ".rm")   || t.endsWith( ".ra")  ||
	    t.endsWith( ".smi")  || t.endsWith( ".smil");
    }

I thought writing such code by hand, and testing it, would take a half hour or so. And there were only two pieces of code I wanted to optimized, though one day there might be more. However, writing the optimized code correctly was tricky because it required 20 "&&" and "||"'s and took over 500 characters.

So after struggling with this for a while, I believed it wouldn't take too long to write a compiler to generate the code. The compiled code could be tested for correctness, and be reused for similar applications. I also hoped to learn something.

This document describes the approach:

This is not as much work as it might seem. Since the language we are writing a compiler or is quite tiny, each step does not require much code. The lines of code required for each step is show in "()" above. The total was 107. Also, using Scheme lets one focus on code transformation without a lot of code. Using another function language, like Haskel, might take even less code.

Write a simple interpreter

We start with a straight forward version written in SILK:


(define patterns (list ".ra" ".rm" ".ram" ".rpm" ".smi" ".smil"))

(define (endsWith target patterns)   (matches target patterns #t))
(define (startsWith target patterns) (matches target patterns #f))

(define (matches target patterns fromEnd)
  (let ((lt (.length target)))
    (define (loop patterns)
      (and (not (null? patterns))
	   (or (match0 (car patterns))
	       (loop (cdr patterns)))))
    (define (match0 pattern)
      (let ((lp (.length pattern)))
	(define (match off i)
	  (define (sameChar)
	    (eqv? (.charAt target (- off i)) (.charAt pattern (- lp i))))
	  (or (= i 0) (and (sameChar) (match off (- i 1)))))
	(and (>= lt lp)
	     (match (if fromEnd lt lp) lp))))
    (loop patterns))

The (matches) procedure talkes a target string and matches it against a list of pattern strings, returning when the first match occurs. The match starts from the end of the target string if fromEnd is true, otherwise it starts from the beginning.

To make the compiler generation easier, this is written in a functional style using (and) and (or) rather than (if).

One can think of this as a simple interpreter that takes a program specified by patterns and fromEnd. The interpreter runs the program against the input data target.


Test that the interpreter works

Here are some tests. Should be more?


(begin
  (assert (endsWith "foo.ram" patterns))
  (assert (not (endsWith "foo.bar" patterns)))
  (assert (startsWith "happy.bar" (list "happy")))
  (assert (not (startsWith "Unhappy.bar" (list "happy")))))


Transform the interpreter into a code generator

If fromEnd and patterns are, static, that is, known at compile time, and target is dynamic, that is, known at runtime, we can propagate this information through (matches). Here, red indicates a static expression that produces a compile time constant, green represents a dynamic expression that must be evaluated at runtime, and blue represents an expression that must be checked at compile time to see if it is static or dynamic.


(define (matches target patterns fromEnd)
  (let ((lt (.length target)))
    (define (loop patterns)
      (and (not (null? patterns))
	   (or (match0 (car patterns))
	       (loop (cdr patterns)))))
    (define (match0 pattern)
      (let ((lp (.length pattern)))
	(define (match off i)
	  (define (sameChar)
	    (eqv? (.charAt target (- off i)) (.charAt pattern (- lp i))))
	  (or (= i 0) (and (sameChar) (match off (- i 1)))))
	(and (>= lt lp)
	     (match (if fromEnd lt lp) lp))))
    (loop patterns))

Using this analysis we can write the following code generator. The generated code assumes that the following registers are available:


t - the target string.
lt - the length of the target string.


(define (gen-matches patterns fromEnd)
  (define (loop patterns)
    (and (not (null? patterns))
	 `(or ,(match0 (car patterns))
	      ,(loop (cdr patterns)))))
  (define (match0 pattern)
    (let ((lp (.length pattern)))
      (define (match off i)
	(define (sameChar)
	  (if (number? off)
	      `(eqv? (.charAt t ,(- off i)) ,(.charAt pattern (- lp i)))
	      `(eqv? (.charAt t (- ,off ,i)) ,(.charAt pattern (- lp i)))))
	(or (= i 0) `(and ,(sameChar) ,(match off (- i 1)))))
      `(and (>= lt ,lp)
	    ,(match (if fromEnd 'lt lp) lp))))
  (loop patterns))

The code generator produces the following code:


> (pretty-print (gen-matches patterns #t))
(or (and (>= lt 3)
         (and (eqv? (.charAt t (- lt 3)) #\.)
              (and (eqv? (.charAt t (- lt 2)) #\r)
                   (and (eqv? (.charAt t (- lt 1)) #\a) #t))))
    (or (and (>= lt 3)
             (and (eqv? (.charAt t (- lt 3)) #\.)
                  (and (eqv? (.charAt t (- lt 2)) #\r)
                       (and (eqv? (.charAt t (- lt 1)) #\m) #t))))
        (or (and (>= lt 4)
                 (and (eqv? (.charAt t (- lt 4)) #\.)
                      (and (eqv? (.charAt t (- lt 3)) #\r)
                           (and (eqv? (.charAt t (- lt 2)) #\a)
                                (and (eqv? (.charAt t (- lt 1)) #\m) #t)))))
            (or (and (>= lt 4)
                     (and (eqv? (.charAt t (- lt 4)) #\.)
                          (and (eqv? (.charAt t (- lt 3)) #\r)
                               (and (eqv? (.charAt t (- lt 2)) #\p)
                                    (and (eqv? (.charAt t (- lt 1)) #\m)
                                         #t)))))
                (or (and (>= lt 4)
                         (and (eqv? (.charAt t (- lt 4)) #\.)
                              (and (eqv? (.charAt t (- lt 3)) #\s)
                                   (and (eqv? (.charAt t (- lt 2)) #\m)
                                        (and (eqv? (.charAt t (- lt 1)) #\i)
                                             #t)))))
                    (or (and (>= lt 5)
                             (and (eqv? (.charAt t (- lt 5)) #\.)
                                  (and (eqv? (.charAt t (- lt 4)) #\s)
                                       (and (eqv? (.charAt t (- lt 3)) #\m)
                                            (and (eqv? (.charAt t (- lt 2))
                                                       #\i)
                                                 (and (eqv? (.charAt
                                                              t
                                                              (- lt 1))
                                                            #\l)
                                                      #t))))))
                        #f))))))


Optimize the generated code

The generated code still performs redundant tests because the generator produces an or of ands. The redundant tests can be removed by reorganizing the and/or tree. This requires three changes to (gen-matches), mared in bold:

(define (gen-matches-2 patterns fromEnd)
  (define (loop patterns)
    (and (not (null? patterns))
	 (make-or (match0 (car patterns))
		  (loop (cdr patterns)))))
  (define (match0 pattern)
    (let ((lp (.length pattern)))
      (define (match off i)
	(define (sameChar)
	  (if (number? off)
	      `(eqv? (.charAt t ,(- off i)) ,(.charAt pattern (- lp i)))
	      `(eqv? (.charAt t (- ,off ,i)) ,(.charAt pattern (- lp i)))))
	(or (= i 0) (make-and (sameChar) (match off (- i 1)))))
      (make-and `(>= lt ,lp)
		(match (if fromEnd 'lt lp) lp))))
  (loop patterns))

Procedures (make-or) and (make-and) construct an and/or tree applying optimizations as it is constructed. They assume that the tests perform no side effect, which is the case here.

(define (or? x) (and (pair? x) (eq? (car x) 'or)))
(define or-a cadr)
(define or-b caddr)

(define (and? x) (and (pair? x) (eq? (car x) 'and)))
(define and-a cadr)
(define and-b caddr)

(define (make-and a b)
  (cond
   ((eq? a #t) b)
   ((eq? a #f) #f)
   ((eq? b #t) a)
   ((eq? b #f) #f)
   ((equal? a b) a)
   (else (list 'and a b))))

(define (make-or a b)
  (cond
   ((eq? a #t) #t)
   ((eq? a #f) b)
   ((eq? b #t) #t)
   ((eq? b #f) a)
   ((equal? a b) a)
   ((and (and? a) (and? b) (equal? (and-a a) (and-a b)))
    (make-and (and-a a) (make-or (and-b a) (and-b b))))
   ((and (and? a) (or? b) (and? (or-a b))
	 (equal? (and-a a) (and-a (or-a b))))
    (make-or (make-and (and-a a)
		       (make-or (and-b a)
				(and-b (or-a b))))
	     (or-b b)))
   (else (list 'or a b))))

Here are some tests.


(assert (equal? (make-or '(and a b) '(and a c)) `(and a (or b c))))

(assert (equal? (make-or '(and a b) '(or (and a c) d)) '(or (and a (or b c)) d)))

The resulting generated code looks like this:


> (pretty-print (gen-matches-2 patterns #t))
(or (and (>= lt 3)
         (and (eqv? (.charAt t (- lt 3)) #\.)
              (and (eqv? (.charAt t (- lt 2)) #\r)
                   (or (eqv? (.charAt t (- lt 1)) #\a)
                       (eqv? (.charAt t (- lt 1)) #\m)))))
    (or (and (>= lt 4)
             (and (eqv? (.charAt t (- lt 4)) #\.)
                  (or (and (eqv? (.charAt t (- lt 3)) #\r)
                           (or (and (eqv? (.charAt t (- lt 2)) #\a)
                                    (eqv? (.charAt t (- lt 1)) #\m))
                               (and (eqv? (.charAt t (- lt 2)) #\p)
                                    (eqv? (.charAt t (- lt 1)) #\m))))
                      (and (eqv? (.charAt t (- lt 3)) #\s)
                           (and (eqv? (.charAt t (- lt 2)) #\m)
                                (eqv? (.charAt t (- lt 1)) #\i))))))
        (and (>= lt 5)
             (and (eqv? (.charAt t (- lt 5)) #\.)
                  (and (eqv? (.charAt t (- lt 4)) #\s)
                       (and (eqv? (.charAt t (- lt 3)) #\m)
                            (and (eqv? (.charAt t (- lt 2)) #\i)
                                 (eqv? (.charAt t (- lt 1)) #\l))))))))


Test that the generated code works correctly

The code generator can be tested by wrapping it in a macro:

(define-macro (matches target patterns fromEnd)
  `(let* ((t ,target)
	  (lt (.length target)))
     ,(gen-matches-2 patterns fromEnd)))
     
(define (endsWith-real target)
  (matches target (".ra" ".rm" ".ram" ".rpm" ".smi" ".smil") #t))

(begin
  (assert (endsWith-real "foo.ram"))
  (assert (not (endsWith-real "foo.bar"))))

The macro sets up the t and lt registers before generating the code. The patterns and fromEnd values are passed to the macro as constants.


Generate code for the target language, Java

The compiler can be quite simple because it only deals with a tiny part of Java, boolean expressions. Using SILK mkaes it easy to identify and generate code for instance methods.


(define (compile exp)
  (define (converter op) (if (equal? op "eqv?") "==" op))
  (define (dotted? exp)
    (and (pair? exp)
	 (symbol? (car exp))
	 (eqv? (.charAt (symbol->string (car exp)) 0) #'.')))	       
  (cond ((pair? exp)
	  (cond
	   ((or? exp)
	    (string-append "(" (compile (or-a exp)) " || " (compile (or-b exp)) ")"))
	   ((and? exp)
	    (string-append (compile (and-a exp)) " && " (compile (and-b exp))))
	   ((dotted? exp) (compile-dotted exp))
	   (else (string-append (compile (list-ref exp 1)) " "
			      (converter (compile (list-ref exp 0))) " "
			      (compile (list-ref exp 2))))))
	((char? exp) (string-append "'" exp "'"))
	(else (string-append exp))))

(define (compile-dotted exp)
  (string-append (compile (list-ref exp 1))
		 (list-ref exp 0) "(" (compile (list-ref exp 2)) ")"))

The resulting code looks like this. I'm glad I didn't have to write it by hand. It was hard enough to indent!

> (compile (gen-matches-2 patterns #t))

  "(lt >= 3 && t.charAt(lt - 3) == '.' && t.charAt(lt - 2) == 'r' &&
   (t.charAt(lt - 1) == 'a' ||
    t.charAt(lt - 1) == 'm') ||
   (lt >= 4 && t.charAt(lt - 4) == '.' &&
    (t.charAt(lt - 3) == 'r' &&
     (t.charAt(lt - 2) == 'a' && t.charAt(lt - 1) == 'm' ||
      t.charAt(lt - 2) == 'p' && t.charAt(lt - 1) == 'm') ||
     t.charAt(lt - 3) == 's' && t.charAt(lt - 2) == 'm' &&
     t.charAt(lt - 1) == 'i') ||
    lt >= 5 && t.charAt(lt - 5) == '.' && t.charAt(lt - 4) == 's' &&
    t.charAt(lt - 3) == 'm' && t.charAt(lt - 2) == 'i' &&
    t.charAt(lt - 1) == 'l'))"