I realized the other day that I’ve only made one post in the “coding” category. I don’t spend as much of my free time writing code as I used to, but I figured I still do enough that it deserves more than one post. So here’s a quick write up about the latest mini-project I’ve been working on.
I’ve been implementing a (very) simple regular expression library in Python. It doesn’t really compare to the standard library’s ‘re’ module or to Perl’s regular expressions, but I’m mainly doing it to learn the underlying algorithms, and those are similar in the more powerful implementations.
To be honest, the project is mostly a rewrite of Haskell code I wrote and forgot about a few years back. I wanted to add new features, but I didn’t want to use Haskell, so I thought I’d rewrite the whole thing in Python. I got a little ahead of myself, though, and started adding new regex syntax before all of the old features were done. The table below (from the Github project Wiki) shows the supported regex syntax in both version:
| Feature |
Example |
Python |
Haskell |
| Concatenation |
abc |
X |
X |
| Alternation |
a|b |
X |
X |
| Grouping |
(ab)* |
X |
X |
| Zero or one |
a? |
X |
|
| Closure (zero or more) |
a* |
X |
X |
| One or more |
a+ |
X |
X |
| Numeric quantifiers |
a{1,3} |
X |
|
| Character classes |
[a-z] |
X |
|
| POSIX named character classes |
[:alpha:] |
X |
|
| Negative character classes |
[^a-z] |
|
|
For most people, the only remotely interesting part of the project are probably the neat finite automata graphs it can produce using GraphViz:

The Wikipedia page for finite automata does a pretty good job explaining what the graph actually means.
There’s only one part of the Haskell code I still need to rewrite: the NFA to DFA conversion routine. I’m embarrassed to admit it, but I’ve been putting it off because my original Haskell code is so bad. I’m tempted to say it’s the worst code I’ve ever written:
-- Convert an Nfa to a Dfa using the algorithms from section 3.7.1 of the "Dragon book"
toDfa :: Nfa -> Dfa
toDfa nfa =
let
dstrt = e_closure nfa (n_start nfa)
init = Map.fromList [(dstrt, genStates dstrt)]
dfa = getps init
stateNames = genStateMap dfa
partial = Map.mapKeys (\x -> lookupDefault x 0 stateNames) dfa
dfaTrans = Map.map (\x -> Map.map (\y -> (lookupDefault y 0 stateNames)) x) partial
accepting = Set.fromList
(Map.elems
(Map.filterWithKey
(\k x ->
(Set.size (Set.intersection (n_accepting nfa) k)) > 0
)
stateNames
)
)
in
Dfa dfaTrans (n_start nfa) accepting (n_rgx nfa)
where
getps old =
if (new == old)
then new
else (getps new)
where
new = generateProductions old
generateProductions :: Map.Map (StateSet) (Map.Map NfaChar (StateSet))
-> Map.Map (StateSet) (Map.Map NfaChar (StateSet))
generateProductions fad = foldl
(\acc x ->
let
gt = (genStates x)
in
Map.insert x gt acc
) fad (getProductionStates fad)
genStateMap :: Map.Map (StateSet) (Map.Map NfaChar (StateSet))
-> Map.Map (StateSet) Int
genStateMap dfa = foldl
(\acc x ->
let
(y, sm) = mapState acc x
in
sm
) Map.empty (Map.keys dfa)
getProductionStates fad = (concat
(map
(\x -> Map.elems x)
(Map.elems fad)
)
)
mapState sm set = case Map.lookup set sm of
Just y -> (y, sm)
Nothing -> let ms = Map.size sm
in (ms, Map.insert set ms sm)
genStates set = Map.fromList (filter (\(x,y) -> not (Set.null y))
(Set.toList
(Set.map
(\ch -> (ch, e_closure_set nfa (nfa_move nfa set ch)))
inSyms
)
))
inSyms = Set.delete Epsilon (Set.fromList (Map.fold (\x acc -> acc ++ Map.keys x) [] (n_transitions nfa)))
I’m actually a bit disappointed in myself for turning the compiler book’s 10 lines of psuedo-code into that mess. All I know is that when I finally get around to writing the Python version, it will be nothing like that.
The latest code is available at GitHub.