jeremiah blog

  • About

A simple Python regex library

Posted by jl2 on July 20, 2010
Posted in: coding. Tagged: coding, python, regex.

I realized the other day that I’ve only made one post in the “coding” category. I don’t spend as much of my free time writing code as I used to, but I figured I still do enough that it deserves more than one post. So here’s a quick write up about the latest mini-project I’ve been working on.

I’ve been implementing a (very) simple regular expression library in Python. It doesn’t really compare to the standard library’s ‘re’ module or to Perl’s regular expressions, but I’m mainly doing it to learn the underlying algorithms, and those are similar in the more powerful implementations.

To be honest, the project is mostly a rewrite of Haskell code I wrote and forgot about a few years back. I wanted to add new features, but I didn’t want to use Haskell, so I thought I’d rewrite the whole thing in Python. I got a little ahead of myself, though, and started adding new regex syntax before all of the old features were done. The table below (from the Github project Wiki) shows the supported regex syntax in both version:

Feature Example Python Haskell
Concatenation abc X X
Alternation a|b X X
Grouping (ab)* X X
Zero or one a? X  
Closure (zero or more) a* X X
One or more a+ X X
Numeric quantifiers a{1,3} X  
Character classes [a-z] X  
POSIX named character classes [:alpha:] X  
Negative character classes [^a-z]    

For most people, the only remotely interesting part of the project are probably the neat finite automata graphs it can produce using GraphViz:

Regular Expression NFA

The Wikipedia page for finite automata does a pretty good job explaining what the graph actually means.

There’s only one part of the Haskell code I still need to rewrite: the NFA to DFA conversion routine. I’m embarrassed to admit it, but I’ve been putting it off because my original Haskell code is so bad. I’m tempted to say it’s the worst code I’ve ever written:

-- Convert an Nfa to a Dfa using the algorithms from section 3.7.1 of the "Dragon book"
toDfa :: Nfa -> Dfa
toDfa nfa =
    let
        dstrt = e_closure nfa (n_start nfa)
        init = Map.fromList [(dstrt, genStates dstrt)]
        dfa = getps init
        stateNames = genStateMap dfa
        partial = Map.mapKeys (\x -> lookupDefault x 0 stateNames) dfa
        dfaTrans = Map.map (\x -> Map.map (\y -> (lookupDefault y 0 stateNames)) x) partial
        accepting = Set.fromList
                    (Map.elems
                            (Map.filterWithKey
                                    (\k x ->
                                         (Set.size (Set.intersection (n_accepting nfa) k)) > 0
                                    )
                             stateNames
                            )
                    )
    in
      Dfa dfaTrans (n_start nfa) accepting (n_rgx nfa)
    where
      getps old =
          if (new == old)
            then new
            else (getps new)
          where
            new = generateProductions old

      generateProductions :: Map.Map (StateSet) (Map.Map NfaChar (StateSet))
                          -> Map.Map (StateSet) (Map.Map NfaChar (StateSet))
      generateProductions fad = foldl
                                (\acc x ->
                                     let
                                         gt = (genStates x)
                                     in
                                       Map.insert x gt acc
                                ) fad (getProductionStates fad)

      genStateMap :: Map.Map (StateSet) (Map.Map NfaChar (StateSet))
                  -> Map.Map (StateSet) Int
      genStateMap dfa = foldl
                        (\acc x ->
                             let
                                 (y, sm) = mapState acc x
                             in
                               sm
                        ) Map.empty (Map.keys dfa)
      getProductionStates fad = (concat
                                 (map
                                  (\x -> Map.elems x)
                                  (Map.elems fad)
                                 )
                                )

      mapState sm set = case Map.lookup set sm of
                          Just y -> (y, sm)
                          Nothing -> let ms = Map.size sm
                                     in (ms, Map.insert set ms sm)
      genStates set = Map.fromList (filter (\(x,y) -> not (Set.null y))
                      (Set.toList
                              (Set.map
                                      (\ch -> (ch, e_closure_set nfa (nfa_move nfa set ch)))
                                      inSyms
                              )
                      ))
      inSyms = Set.delete Epsilon (Set.fromList (Map.fold (\x acc -> acc ++ Map.keys x) [] (n_transitions nfa)))
view raw gistfile1.hsThis Gist brought to you by GitHub.

I’m actually a bit disappointed in myself for turning the compiler book’s 10 lines of psuedo-code into that mess. All I know is that when I finally get around to writing the Python version, it will be nothing like that.

The latest code is available at GitHub.

Posts navigation

← Pacific and Atlantic Peaks
Mt. Elbert Again →
  • My Other Stuff

    • 14ers.com
    • Github
    • Last.fm
    • Pinboard
    • Resume (xhtml)
    • SmugMug
    • Twitter
    • Vimeo
  • Friends

    • Gimlin
    • Robert Tadlock
Proudly powered by WordPress Theme: Parament by Automattic.