This is my first ever blog post, and I’m definitely looking for constructive feedback on how I can improve in future posts. Feel free to let me know in the comments or on Twitter @MaxScheiber!
The new CS student has had a limited exposure to parsing, usually in a procedural language. Here at the University of Pennsylvania, we have a couple of introductory programming assignments that require this, such as reading in a CSV dictionary file in CIS 120 and simulating an object file in CIS 240.
Ostensibly, students are taught that parsing should take place in a giant loop — for example, in Java:
Reader r = new BufferedReader (new FileReader ("file.txt"));
for (int c = r.read (); c != -1; c = r.read ()) {
// do stuff
}
This ingrains into students’ minds that the “right” way to parse files is one token at a time in a giant loop. However, this encourages code repetition and does not really provide a good level of abstraction. We spend too much time worrying about things like ASCII values, maintaining boolean flags, and ensuring proper control flow through nested loops and helper methods. In fact, this is why I think performing any kind of nontrivial file IO is one of the ways we write poorly encapsulated spaghetti code that ends up being a giant state machine.
Enter parser combinators. Parser combinators are parsing functions that are, well, combinations of other parsers. You can think of parser combinators as being recursively built; our base case is a parser that accepts one character, and we can build other parsers from that, like a parser for a word.
A functional programming language such as Haskell allows us to write a very clear parser combinator; in fact, this is how the powerful Haskell parsing library Parsec is built.
Let’s construct a basic parser combinator from scratch. Credit where credit is due — I adapted this from Dr. Stephanie Weirich‘s wonderful CIS 552 course at Penn. I originally learned the material from her lectures. I’m hoping to go into more depth with the explanations here.
Core Library
Good functional programming practices dictate that we keep IO as close to the top of our stack as possible, so we will write this library assuming that we have already read some file into a String. Moreover, let us assume that we are turning said String into some formatted type a. Said a can be anything from a String to an algebraic datatype of our choosing (which we will explore in a future post). Finally, let us assume that there may be multiple possible parses — i.e. we are parsing nondeterministically.
We can encapsulate all of this into a type definition!
newtype Parser a = P (String -> [ (a, String) ])
By our definition, a Parser is a function that takes a String and returns a list of all possible parses. A parse itself is simply our formatted type, plus whatever our parser hasn’t touched yet. Keep in mind that our nondeterministic approach allows us to easily represent a failed parse — the empty list! (If we were to write a deterministic parser, we would have to use an alternative method of representing a failed parsing attempt, such as Maybe or Either.)
We can’t avoid the basic idea of reading in one character at a time, and we even identified earlier that this is essentially the base case of our parser combinator. If we were writing normal Haskell and wanted to grab one character at a time, we would write code like this:
getChar :: String -> Char
getChar (c:_) = c
getChar [] = error "empty String does not have a head"
(This is obviously no different from the head function on generic Lists!) Since we’re in the context of a Parser, we can grab a character without worrying about failing!
getChar :: Parser Char
getChar = P $ \cs -> case cs of
(x:xs) -> [ (x, xs) ]
[] -> []
As you can tell, the getChar Parser takes a character and keeps it. If there’s no possible character to grab, instead of throwing an error, we fail gracefully by returning the empty list.
Now, as implicit in the type definition of Parser, we aren’t actually running anything — a Parser is, internally, a function waiting for a String. We can define a simple runner function to extract the function from the type and apply it to an argument.
parse :: Parser a -> String -> [ (a, String) ]
parse (P p) s = p s
This function is but a simple pattern match on the structure of a Parser. Pretty cool, if you ask me! Given this, we can construct our next core function, an “either” Parser. The premise behind this is that we cannot predict in advance what we are parsing. Take the hypothetical example of parsing code into an abstract syntax tree. We cannot know if we’re parsing, say, an operator versus an atom. As a result, we have to try both parsers! If the first succeeds, we know we’ve parsed an operator. If it fails, we can try the next one safely.
The either Parser, which we will call <|>, is also our first to represent a joining of Parser blocks!
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P $ \cs -> case parse p1 cs of
[] -> parse p2 cs
ps -> ps
Our type signature makes sense — <|> takes two Parsers and returns a new Parser that will try the second Parser iff the first fails. The implementation itself makes sense — if the first Parser succeeded, return what it gives us. If it failed, return what the second Parser gives us (which may fail itself).
Now, <|> ran one Parser, looked at its result, and (possibly) ran a second Parser. This fits nicely into the bind paradigm, implying that Parsers are a Monad. Indeed, we can define them as such.
instance Monad Parser where
p1 >>= fp2 = P $ \cs -> do (a, cs') <- parse p1 cs
parse (fp2 a) cs'
return x = P $ \cs -> [ (x, cs) ]
fail _ = P $ \_ -> []
bind is always tricky to think about, but as customary with Haskell, we let the types direct us. The type of bind is Parser a -> (a -> Parser b) -> Parser b, and as a reminder, we constructed our Parser as newtype Parser a = P (String -> [ (a, String) ]). The Monad should jump out at us! A parse gives us a formatted result of type a, which we can feed into fp2 to give us a Parser b that we can then run on the remaining String cs'.
Note: we’ve availed ourselves of the List Monad to take care of the nondeterminism; indeed, that is what the List Monad is for! As a result, our very own Parser Monad is itself nondeterministic.
return is a lot easier to reason about and write; to shove an a into a Parser, just pretend that we’ve already parsed it. fail (optional but convenient in Monads) is even easier, since we’ve established earlier that the empty list represents a failed parse.
While we’re at it, we might as well write a Functor on Parsers.
instance Functor Parser where
fmap f p = p >>= \x -> return (f x)
In other words, we can map over a Parser simply by running a Parser p and then mapping the formatted type with f.
That’s actually all we need to do with our core library! We don’t need to ever open up or pattern match on the structure of a Parser again. That’s, to me, the beauty of parser combinators. The simple notion that we can build large Parsers from small ones automatically gives us another level of abstraction and safety!
Combinators
Our next part of this will be writing some specific combinators simply from the convenient base case of getChar and helper case of <|>.
Using a predicate
We may want to parse if a character satisfies some predicate. For example, if we’re parsing code, we might want to specifically parse a semicolon to indicate the end of a statement, or we might want to parse anything that’s alphanumeric to represent a character of a variable name. We will creatively call this function satisfy, and we can use our new Parser Monad to help us out a ton. The basic idea of how satisfy works is that we will eagerly parse a character and check after the fact to see if it matches what we were expecting.
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = do
c <- getChar
if p c then return c else fail "Did not satisfy boolean predicate"
satisfy builds off of getChar. We can use it to build some useful Char checkers. (All boolean functions are imported from Data.Char.)
alpha, digit, upper, lower, space :: Parser Char
alpha = satisfy isAlpha
digit = satisfy isDigit
upper = satisfy isUpper
lower = satisfy isLower
space = satisfy isSpace
We can specifically check for Char equality with a useful combinator:
char :: Char -> Parser Char
char = satisfy . (==) -- non-FPN — char c = satisfy (== c)
Going from Chars to Strings
How do we make the jump from reading a Char to a String? In procedural-style parsing, we would use concatenation or a StringBuilder. In Haskell, we can rely on Monads!
string :: String -> Parser String
string = mapM char
Again, let the types guide us.
mapM :: Monad m => (a -> m b) -> [a] -> m [b]
char :: Char -> Parser Char
string :: String -> Parser String
char fits as the first argument to mapM, where a == b == Char and m == Parser. Thus, the second argument to mapM, an inputted String, returns a Parser String — which is exactly what string tell us!
Intuitively, what we're doing is taking a predicate String and checking each character of the predicate in order — hence, mapping is proper here.
More complex parses of unknown length
Let's say we want to run a specific Parser multiple times. Our gut instinct may be to try this:
manyNaive :: Parser a -> Parser [a]
manyNaive p = do x <- p
xs <- manyNaive p
return (x:xs)
However, once we fail, we will not recursively return anything we've already parsed — we will fail! We thus can't parse anything finite, which is nigh-useless. The way to combat this is with mutually-recursive functions.
-- | Parses zero or more instances of p.
many :: Parser a -> Parser [a]
many p = rest p <|> (return [])
-- | Parses one or more instances of p.
rest :: Parser a -> Parser [a]
rest p = do x <- p
xs <- many p
return (x:xs)
many will try to parse once. If that fails, it will return an empty list. This is NOT the same as failing! Remember, return [] will box the empty list INTO a successful parser. On the other hand, rest will DEFINITELY parse once and WILL return a failure otherwise. If it parses once, it will call on many to search for the possibility of more input.
This is a bit tricky, so let's analyze the control flow a bit more. If we run (many digit) on "14", we'll try to parse a digit in rest. We can parse a '1', so we keep that around and call many on the remaining string, "4". We'll parse the '4' in rest, keep it around, and call many on the empty string, which calls back to rest. That returns a FAILURE back to many. By the nature of the <|> function, we recognize said failure and move to the second branch, which simply returns a "successful" parse that has no characters. We then recursively go up the stack of rest calls, which returns our successfully parsed String of digits.
Converting to other formatting types
Keeping everything in Strings limits us as programmers. Let's say we want to parse an integer from our input, and we want to store it as an Int. We can easily do this using rest.
int :: Parser Int
int = do
s <- string "-" <|> return [] -- picks up the negative sign
d <- rest digit -- guarantees at least one digit, so no naked negative signs
return (read (s ++ d) :: Int)
While the read function is normally unsafe to use because of its potential to throw errors, we've safely guaranteed that we will ONLY ever call it on actual digits thanks to our digit combinator. We will fail gracefully with a Parser error before we ever fail from trying to convert non-digit Chars into Ints.
You can also use algebraic datatype constructors to push bare components, such as Strings and Ints, into more complex types. We will explore this when parsing an assembly language.
Trying many potential Parsers
Real-world use cases may have a ton of nondeterminism. (In the next installment, for example, I will be examining a very basic assembly language, which will have several arithmetic and logical operations, all of which must be eagerly guessed.) It will not be uncommon to need to chain together many Parsers with <|>. To save your eyes, we can construct an unlist function to take care of this for us:
unlist :: [Parser a] -> Parser a
unlist = foldr (<|>) (fail "Could not parse")
Conclusion
You have just seen the basics of monadic parsing via combinators in Haskell as a safer, cleaner, and simpler alternative to procedural parsing. For simplicity's sake, we limited this method of parsing to Strings specifically. In a future post, we will expand our Parser to read from an arbitrary holding type, such as Strings, ByteStrings, and lists of lexed tokens. We will also show how to parse a basic assembly language into an algebraic datatype. For now, I hope that this post was able to make very clear how to construct a monadic parser combinator from scratch, how to use it in unique ways, and how functional programming can make parsing easier to understand.