Fri 23 Sep 2011
A Parsec Full of Rats, Part 2
Posted by Edward Kmett under Algorithms, Data Structures, Haskell, Monads, Parsing, Uncategorized
Last time, I showed that we can build a small parsec clone with packrat support.
This time I intend to implement packrat directly on top of Parsec 3.
One of the main topics of discussion around packrat parsing since Bryan Ford's initial release of Pappy has been that, in general, you shouldn't use packrat to memoize every rule. Instead, you should apply Amdahl's law and look for the cases where the lookup time is paid back in terms of repetitive evaluation, computation time and the hit rate. This is great news for us, since we only want to memoize a handful of expensive combinators.
First, we'll need to import enough of Parsec to do something interesting.
{-# LANGUAGE RecordWildCards, ViewPatterns, FlexibleInstances, MultiParamTypeClasses #-}
import Text.Parsec
import qualified Text.Parsec.Token as T
import Text.Parsec.Token (GenLanguageDef(..), GenTokenParser(TokenParser))
import Text.Parsec.Pos (initialPos, updatePosChar)
import Data.Functor.Identity (Identity(..))
import Control.Applicative hiding ((<|>))
import Control.Monad.Fix (fix)
Then as before, we'll define PEG-style backtracking:
(</>) :: Monad m => ParsecT s u m a -> ParsecT s u m a -> ParsecT s u m a
p </> q = try p <|> q
infixl 3 </>
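To see why the try matters, here is a minimal sketch on plain String input. The names plain, peg and run are my own for illustration, and the operator is specialised to Parsec's stock Parser type rather than the general ParsecT used in this post. Both alternatives share the prefix "let", so the first branch consumes input before it fails.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- PEG-style ordered choice, specialised from the definition above.
(</>) :: Parser a -> Parser a -> Parser a
p </> q = try p <|> q
infixl 3 </>

-- Both alternatives share the prefix "let".
plain, peg :: Parser String
plain = string "letrec" <|> string "let"   -- no backtracking after consuming "let"
peg   = string "letrec" </> string "let"   -- rewinds, then tries the second branch

-- Illustrative helper: collapse the result to a Maybe for easy comparison.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p "<example>"
```

On the input "let x", run plain should give Nothing, because string "letrec" has already consumed "let" when it fails, while run peg should give Just "let".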
Now we need an analogue to our Result type from last time, which recalled whether or not we had consumed input, and what the current cursor location is. Fortunately, we can recycle the definitions from Parsec to this end.
type Result d a = Consumed (Reply d () a)
We'll define a combinator to build a parser directly from a field accessor. Last time, this was just the use of the "Rat" constructor. Now it is a bit trickier, because we need to turn Consumed (Reply d () a) into m (Consumed (m (Reply d u a))) by wrapping it in the appropriate monad, and giving the user back his state unmolested.
rat :: Monad m => (d -> Result d a) -> ParsecT d u m a
rat f = mkPT $ \s0 -> return $
  return . patch s0 <$> f (stateInput s0) where
    patch (State _ _ u) (Ok a (State s p _) err) = Ok a (State s p u) err
    patch _             (Error e)                = Error e
Last time we could go from a parser to a result just by applying the user stream type, but with parsec we also have to supply their notion of a position. This leads to the following combinator. By running in the Identity monad with no user state it should be obvious that we've duplicated the functionality of the previous 'Rat' parser (with the addition of a source position).
womp :: d -> SourcePos -> ParsecT d () Identity a -> Result d a
womp d pos p = fmap runIdentity . runIdentity $ runParsecT p (State d pos ())
The combinator is so named because we needed a big space-rat rather than a little pack-rat to keep with the theme.
It's not impossible. I used to bullseye womp rats in my T-16 back home, they're not much bigger than two meters.
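As a rough illustration of what womp hands back, here is a sketch that runs it over an ordinary String stream instead of the memoizing stream type. The summary and describe helpers and the example binding are hypothetical names introduced only for this sketch, and the low-level pieces are imported from Text.Parsec.Prim explicitly.

```haskell
import Text.Parsec
import Text.Parsec.Prim (Consumed(..), Reply(..), State(..), runParsecT)
import Text.Parsec.Pos (initialPos)
import Data.Functor.Identity (Identity(..))

-- womp as defined above, with the result type written out in full.
womp :: s -> SourcePos -> ParsecT s () Identity a -> Consumed (Reply s () a)
womp d pos p = fmap runIdentity . runIdentity $ runParsecT p (State d pos ())

-- Hypothetical helper: (did we consume input?, parsed value, remaining input).
summary :: Consumed (Reply s () a) -> (Bool, Maybe a, Maybe s)
summary (Consumed rep) = describe True rep
summary (Empty rep)    = describe False rep

describe :: Bool -> Reply s () a -> (Bool, Maybe a, Maybe s)
describe c (Ok a st _) = (c, Just a, Just (stateInput st))
describe c (Error _)   = (c, Nothing, Nothing)

-- Parse the leading digits of "12+3" starting from the initial position.
example :: (Bool, Maybe String, Maybe String)
example = summary (womp "12+3" (initialPos "<example>") (many1 digit))
```

Here example should come out as (True, Just "12", Just "+3"): the consumed flag, the value and the leftover input are exactly the pieces a memoized field needs to remember.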
Now we'll write a bit of annoyingly verbose boilerplate to convince Parsec that we really want a LanguageDef for some monad other than Identity. (As an aside, why Text.Parsec.Language doesn't contain GenLanguageDefs that are parametric in their choice of Monad is beyond me.)
myLanguageDef :: Monad m => T.GenLanguageDef D u m
myLanguageDef = T.LanguageDef
  { commentStart    = "{-"
  , commentEnd      = "-}"
  , commentLine     = "--"
  , nestedComments  = True
  , identStart      = letter <|> char '_'
  , identLetter     = alphaNum <|> oneOf "_'"
  , opStart         = opLetter myLanguageDef
  , opLetter        = oneOf ":!#$%&*+./<=>?@\\^|-~"
  , reservedOpNames = []
  , reservedNames   = []
  , caseSensitive   = True
  }
As a shameless plug, trifecta offers a particularly nice solution to this problem, breaking up the monolithic Token type into separate concerns and letting you layer parser transformers that enrich the parser to deal with things like Haskell-style layout, literate comments, parsing comments in whitespace, etc.
And as one last bit of boilerplate, we'll abuse RecordWildCards once again to avoid the usual 20 lines of boilerplate that are expected of us, so we can get access to parsec's token parsers.
TokenParser {..} = T.makeTokenParser myLanguageDef
Now we're ready to define our incredibly straightforward stream type:
data D = D
  { _add     :: Result D Integer
  , _mult    :: Result D Integer
  , _primary :: Result D Integer
  , _dec     :: Result D Integer
  , _uncons  :: Maybe (Char, D)
  }

instance Monad m => Stream D m Char where
  uncons = return . _uncons
And using the general purpose rat combinator from earlier, we can write some memoized parsers:
add, mult, primary, dec :: Parsec D u Integer
add     = rat _add
mult    = rat _mult
primary = rat _primary
dec     = rat _dec
And finally, we write the code to tie the knot and build the stream:
prep :: SourceName -> String -> D
prep n = go (initialPos n) where
  go p s = fix $ \d -> let
    (womp d p -> _add) = (+) <$> mult <* reservedOp "+" <*> add </> mult <?> "summand"
    (womp d p -> _mult) = (*) <$> primary <* reservedOp "*" <*> mult </> primary <?> "factor"
    (womp d p -> _primary) = parens add </> dec <?> "number"
    (womp d p -> _dec) = natural
    _uncons = case s of
      (x:xs) -> Just (x, go (updatePosChar p x) xs)
      []     -> Nothing
    in D { .. }

runD :: Parsec D u a -> u -> SourceName -> String -> Either ParseError a
runD p u fn s = runParser p u fn (prep fn s)
and finally, let it rip:
eval :: String -> Integer
eval s = either (error . show) id $ runD (whiteSpace *> add <* eof) () "-" s
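If the knot-tying in the stream construction above looks opaque, the same trick can be seen in miniature. The following is only a sketch under my own assumptions: the type S, its fields and build are hypothetical stand-ins for D and its builder, and the single "memoized rule" just counts the run of digits starting at each position instead of running a Parsec parser.

```haskell
import Data.Char (isDigit)
import Data.Function (fix)

-- Hypothetical miniature of D: every input position carries one lazily
-- computed, shared result plus the way to step to the next position.
data S = S
  { _digits :: Int              -- length of the digit run starting here
  , _next   :: Maybe (Char, S)  -- analogue of _uncons
  }

-- Build one S per suffix of the input. The _digits field refers back to the
-- record under construction (via fix), just as the memoized fields of D are
-- computed by running parsers over the very stream they live in.
build :: String -> S
build s = fix $ \d -> S
  { _digits = case _next d of
      Just (c, rest) | isDigit c -> 1 + _digits rest
      _                          -> 0
  , _next = case s of
      (x:xs) -> Just (x, build xs)
      []     -> Nothing
  }
```

For example, _digits (build "123+4") should evaluate to 3. Because each field is an ordinary lazy record field, the result at a given position is computed at most once, no matter how many times it is demanded.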
While this approach tends to encourage memoizing fewer combinators than libraries such as frisby, this is exactly what current research suggests you probably should do with packrat parsing!
The other purported advantage of packrat parsers is that they can deal with left recursion in the grammar. However, that is not the case: hidden left recursion in the presence of the algorithm used in the Scala parsing combinator libraries leads to incorrect non-left-most parses, as shown by Tratt.
I leave it as an exercise for the reader to extend this material with the parsec+iteratees approach from my original talk on trifecta to get packrat parsing of streaming input. Either that or you can wait until it is integrated into trifecta.
You can download the source to this (without the spurious spaces inserted by wordpress) here.
If I can find the time, I hope to address Scott and Johnstone's GLL parsers, which actually achieve the O(n^3) worst case bounds touted for Tomita's GLR algorithm (which is actually O(n^4) as it was originally defined, despite the author's claims). I'd also like to look at how to encode them in Haskell, with an eye towards building a memoizing parser combinator library that can parse LL(1) fragments in O(1), deal with arbitrary context-free grammars in O(n^3), and degrade reasonably gracefully in the presence of context-sensitivity, while supporting hidden left recursion as long as such recursion passes through at least one memoized rule. This is important because CFGs are closed under extensions to the grammar, which is a nice property to have if you want a language where you can add new statement types easily without concerning yourself overmuch with the order in which you insert the rules or load the different extensions.