parsinghaskellnltkearley-parser

Using the Earley library to parse with features and unification


The Earley parsing library is great for writing linguistic parsers in Haskell. CFGs can be specified in an intuitive way, and there is excellent support for backtracking and ambiguity. A simple example:

{-# LANGUAGE OverloadedStrings #-}

import Text.Earley

np = rule ("John" <|> "Mary")
vp = rule ("runs" <|> "walks")

sentence = do
  subj <- np
  pred <- vp
  return $ (++) <$> subj <*> pred

sentence can be used to parse ["John", "runs"] or ["Mary", "walks"], among other inputs.

It would be nice to be able to use Earley to write parsers for FCFGs, where nonterminals are complexes of a label and a feature bundle, and feature matching can happen via unification (for example, the Earley parser in NLTK parses FCFGs). However, it is not clear how to do this using Earley, or whether it can even be done. An example of something we might want in something like BNF:

np[sg] ::= "John" | "Mary"

np[?x] ::= det n[?x]
n[pl]  ::= "boys" | "girls"

det    ::= "the"

vp[sg] ::= "runs" | "walks"
vp[pl] ::= "run"  | "walk"

s ::= np[?x] vp[?x]

Under this FCFG, ["John", "runs"] is an s (since their number features match, as required by the s rule), and ["the", "boys", "walks"] isn't an s (since ["the", "boys"] parses to np[pl] and ["walks"] parses to vp[sg]).

One can in general rewrite an FCFG into an equivalent CFG, but this can be highly inconvenient, and result in a blowup of the grammar, especially when we have many possible features ranging over many possible values.


Solution

  • You're not actually doing any particularly interesting unification here, so perhaps it's enough to toss a very simple nondeterminism applicative of your own into the mix. The standard one is [], but for this case, even Maybe looks like enough. Like this:

    {-# Language OverloadedStrings #-}
    {-# Language TypeApplications #-}
    
    import Control.Applicative
    import Control.Monad
    import Data.Foldable
    import Text.Earley
    
    data Feature = SG | PL deriving (Eq, Ord, Read, Show)
    
    (=:=) :: (Feature, a) -> (Feature, b) -> Maybe (a, b)
    (fa, a) =:= (fb, b) = (a, b) <$ guard (fa == fb)
    
    data NP = Name String | Determined String String deriving (Eq, Ord, Read, Show)
    
    np :: Grammar r (Prod r e String (Feature, NP))
    np = rule . asum $
        [ fmap (\name -> (SG, Name name)) ("John" <|> "Mary")
        , liftA2 (\det n -> (PL, Determined det n)) "the" ("boys" <|> "girls")
        ]
    
    vp :: Grammar r (Prod r e String (Feature, String))
    vp = rule . asum $
        [ (,) SG <$> ("runs" <|> "walks")
        , (,) PL <$> ("run" <|> "walk")
        ]
    
    s :: Grammar r (Prod r e String (Maybe (NP, String)))
    s = liftA2 (liftA2 (=:=)) np vp
    
    test :: [String] -> IO ()
    test = print . allParses @() (parser s)
    

    Try it out in ghci:

    > sequence_ [test (words n ++ [v]) | n <- ["John", "the boys"], v <- ["walks", "walk"]]
    ([(Just (Name "John","walks"),2)],Report {position = 2, expected = [], unconsumed = []})
    ([(Nothing,2)],Report {position = 2, expected = [], unconsumed = []})
    ([(Nothing,3)],Report {position = 3, expected = [], unconsumed = []})
    ([(Just (Determined "the" "boys","walk"),3)],Report {position = 3, expected = [], unconsumed = []})
    

    So, the result needs a bit of interpretation -- a successful parse of Nothing really counts as a failed parse -- but perhaps that's not so bad? Not sure. Certainly it's unfortunate that you don't get to reuse Earley's error-reporting and nondeterminism machinery. Probably to get either thing, you'd have to fork Earley.

    If you need to do real unification you could look into returning a IntBindingT t Identity instead of a Maybe, but at least until your features are themselves recursive this is probably enough and much, much simpler.