ruby parslet

Parslet: match a word until a delimiter is present


I'm just starting with ruby and parslet, so this might be obvious to others (hopefully).

I want to get all the words up to a delimiter (^) without consuming the delimiter.

The following rule works (but consumes the delimiter), with a result of {:wrd=>"otherthings"@0, :delim=>"^"@11}:

require 'parslet'
class Mini < Parslet::Parser
  rule(:word) { match('[a-zA-Z]').repeat }
  rule(:delimeter) { str('^') }
  rule(:othercontent) { word.as(:wrd) >> delimeter.as(:delim) }
  root(:othercontent)
end
puts Mini.new.parse("otherthings^")

I then tried using 'present?' instead:

require 'parslet'
class Mini < Parslet::Parser
  rule(:word) { match('[a-zA-Z]').repeat }
  rule(:delimeter) { str('^') }
  rule(:othercontent) { word.as(:wrd) >> delimeter.present? }
  root(:othercontent)
end
puts Mini.new.parse("otherthings^")

but this throws an exception:

Failed to match sequence (wrd:WORD &DELIMETER) at line 1 char 12. (Parslet::ParseFailed)

At a later stage I'll want to inspect the word to the right of the delimiter to build up a more complex grammar, which is why I don't want to consume the delimiter.

I'm using parslet 1.5.0.

Thanks for your help!


Solution

  • TL;DR: If you care about what comes before the "^", you should parse that first.

    --- longer answer ---

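    A parser has to consume all of the text it is given. If it can't consume everything, then the document is not fully described by the grammar. That is what happens with your 'present?' version: the lookahead itself succeeds, but it leaves the "^" unconsumed, so the parse as a whole fails. Rather than thinking of the parser as something that performs "splits" on your text, think of it as a clever state machine consuming a stream of text.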

    So, because your full grammar needs to consume the whole document, you can't develop your parser by making it parse one part and leave the rest. You want it to transform your document into a tree so you can manipulate it into its final form.

    If you really just want to consume all the text before a delimiter, you could do something like this.

    Say I wanted to parse a '^'-separated list of things.

    I could use the following rules:

    rule(:thing) { (str("^").absent? >> any).repeat(1) }  # anything that's not a ^
    rule(:list)  { thing >> ( str("^") >> thing).repeat(0) }  # a ^-separated list of things
    

    This would work as follows:

    parse("thing1^thing2") #=> "thing1^thing2"
    parse("thing1") #=> "thing1"
    parse("thing1^") #=> ERROR ... nothing after the ^ there should be a 'thing'
    

    This means list matches a string that doesn't start or end with a '^'. To be useful, however, I need to pull out the bits that are the values, using the "as" method:

    rule(:thing) { (str("^").absent? >> any).repeat(1).as(:thing) }
    rule(:list)  { thing >> ( str("^") >> thing).repeat(0) }
    

    Now when list matches a string, I get an array of hashes of "things":

    parse("thing1^thing2") #=> [ {:thing=>"thing1"@0} , {:thing=>"thing2"@7} ] 
    

    In reality, however, you probably care what a 'thing' is; not just anything can go there.

    In that case, you should start by defining the rules for a 'thing', because you don't want to use the parser to split by "^" and then re-parse the resulting strings to work out what they are made of.

    For example:

    parse("6 + 4 ^ 2") 
     # => [ {:thing=>"6 + 4 "@0}, {:thing=>" 2"@7} ]
    

    I probably want to ignore the whitespace around the "thing"s, and I probably want to deal with the 6, the + and the 4 separately. When I do that, I am going to have to throw away my "all things that aren't '^'" rule.