
How do I tokenize this string in Ruby?


I have this string:

%{Children^10 Health "sanitation management"^5}

And I want to tokenize it into an array of hashes:

[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]

I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.

Any pointers?


Solution

  • For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:

    irb> text = %{Children^10 Health "sanitation management"^5}
    irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
           { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
         end
    #=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]
    

    If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.

    A quick breakdown of the regex:

    String#scan(regex) matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured - so $1 becomes match[0], $2 becomes match[1], etc. Any capturing group that doesn't get matched against part of the string maps to a nil entry in the resulting "match".
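
    To make that concrete, here is a small sketch that runs the same regex over the example string and prints the raw matches before any mapping. The variable names pattern and matches are just illustrative.

    ```ruby
    # Same regex as in the answer: group 1 = bare word, group 2 = quoted phrase,
    # group 3 = optional boost digits after a caret.
    pattern = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

    text = %{Children^10 Health "sanitation management"^5}

    # With capturing groups, scan returns one array per match,
    # holding one entry per group (nil where the group did not take part).
    matches = text.scan(pattern)

    matches.each { |m| p m }
    # ["Children", nil, "10"]
    # ["Health", nil, nil]
    # [nil, "sanitation management", "5"]
    ```

    Note that the captured boosts are still strings at this point, and that unmatched groups show up as nil rather than being dropped.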

    The #map then takes these matches, uses block parameter destructuring to break each captured term into its own variable (we could have written do |match| ; word,phrase,boost = *match instead), and then creates your desired hashes. Exactly one of word or phrase will be nil, since the alternation in the regex only ever matches one of its two branches, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer, while (boost.nil? ? nil : boost.to_i) will ensure that nil boosts stay nil.
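
For reuse outside irb, the same steps can be wrapped in a method. This is only a sketch under the same assumptions as the quick hack above (words, double-quoted phrases, optional ^N boosts); the method name tokenize is hypothetical, and it uses the explicit destructuring form mentioned in the breakdown.

```ruby
# Hypothetical helper; the regex is the one from the answer.
def tokenize(text)
  pattern = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

  text.scan(pattern).map do |match|
    # Explicit equivalent of writing |word, phrase, boost| as block parameters.
    word, phrase, boost = *match

    {
      :keywords => (word || phrase).downcase,
      # nil.to_i would return 0, so guard to keep a missing boost as nil.
      :boost    => (boost.nil? ? nil : boost.to_i)
    }
  end
end

tokens = tokenize(%{Children^10 Health "sanitation management"^5})
p tokens
```

On Ruby 2.3 or later, boost&.to_i is a shorter way to write the nil guard.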