ruby, lucene, frequency, xapian, word-frequency

How to count all phrases efficiently in a large collection?


I need to create a phrase frequency table, counting all phrases in a very large collection of a few million words. The end result would be a table such as the one created here: http://www.hermetic.ch/wfca/phrases.htm

What would be an efficient algorithm to implement this? It would be even better to see it implemented in Ruby if you're able to show some specifics. Frankly, I'm even open to using Xapian or Lucene, but I don't see an immediate way to build the desired frequency table output with either of them.


Solution

  • I would recommend using a hash with the phrases as keys, incrementing the values as you find each phrase.

    Ruby is built for data manipulation of this sort, so you're coming at it from the right direction.

    I'm not going to do the project for you, but take a close look at:

    http://ruby-doc.org/core-2.0/Hash.html

    And then understand the basic regexes you'd need to parse:

    http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ

    http://rubular.com/

    Edit: In more recent Ruby (1.9+), hashes preserve insertion order, and `sort_by` returns the entries as a sorted array of pairs. I bet this would help with your table output. I'm not sure how efficiently Ruby implements that, however.
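
    The hash-counting approach above can be sketched in a few lines of Ruby. This is a minimal illustration, not a tuned implementation: the phrase lengths (1 to 3 words) and the crude regex tokenizer are assumptions you'd adjust for a real corpus, and for millions of words you'd want to stream the text rather than load it all at once.

    ```ruby
    # Count every n-word phrase (n-gram) in a text using a hash with a
    # default value of 0, then sort the entries for a frequency table.

    def phrase_frequencies(text, min_len: 1, max_len: 3)
      # Crude tokenizer: lowercase, keep runs of letters/apostrophes.
      words = text.downcase.scan(/[a-z']+/)
      counts = Hash.new(0)
      (min_len..max_len).each do |n|
        # each_cons yields every consecutive window of n words.
        words.each_cons(n) { |phrase| counts[phrase.join(" ")] += 1 }
      end
      counts
    end

    text  = "the cat sat on the mat and the cat slept"
    table = phrase_frequencies(text)

    # Sort descending by count, then alphabetically, and print the top rows.
    table.sort_by { |phrase, count| [-count, phrase] }.first(5).each do |phrase, count|
      puts format("%-12s %d", phrase, count)
    end
    ```

    The work is O(corpus length × number of phrase lengths), since each word position contributes one hash increment per n. For very large corpora you could read the file in chunks and feed words into the same hash incrementally.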