algorithmoptimizationset-cover

How would I sort through lists of frequently used words to find efficient combinations using the most unique words as possible?


I have lists of the most frequently used words, derived from Google's publicly available ngram data.

I have:

6800 frequent 2grams 4800 frequent 3grams 2500 frequent 4grams 1100 frequent 5grams

an example 2 ngram would be something like:

"the dog" "a book" "three chairs" etc.

an example 5 ngram would be something like: "once upon a time there" "upon a time there was" "it was a dark and" etc.

I also have a list of 2,000 frequent words.

1) I want to find out which combination of the fewest number of ngrams from my various lists contains the most number of words from the frequent word list.

For example, if I found 200 2grams, 40 3 grams, 50 4 grams, and 20 5 grams that used 1800 of the frequent words, that would be a success. I made those ratios up, but I would like to find less than 500 combinations that use the majority of the words.

2) I would also like to find the smallest number of combinations of the various ngrams that contains the highest total amount of words from the lists.

For example, if I could find 500 ngrams that use over 2000 different words, that would be great.

The problem I am having is that I have no idea how I would go about doing this. I think hadoop and mapreduce are in the right direction... but any help would be appreciated!


Solution

  • You have on the order of 15k ngrams. This is an extremely small data set. It will likely fit into 1 MB of memory, probably less than 1/5000 of the total memory on your machine. You don't need hadoop to solve such a problem. Further, it's not really a machine learning problem at all, it's just an optimization problem.

    You could think of your n-grams as (small) sets of words, and your frequent word list as a larger set. For your first problem, you want to pick the fewest number of n-grams such that you can cover (or come as close to covering as possible) the frequent word list with those n-grams. This is exactly a set cover problem. You probably won't get an exact solution, but there are simple heuristics that do well.

    I am not totally clear on how your first problem differs from your second problem, however.