I have lists of the most frequently used words and n-grams, derived from Google's publicly available n-gram data.
I have:
6,800 frequent 2-grams
4,800 frequent 3-grams
2,500 frequent 4-grams
1,100 frequent 5-grams
An example 2-gram would be something like: "the dog", "a book", "three chairs", etc.
An example 5-gram would be something like: "once upon a time there", "upon a time there was", "it was a dark and", etc.
I also have a list of 2,000 frequent words.
1) I want to find the combination of the fewest n-grams from my various lists that covers the largest number of words from the frequent word list.
For example, if I found 200 2-grams, 40 3-grams, 50 4-grams, and 20 5-grams that together used 1,800 of the frequent words, that would be a success. I made those ratios up, but I would like to find fewer than 500 n-grams that cover the majority of the words.
2) I would also like to find the smallest combination of the various n-grams that contains the highest total number of words from the lists.
For example, if I could find 500 n-grams that use over 2,000 different words, that would be great.
The problem I am having is that I have no idea how I would go about doing this. I think Hadoop and MapReduce are in the right direction... but any help would be appreciated!
You have on the order of 15k n-grams. This is an extremely small data set: it will likely fit into 1 MB of memory, probably less than 1/5000 of the total memory on your machine. You don't need Hadoop to solve such a problem. Further, it's not really a machine learning problem at all; it's just an optimization problem.
You can think of your n-grams as (small) sets of words, and your frequent word list as one larger set. For your first problem, you want to pick the fewest n-grams such that you cover (or come as close to covering as possible) the frequent word list with those n-grams. This is exactly the set cover problem. Since set cover is NP-hard you probably won't get an exact solution, but there are simple greedy heuristics that do well.
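For instance, here is a minimal sketch of the standard greedy heuristic in Python: repeatedly pick whichever n-gram adds the most still-uncovered frequent words. The names `ngrams` and `frequent_words` below are placeholders for your own lists, not anything from your data:

```python
def greedy_cover(ngrams, frequent_words):
    """Greedy set cover: keep picking the n-gram that adds the most
    still-uncovered frequent words, until nothing left helps."""
    remaining = set(frequent_words)                  # words not yet covered
    candidates = {ng: set(ng.split()) for ng in ngrams}
    chosen = []                                      # picked n-grams, in order

    while remaining and candidates:
        # Score each candidate by how many uncovered words it would add.
        best = max(candidates, key=lambda ng: len(candidates[ng] & remaining))
        gain = candidates[best] & remaining
        if not gain:                                 # no remaining candidate helps
            break
        chosen.append(best)
        remaining -= gain
        del candidates[best]

    return chosen, remaining                         # selections + uncovered words

# Toy example with made-up data:
ngrams = ["the dog", "a book", "once upon a time there", "it was a dark and"]
frequent_words = ["the", "dog", "a", "book", "time", "dark"]
picked, uncovered = greedy_cover(ngrams, frequent_words)
print(picked)     # n-grams chosen, most useful first
print(uncovered)  # frequent words no n-gram could cover
```

The greedy choice is the classic logarithmic-factor approximation for set cover, and with only ~15k n-grams and 2,000 frequent words it will run in well under a second; you can pool all of your n-gram lists together and run it once.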
I am not totally clear on how your first problem differs from your second problem, however.