Tags: java, lucene, hadoop, mahout, tf-idf

Mahout TFIDF Dictionary File


I am trying to perform TFIDF on a set of documents (as text files), using Mahout to do the calculations, following this guide.

I have successfully created the dictionary and vector weights, and am now attempting to access the output. In the guide it says you "can for instance easily load the content of the generated dictionary file into a Map with token index as keys and the tokens as values."

I am not sure how to go about loading this file into a map as the guide suggests. Does anybody know how it is done?

I created my vectors from a directory of text files. One issue I encountered when running "./mahout seq2sparse..." was the -a flag, which controls the analyser and should be Lucene's StandardAnalyzer. When I ran with that flag I received a ClassNotFoundException, but removing it solved the problem; since I believe the default analyser is the same one anyway, the output should be the same as in the example.

If anybody knows how to load this dictionary into a map I will be eternally grateful!

James


Solution

  • I worked it out, so I am putting this up for anyone who comes across this on Google.

            // Requires org.apache.hadoop.io.SequenceFile, IntWritable and Text,
            // org.apache.hadoop.fs.Path, and java.util.Map/HashMap; fs is a Hadoop
            // FileSystem and conf a Configuration.
            // The dictionary SequenceFile stores Text tokens as keys and IntWritable term indices as values.
            SequenceFile.Reader read = new SequenceFile.Reader(fs, new Path("<path to dictionary>"), conf);
            IntWritable dicKey = new IntWritable();
            Text text = new Text();
            Map<Integer, String> dictionaryMap = new HashMap<Integer, String>();
            while (read.next(text, dicKey)) {
                dictionaryMap.put(dicKey.get(), text.toString());
            }
            read.close();
    

    This worked for me, allowing me to read the mapping of id to token in my dictionary file from Mahout.
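
    For anyone who also wants to read the weights back out, here is a minimal sketch of using that map to decode the tfidf-vectors output of seq2sparse. It assumes the vector SequenceFile sits at the placeholder path shown below and that org.apache.mahout.math.Vector and VectorWritable are on the classpath; iterateNonZero() is the older Vector API (newer Mahout releases expose nonZeroes() instead), so adjust for your version.

            // Sketch only: decode the TF-IDF vectors using the dictionaryMap built above.
            // Requires org.apache.mahout.math.Vector, org.apache.mahout.math.VectorWritable
            // and java.util.Iterator; the path below is a placeholder.
            SequenceFile.Reader vecReader = new SequenceFile.Reader(fs,
                    new Path("<path to tfidf-vectors>/part-r-00000"), conf);
            Text docId = new Text();
            VectorWritable vectorWritable = new VectorWritable();
            while (vecReader.next(docId, vectorWritable)) {
                Vector vector = vectorWritable.get();
                StringBuilder line = new StringBuilder(docId.toString()).append(":");
                Iterator<Vector.Element> elements = vector.iterateNonZero();
                while (elements.hasNext()) {
                    Vector.Element e = elements.next();
                    // map the dimension index back to its token and show its TF-IDF weight
                    line.append(' ').append(dictionaryMap.get(e.index())).append('=').append(e.get());
                }
                System.out.println(line);
            }
            vecReader.close();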