Word frequency algorithm for natural language processing

Tags: algorithm, nlp, word-frequency


Without getting a degree in information retrieval, I'd like to know whether there are any algorithms for counting the frequency with which words occur in a given body of text. The goal is to get a "general feel" for what people are saying over a set of textual comments, along the lines of Wordle.

What I'd like:

Reaching for the stars, these would be peachy:

I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping they work for my specific data. Something more generic would be great.


Solution

  • You'll need not one, but several nice algorithms, along the lines of the following.

    I'm sorry, I know you said you wanted to KISS, but unfortunately your demands aren't that easy to meet. Nevertheless, tools exist for all of this, and you should be able to just tie them together without having to perform any task yourself, if you don't want to. If you do want to perform a task yourself, I suggest you look at stemming; it's the easiest of the lot (see the sketch just below).
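    Here is a minimal stemming sketch using NLTK's PorterStemmer (assuming NLTK is installed; the word list is made up for illustration):

    ```python
    # Minimal stemming sketch with NLTK's rule-based Porter stemmer.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "runs", "easily", "fairly", "ran"]:
        # Porter strips suffixes by rule: "running" and "runs" both become
        # "run", but irregular forms like "ran" are left untouched, and
        # outputs ("easili", "fairli") need not be dictionary words.
        print(word, "->", stemmer.stem(word))
    ```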

    If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and plenty of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python; a rough end-to-end sketch with it follows below.
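    To make that concrete, a rough end-to-end sketch in NLTK might look like this (assuming the punkt and stopwords data have been fetched via nltk.download(); the input string is a toy example):

    ```python
    # Tokenize, lowercase, drop stop words, stem, then count: the whole
    # word-frequency pipeline in a few lines of NLTK.
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    comments = "The cats are running. A cat ran over the running track."

    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Keep alphabetic tokens only, lowercased, minus stop words.
    tokens = [t.lower() for t in nltk.word_tokenize(comments) if t.isalpha()]
    stems = [stemmer.stem(t) for t in tokens if t not in stop]

    # The most common stems give the "general feel" the question asks for.
    print(Counter(stems).most_common(5))
    ```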

    I would suggest you drop your last requirement, as it involves shallow parsing and will definitely not improve your results.

    Ah, by the way: the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to weight terms by document frequency, and to do it properly you won't get around using multidimensional vector matrices. See the toy computation below.
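    For reference, the standard formulation is tf × log(N / df); here is a bare-bones sketch over a made-up three-document corpus (no sparse matrices, just the formula):

    ```python
    # Toy tf-idf: term frequency within one document, damped by how many
    # documents of the corpus contain the term at all.
    import math

    docs = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"],
    ]

    def tf_idf(term, doc, corpus):
        tf = doc.count(term) / len(doc)           # frequency within this document
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        return tf * math.log(len(corpus) / df)    # rarer terms score higher

    print(tf_idf("cat", docs[0], docs))  # > 0: "cat" is somewhat distinctive
    print(tf_idf("the", docs[0], docs))  # 0.0: "the" appears in every document
    ```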

    ... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR myself, though, my respect for them fell just as quickly.