twitter, nlp, text-extraction, nltk, text-analysis

Tag generation from small text content (such as tweets)


I asked a similar question earlier, but I have since noticed that I have a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords).

It seems that the accepted suggestion (the point-wise mutual information algorithm) is meant to work on bigger documents.

Given this constraint (working on small sets of text), how can I generate tags?

Regards


Solution

  • Two Stage Approach for Multiword Tags

    You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.

    For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with a simple for loop over all the tweets. However, if speed is a concern, you could use PyLucene to quickly find the tweets that contain each collocation.
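
The two stages above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library as a stand-in for the NLTK collocation code mentioned above; it is not that code. The function names (`tokenize`, `top_collocations`, `tag_tweets`), the whitespace tokenizer, the `min_count` cutoff, and the example tweets are all assumptions made for the sketch. Bigrams are counted within each tweet so that no collocation spans a tweet boundary, and they are ranked by bigram PMI.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real tweets
    # would need handling for URLs, @mentions, hashtags, punctuation).
    return text.lower().split()


def top_collocations(tweets, n, min_count=2):
    """Stage 1: pool all tweets and return the n highest-PMI bigrams."""
    unigrams = Counter()
    bigrams = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        unigrams.update(tokens)
        # Pair each token with its successor inside the same tweet only.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for bigrams seen only once
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda bigram: (-scores[bigram], bigram))
    return ranked[:n]


def tag_tweets(tweets, collocations):
    """Stage 2: tag each tweet with the collocations that occur in it."""
    tagged = []
    for tweet in tweets:
        tokens = tokenize(tweet)
        present = set(zip(tokens, tokens[1:]))
        tagged.append([" ".join(c) for c in collocations if c in present])
    return tagged


tweets = [
    "machine learning is fun",
    "i love machine learning",
    "new york is big",
    "flying to new york today",
]
collocations = top_collocations(tweets, n=2)
tags = tag_tweets(tweets, collocations)
for tweet, tweet_tags in zip(tweets, tags):
    print(tweet, "->", tweet_tags)
```

The simple loop in `tag_tweets` is the "for loop over all the tweets" option; an index such as PyLucene would replace it only if the collection is large enough for speed to matter.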

    Tweet Level PMI for Single Word Tags

    As also suggested here, for single-word tags you could calculate the point-wise mutual information between each individual word and the tweet itself, i.e.

    PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
    

    Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific tweet as opposed to coming across it in the larger collection. You could then tag the tweet with a few terms that have the highest PMI with the tweet.
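
One way to sketch the tweet-level PMI tagging is shown below. The probability estimates are an assumption of this sketch, since the answer does not specify them: with N the total number of tokens in the collection, P(term, tweet) is taken as the count of the term in the tweet divided by N, P(term) as the count of the term in the whole collection divided by N, and P(tweet) as the tweet length divided by N. The function name `pmi_tags`, the parameter `k`, the whitespace tokenizer, and the example tweets are likewise hypothetical.

```python
import math
from collections import Counter


def pmi_tags(tweets, k=3):
    """Return, for each tweet, the k terms with the highest PMI(term, tweet)."""
    # Naive lowercase whitespace tokenization (an assumption).
    tokenized = [tweet.lower().split() for tweet in tweets]
    collection_counts = Counter(tok for toks in tokenized for tok in toks)
    total = sum(collection_counts.values())  # N: tokens in the collection

    results = []
    for toks in tokenized:
        tweet_counts = Counter(toks)
        scores = {}
        for term, count_in_tweet in tweet_counts.items():
            p_joint = count_in_tweet / total          # P(term, tweet)
            p_term = collection_counts[term] / total  # P(term)
            p_tweet = len(toks) / total               # P(tweet)
            scores[term] = math.log(p_joint / (p_term * p_tweet))
        # Highest PMI first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


example_tweets = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the stock market fell",
]
tags_per_tweet = pmi_tags(example_tweets, k=3)
for tweet, tweet_tags in zip(example_tweets, tags_per_tweet):
    print(tweet, "->", tweet_tags)
```

Under these estimates, a word that occurs in every tweet (such as "the" here) scores low, while words that occur only in one tweet all tie at the highest score. On very small collections many terms will tie this way, so a minimum collection frequency or a stopword filter may be needed in practice.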

    General Changes for Tweets

    Some changes you might want to make when tagging tweets include: