
Extracting user interests from social profiles


This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:

"I consider myself a pretty diverse character... I'm a professional wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at "Armageddon." I'll head bang to AC/DC, and I'm seriously considering getting a Legend of Zelda tattoo. I'm 420-friendly. I like to party it up with the frat crowd one night, hang out with my Burning Man friends the next, play Halo and World of Warcraft the next, and jam with friends that aren't any younger than 40 the next. My youngest friend is 16, my oldest friend is 66. I'll sing karaoke at the bars, and I'm my friends' collective psychiatrist/shoulder."

The profiles are plain text. There are no meta tags or IDs associated with any of it; it's just a paragraph of text.

My naive idea was to take each noun and match it against Freebase to see whether it's an activity, artist, movie, book, etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.

I have 2 questions:

  1. What subfield of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
  2. How hard is this problem?

Thanks!


Solution

  • First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.

    For instance:

    I am not a specialist in this field, but I can recommend a couple of NLP resources that are accessible to the non-specialist or novice. The first is a text processing API. It is a simple REST web service that takes and returns JSON. It is free and seems to have a fairly generous rate limit.

    This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable Python library that includes modules directed at the problem in your question, e.g., sentiment analysis, tagging, and chunk extraction.

    Which particular subdomain is most relevant to solving the question in the OP? I don't know, but I suspect there's a module somewhere in NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules as well as demos for each of them.
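
To make the combination of entity matching and sentiment analysis a bit more concrete, here is a minimal sketch in plain Python. It does not use NLTK or the web API mentioned above. The `GAZETTEER` dictionary is a hypothetical stand-in for a knowledge-base lookup such as Freebase, and the `NEGATIVE_CUES` list and the `extract_interests` and `contains_phrase` helpers are names I made up for illustration. A real sentiment module would be far more robust than this keyword check.

```python
import re

# Hypothetical stand-in for a knowledge-base lookup (e.g. Freebase):
# maps a lowercase surface form to a coarse category.
GAZETTEER = {
    "ac/dc": "artist",
    "armageddon": "movie",
    "legend of zelda": "game",
    "halo": "game",
    "world of warcraft": "game",
    "burning man": "event",
    "karaoke": "activity",
}

# Made-up cue phrases suggesting the author dislikes what is mentioned.
NEGATIVE_CUES = ("hate", "can't stand", "don't like", "not a fan of")


def contains_phrase(phrase, text):
    """True if phrase occurs in text and is not embedded in a longer word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text) is not None


def extract_interests(text):
    """Return (entity, category, polarity) tuples found in text."""
    results = []
    # Split into rough sentences so a cue only affects its own sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        negative = any(contains_phrase(cue, lowered) for cue in NEGATIVE_CUES)
        for name, category in GAZETTEER.items():
            if contains_phrase(name, lowered):
                polarity = "dislike" if negative else "like"
                results.append((name, category, polarity))
    return results


profile = (
    "I'll head bang to AC/DC, and I cried at Armageddon. "
    "I play Halo and World of Warcraft. I hate karaoke."
)
for item in extract_interests(profile):
    print(item)
```

The sentence-level cue check is deliberately crude: it marks every entity in a sentence as disliked if any cue appears, and it ignores sarcasm and negated cues such as "I don't hate". That is the part where a proper sentiment analysis module should take over.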