python  text  classification  mining

Is there a Python text mining script to classify text with multiple classifications?


Classification of descriptions into categories

I have a problem that involves determining which category a text description falls under. The text descriptions are entered by users and may contain keywords that can be matched to a specific category. Each category has its own set of keywords and phrases to match against, and there are about 100 categories. For example, a text description might be “Burlap aisle runner w/borders”. The category “Fabric” contains the keyword “Burlap”, so this description would fall under “Fabric”.

text description | category
Orange Burlap aisle runner w/borders | Fabric

However, there are a couple of exceptions that make this categorization process more difficult.

First, some text descriptions contain keywords that match multiple categories. A single description could fall under as many as 20 of the 100 categories because the same keywords appear in several of them, so the description cannot be assigned to one correct category.

For example, the text description “Orange Burlap aisle runner w/borders” contains the keyword “Orange”, which falls under the category “Fruit”, and the keyword “Burlap”, which falls under “Fabric”.

text description | category
Orange Burlap aisle runner w/borders | Fabric, Fruit

Second, some text descriptions contain keywords that do not match any category directly, so those descriptions cannot be categorized correctly either.

For example, a text description that contains the keyword “mouse” does not match directly with the category “Computer Accessory”.

Can anyone suggest an algorithm or Python library that can classify text descriptions even when no keyword matches a category directly, and that assigns a single category instead of several?

I have split both the text descriptions and the categories into keywords, and then matched them.

This is the code I used to match the text descriptions with the categories:

%LivyPy3.pyspark

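# For each description (a list of keywords), look up every keyword in
# categories_list (keyword -> category); unmatched keywords yield None.
entries['category'] = [[categories_list.get(kw) for kw in keywords]
                       for keywords in entries['text_description']]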

However, this script produces either multiple categories for a description or no category at all.
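The behaviour can be reproduced with a minimal sketch. The toy `categories_list` and `entries` below are made-up stand-ins (a plain dict of keyword to category, and a plain dict holding tokenized descriptions), not the real data:

```python
# Hypothetical toy data; the real categories_list and entries are not shown here.
categories_list = {"orange": "Fruit", "burlap": "Fabric"}  # keyword -> category
entries = {
    "text_description": [
        ["orange", "burlap", "aisle", "runner"],  # keywords hit two categories
        ["wireless", "mouse"],                    # keywords hit no category
    ]
}

# Same lookup as above: one category (or None) per keyword.
entries["category"] = [
    [categories_list.get(kw) for kw in keywords]
    for keywords in entries["text_description"]
]

for keywords, cats in zip(entries["text_description"], entries["category"]):
    # Collapse the per-keyword results into the distinct categories that matched.
    matched = sorted({c for c in cats if c is not None})
    print(keywords, "->", matched)
```

The first description ends up with both "Fabric" and "Fruit", and the second ends up with an empty list, which are the two failure cases described above.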


Solution

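  • I suggest you look up word2vec (https://skymind.ai/wiki/word2vec). Word2vec represents each word as a vector learned from the contexts the word appears in, and the vectors of the words in a phrase or sentence can be combined to capture more context than a single keyword does. Word2vec models therefore capture word associations better than exact keyword matching.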

    I would also search Google Scholar for papers matching NLP AND word2vec AND NIPS AND categorization. This search returned 4,300+ papers, which should give you plenty of direction for solving your problem. If you need exactly one category to be chosen for each description, that is a very difficult task. I saw a presentation on Mailchimp's NLP model for classifying client content into categories, and sometimes the correct category was only the model's 4th-ranked prediction. The model was very well done, but it still missed some edge cases and showed the classic bias toward more common categories over less common ones.

    https://scholar.google.com/scholar?hl=en&as_sdt=0%2C11&q=NLP+AND+word2vec+AND+categorization+AND+mailchimp&btnG= The recommendation engine paper in these results is relevant to your task, because predicting context from a small number of words in order to make a search suggestion is a similar problem.
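As a rough illustration of the word-vector idea, here is a minimal sketch in plain Python. Everything in it is an assumption for demonstration: the 3-dimensional `WORD_VECTORS` are hand-made stand-ins for embeddings you would get from a trained word2vec model, and `CATEGORY_KEYWORDS`, `mean_vector`, `cosine`, and `rank_categories` are hypothetical names. It averages the vectors of the known words in a description, compares that average with each category's averaged keyword vectors by cosine similarity, and returns the categories ranked best first.

```python
import math

# Toy 3-d vectors standing in for real word2vec embeddings (made-up values).
WORD_VECTORS = {
    "orange": (0.9, 0.1, 0.0),
    "fruit": (1.0, 0.0, 0.0),
    "burlap": (0.0, 1.0, 0.1),
    "fabric": (0.1, 0.9, 0.0),
    "runner": (0.1, 0.8, 0.1),
    "mouse": (0.0, 0.1, 0.9),
    "computer": (0.0, 0.0, 1.0),
    "accessory": (0.1, 0.2, 0.8),
}

# Hypothetical category keyword sets (the real ones have about 100 categories).
CATEGORY_KEYWORDS = {
    "Fruit": ["fruit", "orange"],
    "Fabric": ["fabric", "burlap"],
    "Computer Accessory": ["computer", "accessory"],
}


def mean_vector(words):
    """Average the vectors of the words we have embeddings for; None if none."""
    vecs = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_categories(description):
    """Return (score, category) pairs sorted from most to least similar."""
    doc_vec = mean_vector(description.lower().split())
    if doc_vec is None:
        return []  # no known words, so nothing to compare
    scores = [
        (cosine(doc_vec, mean_vector(keywords)), category)
        for category, keywords in CATEGORY_KEYWORDS.items()
    ]
    return sorted(scores, reverse=True)


for text in ["Orange Burlap aisle runner w/borders", "wireless mouse"]:
    ranking = rank_categories(text)
    print(text, "->", ranking[0][1] if ranking else "no match")
```

With these toy vectors the first description ranks "Fabric" above "Fruit", and "wireless mouse" lands in "Computer Accessory" even though "mouse" is not one of that category's keywords. Keeping the full ranking rather than only the top entry makes it easy to inspect cases where the right category is not ranked first.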