python scikit-learn data-science countvectorizer scikits

Transform input to match only exact words of the vocabulary with scikit-learn's CountVectorizer


I have a 2D array. Each row is a cooking recipe, and the columns hold that recipe's ingredients. I want to create a normalised binary matrix of the ingredients. It will have one row per recipe and one column per ingredient in the vocabulary: an element is 1 if the ingredient is present in the recipe and 0 if it is not.

Right now the binary matrix contains values above 1. That happens because the count vectorizer matches more than one word in the vocabulary. For example, suppose my vocabulary is

{'chicken': 0, 'chicken broth': 1, 'carrots': 2}

and suppose the vector I want to transform is

['chicken','carrots']

the resulting row of the binary matrix is

[2, 0, 1]

while I want it to be

[1,0,1]

That happens because 'chicken' is matched with 'chicken' but also with 'chicken broth'. Below is the snippet of my code that produces this. I want to match only exact occurrences of a word in the vocabulary. Is there a parameter, or any other way, to achieve this? I tried the ngram_range parameter without success.

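import numpy
from sklearn.feature_extraction.text import CountVectorizer

# unique_ingredients: the vocabulary; recipes: a list of ingredient lists
cv = CountVectorizer(vocabulary=unique_ingredients, lowercase=False)
taggedSentences = cv.fit_transform(unique_ingredients)

# encode each recipe: one document per ingredient, then sum the rows
vectorized_matrix_m = []
for i in recipes:
    vector = cv.transform(i)
    mylist = vector.toarray().sum(axis=0)
    vectorized_matrix_m.append(mylist.tolist())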
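For reference, here is a minimal self-contained sketch of the over-counting. The three-entry vocabulary is the one from the example; the recipe, which lists both chicken and chicken broth, is a hypothetical input chosen to trigger the effect. It suggests the extra count comes from the default word analyzer splitting 'chicken broth' into two tokens, one of which is 'chicken'.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = {'chicken': 0, 'chicken broth': 1, 'carrots': 2}
cv = CountVectorizer(vocabulary=vocabulary, lowercase=False)

# Hypothetical recipe (an assumption for illustration).
recipe = ['chicken', 'chicken broth', 'carrots']

# The default analyzer tokenizes on word boundaries, so the two-word
# ingredient never reaches the vocabulary lookup as a single token.
tokens = cv.build_analyzer()('chicken broth')
print(tokens)  # ['chicken', 'broth']

# One document per ingredient, summed column-wise as in the snippet above.
counts = cv.transform(recipe).toarray().sum(axis=0).tolist()
print(counts)  # [2, 0, 1]
```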

Solution

  • N-grams can be used to separate the word chicken from chicken broth. With bi-grams, CountVectorizer emits chicken broth (2 distinct tokens under the default word analyzer) as a single token, "chicken broth" (joined by a space), which can then match the vocabulary entry of the same name instead of the entry for chicken. To use n-grams with scikit-learn's CountVectorizer, set the ngram_range parameter to a (min_n, max_n) tuple covering the n-grams (bi-grams, tri-grams, ...) needed for the task. For this example that is ngram_range=(2, 2), and the bounds need to grow with the maximum word count of the ingredients. Be aware that a bi-gram-only range produces no tokens for single-word inputs such as chicken or carrots, so those ingredients are not counted under that setting.

    Note: Do not use a range of N-grams such as ngram_range=(1, 2). The uni-gram chicken is then still emitted alongside the bi-gram chicken broth, so both vocabulary entries are counted.

    Summarizing, you could change the 1st line of code as follows (assuming max_word_count is the maximum word count as described above):

    cv = CountVectorizer(vocabulary=unique_ingredients, lowercase=False, ngram_range=(max_word_count, max_word_count))
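
To see what the different ngram_range settings emit, the sketch below runs the vectorizer's analyzer on the example ingredients. Its second half is not the n-gram approach but a swapped-in alternative, offered as an assumption about what the question is after: a custom analyzer (the hypothetical helper ingredients_as_tokens) that treats every ingredient string as one token, combined with binary=True so that no cell exceeds 1. The recipes list is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = {'chicken': 0, 'chicken broth': 1, 'carrots': 2}

# Tokens produced by the n-gram settings discussed above.
bigrams_only = CountVectorizer(ngram_range=(2, 2), lowercase=False).build_analyzer()
uni_and_bi = CountVectorizer(ngram_range=(1, 2), lowercase=False).build_analyzer()
print(bigrams_only('chicken broth'))  # ['chicken broth']
print(bigrams_only('chicken'))        # []
print(uni_and_bi('chicken broth'))    # ['chicken', 'broth', 'chicken broth']


def ingredients_as_tokens(recipe):
    """Hypothetical analyzer: a recipe is already a list of ingredient
    strings, so return it unchanged and skip tokenization entirely."""
    return recipe


# Alternative sketch: exact matching of whole ingredient strings.
exact = CountVectorizer(vocabulary=vocabulary,
                        analyzer=ingredients_as_tokens,
                        binary=True)

# Hypothetical recipes, one list of ingredients per recipe.
recipes = [['chicken', 'carrots'], ['chicken broth', 'carrots']]
binary_matrix = exact.transform(recipes).toarray().tolist()
print(binary_matrix)  # [[1, 0, 1], [0, 1, 1]]
```

With this variant each recipe is passed as a single document, so the per-ingredient loop and the row sum are no longer needed. Ingredients missing from the vocabulary are silently ignored.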
    

    Hope this late answer helps!