python pandas spacy

Output of for loop filling down in dataframe instead of returning corresponding values for each row


I'm using spaCy to process a series of sentences and return the five most common words in each sentence. My goal is to store the output of that frequency analysis (using Counter) in a column beside each corresponding sentence. I think this is just the lack of coffee and sleep talking here, but I'm stuck on why this keeps outputting a dataframe in which the first value fills all the way down (and repeats) instead of unique values that match the output for each sentence.

Code:

# test_data is a DataFrame with three columns: a unique identifier, a title, and a sentence for each title.

for value in test_data['desc']: # for each sentence in dataset
    desc = nlp(value) # run spacy natural language processing on the description
    words = [
        token.text # for each token, etc
        for token in desc
        if not token.is_stop and not token.is_punct # essentially, just keywords, no filler
    ]
    keys = list(Counter(words).most_common(5)) # store values from Counter 
    key_list = ", ".join(map(str, keys)) # convert list to string
    test_data['key'] = key_list # carry list over to dataframe

The output I'm getting is something like:

uniq title desc key
1 Title one... Sentence one.. ('kword1', 12), ('kword2', 8), ('kword3', 7)
2 Title two... Sentence two... ('kword1', 12), ('kword2', 8), ('kword3', 7)
3 Title three... Sentence three... ('kword1', 12), ('kword2', 8), ('kword3', 7)
4 Title four ... Sentence four... ('kword1', 12), ('kword2', 8), ('kword3', 7)

Where kword1, 2 and 3 are all correct for the first row (i.e., they are the right output for Sentence one), but they are duplicated down every row, so they are wrong for Sentence two, three and four.

I'm not sure if this makes any sense and I'm a bit of a Python novice without a comp sci background/foundation so I am all ears for help. Thank you in advance!!


Solution

  • Your mistake is here:

    test_data['key'] = key_list
    

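    Assigning a single string to test_data['key'] sets that value on every row, so each iteration overwrites the entire column. When the loop finishes, every row holds whatever the last iteration assigned.

    Instead, put the per-sentence logic in a function and let pandas build the column, one value per row: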

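    from collections import Counter

    def count5(text):
        # nlp is the spaCy pipeline already loaded in the question
        desc = nlp(text)
        words = [token.text for token in desc if not token.is_stop and not token.is_punct]
        keys = Counter(words).most_common(5)  # already a list of (word, count) tuples
        return ", ".join(map(str, keys))

    # map() calls count5 once per sentence and keeps each result on its own row
    test_data["key"] = test_data["desc"].map(count5)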
    

    Output:

    >>> test_data
                                                    desc                                                key
    0  Recent years have brought a revolution in the ...  ('languages', 2), ('Recent', 1), ('years', 1),...
    1  The latest AI models are unlocking these areas...  ('latest', 1), ('AI', 1), ('models', 1), ('unl...
    2  The examples of NLP use cases in everyday live...  ('examples', 1), ('NLP', 1), ('use', 1), ('cas...
    3  Natural language processing algorithms emphasi...  ('Natural', 1), ('language', 1), ('processing'...
    4  The outline of NLP examples in real world for ...  ('translation', 3), ('outline', 1), ('NLP', 1)...
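
If you would rather keep the explicit loop, a minimal sketch of the same idea follows. It is only an illustration under assumptions: the three-row test_data here is made-up stand-in data, and top_words is a hypothetical helper that splits on whitespace in place of the spaCy token filter, so that the snippet runs without a language model. The first loop reproduces the overwrite, and the second collects one result per sentence and assigns the whole list once.

```python
import pandas as pd
from collections import Counter

# Hypothetical stand-in data; the real test_data has uniq, title and desc columns.
test_data = pd.DataFrame({"desc": ["red red blue", "green green green red", "blue"]})

def top_words(text, n=5):
    # Plain whitespace split stands in for the spaCy token filter in the question.
    return ", ".join(map(str, Counter(text.split()).most_common(n)))

# Pitfall: assigning a single string sets it on every row,
# so each pass of the loop overwrites the whole column.
for value in test_data["desc"]:
    test_data["key"] = top_words(value)
broadcast_result = test_data["key"].tolist()

# Loop-based fix: collect one result per sentence, then assign the list once.
results = []
for value in test_data["desc"]:
    results.append(top_words(value))
test_data["key"] = results
per_row_result = test_data["key"].tolist()
```

After the first loop, broadcast_result contains the same string in every position. After the second, per_row_result has a distinct entry for each sentence. Assigning a list works because it has exactly one element per row; the map() version above does the same bookkeeping for you.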