I'm using spaCy to process a series of sentences and return the five most common words in each sentence. My goal is to store the output of that frequency analysis (using Counter) in a column beside each corresponding sentence. I think this is just the lack of coffee and sleep talking here, but I'm stuck on why this keeps outputting a DataFrame where the first row's value repeats all the way down, instead of unique values that match the output for each sentence.
Code:
# test_data is a DataFrame with three columns: a unique identifier, a title, and a sentence for each title
for value in test_data['desc']:  # for each sentence in the dataset
    desc = nlp(value)  # run spaCy natural language processing on the description
    words = [
        token.text  # for each token, etc.
        for token in desc
        if not token.is_stop and not token.is_punct  # essentially, just keywords, no filler
    ]
    keys = list(Counter(words).most_common(5))  # store values from Counter
    key_list = ", ".join(map(str, keys))  # convert list to string
    test_data['key'] = key_list  # carry list over to dataframe
The output I'm getting is something like:
| uniq | title | desc | key |
|---|---|---|---|
| 1 | Title one... | Sentence one.. | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
| 2 | Title two... | Sentence two... | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
| 3 | Title three... | Sentence three... | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
| 4 | Title four ... | Sentence four... | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
Here kword1, 2, and 3 are all correct for the first row (i.e., it's the right output for Sentence one), but that same result is duplicated down every row (not the correct output for Sentences two, three, and four).
I'm not sure if this makes any sense and I'm a bit of a Python novice without a comp sci background/foundation so I am all ears for help. Thank you in advance!!
Your mistake is here:
test_data['key'] = key_list
You rewrite the entire column on each iteration.
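For context, assigning a single string to a DataFrame column broadcasts that value to every row, so each pass through your loop overwrites the whole `key` column; the last iteration wins (or, with your data, whichever value was written last fills down). A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"desc": ["first sentence", "second sentence"]})

# Assigning a scalar broadcasts it to every row of the column
df["key"] = "('kword1', 12)"
print(df["key"].tolist())  # → ["('kword1', 12)", "('kword1', 12)"]
```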
You can instead use a function and let pandas build the column:
def count5(row):
    desc = nlp(row)
    words = [token.text for token in desc if not token.is_stop and not token.is_punct]
    keys = list(Counter(words).most_common(5))
    key_list = ", ".join(map(str, keys))
    return key_list
test_data["key"] = test_data["desc"].map(count5)
Output:
>>> test_data
desc key
0 Recent years have brought a revolution in the ... ('languages', 2), ('Recent', 1), ('years', 1),...
1 The latest AI models are unlocking these areas... ('latest', 1), ('AI', 1), ('models', 1), ('unl...
2 The examples of NLP use cases in everyday live... ('examples', 1), ('NLP', 1), ('use', 1), ('cas...
3 Natural language processing algorithms emphasi... ('Natural', 1), ('language', 1), ('processing'...
4 The outline of NLP examples in real world for ... ('translation', 3), ('outline', 1), ('NLP', 1)...
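Equivalently, if you prefer to keep the explicit loop, accumulate one result string per sentence in a list and assign the whole list to the column once. A minimal sketch with made-up data, using a trivial `str.split` tokenizer as a stand-in for the spaCy token filtering:

```python
from collections import Counter

import pandas as pd

test_data = pd.DataFrame({"desc": ["cats chase cats", "dogs bark loudly"]})

keys_per_row = []
for value in test_data["desc"]:
    words = value.split()  # stand-in for the spaCy stop-word/punctuation filtering
    keys = Counter(words).most_common(5)
    keys_per_row.append(", ".join(map(str, keys)))

# One assignment, one value per row: the list length matches the row count,
# so pandas aligns each entry with its row instead of broadcasting a scalar.
test_data["key"] = keys_per_row
print(test_data["key"].tolist())
# → ["('cats', 2), ('chase', 1)", "('dogs', 1), ('bark', 1), ('loudly', 1)"]
```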