Tags: python, nlp, gensim, phrase

Generating Trigrams with Gensim's Phraser Package in Python


I have the following code snippet which I created with the help of this tutorial for unsupervised sentiment analysis purposes:

from gensim.models.phrases import Phrases, Phraser

sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[sent]
sentences[1]

file_export = file_model.copy()
file_export['old_message'] = file_export.message
file_export.old_message = file_export.old_message.str.join(' ')
file_export.message = file_export.message.apply(lambda x: ' '.join(bigram[x]))

file_export.to_csv('cleaned_dataset.csv', index=False)

Since I now want trigrams as well as bigrams, I tried adjusting it to:

sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
trigram = Phraser(bigram[phrases])
sentences = trigram[sent]
sentences[1]

file_export = file_model.copy()
file_export['old_message'] = file_export.message
file_export.old_message = file_export.old_message.str.join(' ')
file_export.message = file_export.message.apply(lambda x: ' '.join(trigram[x]))

file_export.to_csv('cleaned_dataset.csv', index=False)

But when I run this, I get TypeError: 'int' object is not iterable, which I assume refers to my adjustment trigram = Phraser(bigram[phrases]). I am using gensim 4.1.2. Unfortunately, I have no computer science background, and the solutions I find online don't help.


Solution

  • As a general matter, it's best if you include in your question (by later editing if necessary) the entire multiline error message you received, including any 'traceback' showing involved filenames, line-numbers, & lines-of-source-code. That helps potential answerers focus on exactly where things are going wrong.

    Also, beware that many of the tutorials at 'towardsdatascience.com' are of very poor quality. I can't see the exact one you've linked without registering (which I'd rather not do), but from your code excerpts, I already see a few issues of varying severity for what you're trying to do:

    And in general, beware: without a lot of data, the output of any number of Phrases applications may not be strong. And in all cases, it may not 'look right' to human sensibilities, because it's purely statistical and co-occurrence driven. (Though, even if its output looks weird, it will sometimes help on certain info-retrieval/classification tasks, as it manages to create useful new features that are different from the unigrams alone.)

    My suggestions would be:

    (Unfortunately, I can't find an example of two-level Phrases use in the current Gensim docs. I think some old examples were edited out during doc simplification work. But there are a couple of examples that aren't all-wrong in the project's testing source code: search the file https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_phrases.py for trigram. But remember that those, as focused minimal tests, aren't best practices either.)