I have the following code snippet which I created with the help of this tutorial for unsupervised sentiment analysis purposes:
sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[sent]
sentences[1]
file_export = file_model.copy()
file_export['old_message'] = file_export.message
file_export.old_message = file_export.old_message.str.join(' ')
file_export.message = file_export.message.apply(lambda x: ' '.join(bigram[x]))
file_export.to_csv('cleaned_dataset.csv', index=False)
Since now I want to have bigrams as well as trigrams, I tried it by adjusting it to:
sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
trigram = Phraser(bigram[phrases])
sentences = trigram[sent]
sentences[1]
file_export = file_model.copy()
file_export['old_message'] = file_export.message
file_export.old_message = file_export.old_message.str.join(' ')
file_export.message = file_export.message.apply(lambda x: ' '.join(trigram[x]))
file_export.to_csv('cleaned_dataset.csv', index=False)
But when I run this, I get TypeError: 'int' object is not iterable, which I assume refers to my adjustment to trigram = Phraser(bigram[phrases]). I am using gensim 4.1.2.
Unfortunately, I have no computer science background and solutions I find online don't help out.
As a general matter, it's best if you include in your question (by later editing if necessary) the entire multiline error message you received, including any 'traceback' showing involved filenames, line-numbers, & lines-of-source-code. That helps potential answerers focus on exactly where things are going wrong.
Also, beware that many of the tutorials at 'towardsdatascience.com' are of very poor quality. I can't see the exact one you've linked without registering (which I'd rather not do), but from your code excerpts, I already see a few issues of varying severity for what you're trying to do:
- To apply the Phrases algorithm more than once, to compose phrases longer than bigrams, you can't reuse the model trained for bigrams. You need to train a new model for each new level of combination, on the output of the prior model. That is, the input to the Phrases model for trigrams (which must itself be trained) must be the results of applying the bigram model, so it sees a mixture of the original unigrams & the now-combined bigrams.
- Using min_count=1 on these sorts of data-hungry models can easily backfire. They need many examples for their statistical methods to do anything sensible; discarding the rarest words usually helps to speed processing, shrink the models, & work mainly on tokens where there are enough examples to do something possibly sensible. (With very few, or only 1, usage examples, results may seem somewhat random/arbitrary.)
- The Phraser utility class (which just exists to optimize the Phrases model a bit, when you're sure you're done training/tuning) has been renamed FrozenPhrases. (The old name still works, but this is an indication the tutorial hasn't been recently refreshed.)

And in general, beware: without a lot of data, the output of any number of Phrases applications may not be strong. And in all cases, it may not 'look right' to human sensibilities, because it's purely statistical & co-occurrence driven. (Though, even if its output looks weird, it will sometimes help on certain info-retrieval/classification tasks, as it manages to create useful new features that are different from the unigrams alone.)
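To make the min_count point concrete, here is a rough sketch of the default scoring formula as I understand it from the Gensim docs (the one taken from the original word2vec phrases paper). The function name phrase_score and all the counts below are made up for illustration; this is not Gensim's own code.

```python
def phrase_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Sketch of the default Phrases scorer: higher means 'more phrase-like'."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# Hypothetical counts: a pair of words seen only twice, always together...
rare = phrase_score(2, 2, 2, len_vocab=10_000, min_count=1)
# ...versus a pair of common words that co-occur 500 times.
common = phrase_score(1_000, 1_000, 500, len_vocab=10_000, min_count=1)
# The same rare pair with a stricter min_count.
rare_strict = phrase_score(2, 2, 2, len_vocab=10_000, min_count=5)

print(rare, common, rare_strict)
```

With min_count=1, the pair seen just twice far outscores the well-attested one, which is how a couple of chance co-occurrences can end up glued together. Raising min_count pushes such thinly evidenced pairs below any positive threshold.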
My suggestions would be:
- Only add Phrases combinations after things are working without them, so you can compare results & see if it's helping.
- If you do apply it more than once, make sure each later Phrases is initialized with the already-bigram-combined texts.

(Unfortunately, I can't find an example of two-level Phrases use in the current Gensim docs. I think some old examples were edited out in doc simplification work. But there are a couple of examples of it being used reasonably in the project's testing source code: search the file https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_phrases.py for trigram. But remember those aren't best practices either, as they are focused minimal tests.)