I am trying to get the doc2vec function to work in Python 3. I have the following code:
tekstdata = [[index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]

def prep(x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []
    for x, y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output

tekstdata = tagdocs(tekstdata)
print(tekstdata[100])
vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size=100, window=4, min_count=3, iter=2)
ranks = []
second_ranks = []
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)
Everything works, as far as I can tell, until the rank calculation. The error I get says that the tag of the document I put in is not in the list, e.g. '10' is not in the list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the similarity function that does not work. The model trains on my data (1000 documents) and builds a vocab, which is tagged. The documentation I have mainly used is this: the Gensim documentation tutorial.
I hope that someone can help. If any additional info is needed, please let me know. Best, Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
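One way to look at the list is to check membership before calling .index(), and to print what was actually returned when the tag is missing. The following is only a minimal sketch: rank_of and sample_sims are hypothetical names, and the sample pairs are made up to mimic the (tag, similarity) shape of the results in the question.

```python
def rank_of(doc_id, sims):
    """Return the position of doc_id in sims, or None if it is absent."""
    ids = [docid for docid, _sim in sims]
    if doc_id not in ids:
        # Show what the list really contains instead of raising ValueError
        print("%r not found; first ids returned: %r" % (doc_id, ids[:10]))
        return None
    return ids.index(doc_id)

# Made-up results, shaped like the (tag, similarity) pairs in the question
sample_sims = [("1", 0.93), ("0", 0.88), ("7", 0.41)]
found = rank_of("0", sample_sims)     # tag is present
missing = rank_of("10", sample_sims)  # tag is absent, so the ids are printed
```

If the printed ids are all single characters, that points to a problem in how the tags were built rather than in the similarity search.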
It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'], and len(vectorModel.doctags) will be just 10 (for the 10 single-digit strings).
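The effect can be reproduced without training a model. The sketch below uses a namedtuple as a stand-in for gensim's TaggedDocument (which is itself a namedtuple with words and tags fields) and a hypothetical collect_tags helper. It assumes only that the tags are consumed by iterating over them, and it is not gensim's actual code.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (fields: words, tags)
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def collect_tags(docs):
    """Gather the distinct tags seen when iterating over each document's tags."""
    seen = set()
    for doc in docs:
        for tag in doc.tags:  # iterating a plain string yields single characters
            seen.add(tag)
    return seen

words = ["some", "tokens"]  # placeholder tokens
# Tags passed as a bare string, as in tagdocs() in the question
string_tagged = [TaggedDocument(words, str(i)) for i in range(1000)]
# Tags passed as a one-element list
list_tagged = [TaggedDocument(words, [str(i)]) for i in range(1000)]

string_tags = collect_tags(string_tagged)
list_tags = collect_tags(list_tagged)
print(len(string_tags))  # 10: only the digit characters '0' to '9'
print(len(list_tags))    # 1000: one tag per document
```

Under that assumption, the corresponding change in the question's tagdocs() would be to pass [x] rather than x as the second argument to TaggedDocument.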
Separate comments on your setup:
- 1000 documents is a small dataset for Doc2Vec, where most published results use tens-of-thousands to millions of documents
- an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
- infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025)