I have 60,000 documents which I processed with gensim, obtaining a 60000 x 300 matrix. I exported this matrix as a CSV file. When I import it into the ELKI environment and run k-means clustering, I get the error below.
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
This sounds strange, but I found a solution to this issue by opening the exported CSV file, doing Save As, and saving it again as a CSV file. While the original file is 437 MB, the second file is 163 MB. I had used the numpy function np.savetxt to save the doc2vec vectors, so this seems to be a Python issue rather than an ELKI issue.
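For reference, the original export presumably looked roughly like the sketch below (a minimal reconstruction: the file name is made up, and `model` is assumed to be the trained gensim Doc2Vec model; with no fmt argument, np.savetxt falls back to its default format '%.18e'):

import numpy as np
textVect = model.docvecs.doctag_syn0                            # 60000 x 300 matrix of document vectors
np.savetxt(r'D:\Backup\original.csv', textVect, delimiter=',')  # no fmt given -> default '%.18e'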
Edit: The above solution is not useful. Instead, I exported the doc2vec output that was created with the gensim library and, while exporting, set the format of the values explicitly to %1.22e, i.e. the values are written in exponential notation with 22 digits of precision. Below is the complete export code.
import numpy as np
textVect = model.docvecs.doctag_syn0  # doc2vec vectors from the trained gensim model
np.savetxt(r'D:\Backup\expo22.csv', textVect, delimiter=',', fmt='%1.22e')  # raw string for the Windows path
The CSV file created this way runs without any issue in the ELKI environment.
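The mindim=266,maxdim=300 in the original error suggests that ELKI parsed some rows with fewer than 300 columns, so k-means could not find the fixed-dimensional vector field it requires. A quick way to check an exported file for this (a minimal sketch, using the path from the code above) is:

import csv
# Count the number of values per row; every row of a clean export should have 300
with open(r'D:\Backup\expo22.csv', newline='') as f:
    lengths = {len(row) for row in csv.reader(f)}
print(lengths)  # a single entry {300} means all rows have the same dimensionality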