Tags: python, python-3.x, k-means, h2o4gpu

Clustering text documents using h2o4gpu K-Means in Python


I'm interested in using h2o4gpu to cluster text documents. For reference, I've followed this tutorial, but have adapted the code to use h2o4gpu.

from sklearn.feature_extraction.text import TfidfVectorizer
import h2o4gpu

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = h2o4gpu.KMeans(n_gpus=1, n_clusters=true_k, init='k-means++',
                       max_iter=100, n_init=1)
model.fit(X)

However, when running the code sample above, I receive the following errors:

Traceback (most recent call last):
  File "dev.py", line 20, in <module>
    model.fit(X)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/kmeans.py", line 810, in fit
    res = self.model.fit(X, y)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/kmeans.py", line 303, in fit
    X_np, _, _, _, _, _ = _get_data(X, ismatrix=True)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/utils.py", line 119, in _get_data
    data, ismatrix=ismatrix, dtype=dtype, order=order)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/utils.py", line 79, in _to_np
    outdata = outdata.astype(dtype, copy=False, order=nporder)
ValueError: setting an array element with a sequence.

I've searched for an h2o4gpu.feature_extraction.text.TfidfVectorizer, but haven't found one in h2o4gpu. Is there a way to rectify this issue?


Solution

  • X = TfidfVectorizer(stop_words='english').fit_transform(documents)

    This returns a sparse matrix of type scipy.sparse.csr_matrix, not a dense array.
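    To see why the fit fails, you can inspect what fit_transform returns. A minimal sketch (using two of the sample documents from the question):

    ```python
    from scipy.sparse import issparse
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["Human machine interface for lab abc computer applications",
                 "Graph minors A survey"]

    X = TfidfVectorizer(stop_words='english').fit_transform(documents)

    print(type(X))      # a scipy sparse matrix, not a NumPy array
    print(issparse(X))  # True
    print(X.toarray())  # the dense 2D representation KMeans can consume
    ```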

    Currently H2O4GPU supports only dense representations for KMeans. This means you'd have to transform your X into either a plain 2D Python list or a 2D NumPy array, filling in the missing elements with 0.

    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(documents)
    X_dense = X.toarray()
    
    true_k = 2
    model = h2o4gpu.KMeans(n_gpus=1, n_clusters=true_k, init='k-means++',
                           max_iter=100, n_init=1)
    model.fit(X_dense)
    

    That should do the trick. This is not an optimal solution for NLP, since the dense matrix will probably require much more memory, but sparse support for KMeans is not yet on our roadmap.
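    If the dense matrix becomes too large, one common workaround (my suggestion, not something h2o4gpu provides) is to project the sparse TF-IDF matrix onto a small number of dense components with scikit-learn's TruncatedSVD (i.e. LSA) before clustering:

    ```python
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    X = TfidfVectorizer(stop_words='english').fit_transform(documents)

    # Project the sparse TF-IDF matrix onto 5 dense components (LSA).
    # n_components must be smaller than the number of features.
    svd = TruncatedSVD(n_components=5, random_state=0)
    X_dense = svd.fit_transform(X)  # an ordinary 2D NumPy array

    print(X_dense.shape)  # (9, 5)
    ```

    X_dense can then be passed to h2o4gpu.KMeans().fit() in place of X.toarray(), at the cost of clustering in the reduced LSA space rather than the full term space.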