dask, dask-ml

Can you use dask_ml kmeans on a dask array?


I have the following code:

    feature_array = da.concatenate(features, axis=1)  # .compute()
    model = KMeans(n_clusters=4)
    model.fit(features, y=None)

If I compute feature_array first, this code runs just fine, but without that step it raises an internal TypeError that I can't figure out:

      File "/Users/(...)/lib/python3.7/site-packages/dask_ml/utils.py", line 168, in check_array
        sample = np.ones(shape=shape, dtype=array.dtype)
      File "/Users/(...)/lib/python3.7/site-packages/numpy/core/numeric.py", line 207, in ones
        a = empty(shape, dtype, order)
    TypeError: 'float' object cannot be interpreted as an integer

Am I not supposed to use a dask array with dask_ml? The main reason I want to use dask_ml is that this code needs to run on larger-than-memory datasets.

Cheers, Florian


Solution

  • It works OK for me:

    In [1]: from dask_ml.cluster import KMeans

    In [2]: import dask.array as da

    In [3]: x = da.random.random((10, 3))

    In [4]: k = KMeans(n_clusters=3)

    In [5]: k.fit(x)
    Out[5]:
    KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=None,
           max_iter=300, n_clusters=3, n_jobs=1, oversampling_factor=2,
           precompute_distances='auto', random_state=None, tol=0.0001)

    I recommend providing an MCVE.

    Also, you're providing a NumPy array, not a Dask array. Note that the snippet passes `features` to `fit` rather than `feature_array`.
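For what it's worth, the TypeError in the traceback can be reproduced with NumPy alone, which may help narrow things down while putting together an MCVE. The traceback shows `np.ones(shape=shape, ...)` failing, so `shape` must contain a float. One possible cause (an assumption, since the code that builds `features` isn't shown) is a Dask array with unknown chunk sizes, whose shape contains NaN. The helper `try_ones` below is hypothetical and exists only for this illustration:

```python
import numpy as np


def try_ones(shape):
    """Return the TypeError message np.ones raises for this shape, or None on success."""
    try:
        np.ones(shape=shape, dtype=float)
    except TypeError as exc:
        return str(exc)
    return None


# An all-integer shape is accepted.
ok = try_ones((10, 3))

# A shape containing NaN (assumed here to mimic a Dask array with
# unknown chunk sizes) is rejected, because NaN is a float.
msg = try_ones((float("nan"), 3))
print(msg)
```

If printing `feature_array.shape` shows a NaN, calling `feature_array.compute_chunk_sizes()` before `fit` may be worth trying, since it resolves unknown chunk sizes without pulling the whole array into memory.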