python scikit-learn tsne

TSNE: ValueError: setting an array element with a sequence


I'm trying to pass a NumPy array to TSNE in order to reduce it to 2 columns, and then plot the result with seaborn. result is a DataFrame that I read from a CSV file.

arr=result.to_numpy()
n_components = 2
tsne = TSNE(n_components).fit_transform(arr)
arr.shape

A row of arr looks like this:

['00012_0' array([0.21321961620469082, 0.9433962264150944, 20.0, 0.0, 0.0, 0.0, 0.1984126984126984, 0.014925373134328358, 0.0], dtype=object) 'Resnet' 'Lime' 'Real']

I get the following errors:

TypeError                                 Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Input In [11], in <cell line: 30>()
     28 #compress to two columns with TSNE
     29 n_components = 2
---> 30 tsne = TSNE(n_components).fit_transform(arr)
     31 arr.shape

File ~\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:1108, in TSNE.fit_transform(self, X, y)
   1088 def fit_transform(self, X, y=None):
   1089     """Fit X into an embedded space and return that transformed output.
   1090 
   1091     Parameters
   (...)
   1106         Embedding of the training data in low-dimensional space.
   1107     """
-> 1108     embedding = self._fit(X)
   1109     self.embedding_ = embedding
   1110     return self.embedding_

File ~\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:830, in TSNE._fit(self, X, skip_num_points)
    819     warnings.warn(
    820         "'square_distances' has been introduced in 0.24 to help phase "
    821         "out legacy squaring behavior. The 'legacy' setting will be "
   (...)
    827         FutureWarning,
    828     )
    829 if self.method == "barnes_hut":
--> 830     X = self._validate_data(
    831         X,
    832         accept_sparse=["csr"],
    833         ensure_min_samples=2,
    834         dtype=[np.float32, np.float64],
    835     )
    836 else:
    837     X = self._validate_data(
    838         X, accept_sparse=["csr", "csc", "coo"], dtype=[np.float32, np.float64]
    839     )

File ~\anaconda3\lib\site-packages\sklearn\base.py:566, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    564     raise ValueError("Validation should be done on X, y or both.")
    565 elif not no_val_X and no_val_y:
--> 566     X = check_array(X, **check_params)
    567     out = X
    568 elif no_val_X and not no_val_y:

File ~\anaconda3\lib\site-packages\sklearn\utils\validation.py:746, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744         array = array.astype(dtype, casting="unsafe", copy=False)
    745     else:
--> 746         array = np.asarray(array, order=order, dtype=dtype)
    747 except ComplexWarning as complex_warning:
    748     raise ValueError(
    749         "Complex data not supported\n{}\n".format(array)
    750     ) from complex_warning

ValueError: setting an array element with a sequence.

I understand that I'm probably passing a sequence of values into a single slot, but I don't know how to change it to make it work.


Solution

  • You are right. TSNE fails when a single element of the input is itself an array, because it needs a purely numeric 2-D array. Convert every value to a number before passing the data to TSNE.

    Basically, if a row has the values

    ['00012_0', array([0.21321961620469082, 0.9433962264150944, 20.0, 0.0, 0.0, 0.0, 0.1984126984126984, 0.014925373134328358, 0.0], dtype=object), 'Resnet', 'Lime', 'Real']

    You should process it into something like

    [0, 0.21321961620469082, 0.9433962264150944, 20.0, 0.0, 0.0, 0.0, 0.1984126984126984, 0.014925373134328358, 0.0, 0, 0, 0]

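    where the nested array has been flattened into separate columns and the categorical variables have been encoded as numbers. The row above is simplified to one number per categorical variable; a full one-hot encoding produces one 0/1 column per category. Use your judgment as well: variables that are just an ID, or that are constant across the whole dataset, can be left out.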
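
A minimal sketch of that preprocessing, under some assumptions: the column names (id, features, model, explainer, label) and the toy values are hypothetical, since the question does not show the real ones, and perplexity=2 is chosen only because the toy data has 6 rows. It stacks the per-row arrays into a float matrix, one-hot encodes the categorical columns with pandas, drops the id column, and only then calls TSNE.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Hypothetical toy data shaped like the row in the question: an id column,
# a column whose cells are object arrays, and three categorical columns.
rng = np.random.default_rng(0)
result = pd.DataFrame({
    "id": [f"{i:05d}_0" for i in range(6)],
    "features": [rng.random(9).astype(object) for _ in range(6)],
    "model": ["Resnet", "Resnet", "VGG", "VGG", "Resnet", "VGG"],
    "explainer": ["Lime", "Shap", "Lime", "Shap", "Lime", "Shap"],
    "label": ["Real", "Fake", "Real", "Real", "Fake", "Fake"],
})

# Stack the per-row arrays into one (n_samples, 9) float matrix.
numeric = np.vstack(result["features"].tolist()).astype(np.float64)

# One-hot encode the categorical columns; the id column is left out.
categorical = pd.get_dummies(
    result[["model", "explainer", "label"]]
).to_numpy(dtype=np.float64)

# Every cell is now a plain number, so TSNE can validate the input.
X = np.hstack([numeric, categorical])

# perplexity must be smaller than the number of samples (6 here).
embedding = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
print(X.shape, embedding.shape)
```

With real data you would keep the default perplexity (or tune it) and plot the two columns of embedding with seaborn, using one of the original categorical columns as the hue.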