python pandas numpy matrix dtw

Trying to convert pandas df to np array, dtaidistance computes list instead


I am attempting to compute the distance matrix for an ndarray that I have converted from pandas. I tried to convert the pandas df currently in this format:

move_df = 
        movement
0       [4, 3, 6, 2]
1       [5, 2, 3, 6, 2]
2       [4, 7, 2, 3, 6, 1]
3       [4, 4, 4, 3]
...     ...
33410   [2, 6, 3, 1, 8]
[33410 x 1 columns]

to a numpy ndarray by using the following:

1) m = move_df.to_numpy() 
2) m = pd.DataFrame(move_df.tolist()).values
3) m = [move_df.tolist() for i in move_df.columns]

Each of these conversions resulted in a numpy array in this format:

[[list([4, 3, 6, 2])]
 [list([5, 2, 3, 6, 2])]
 [list([4, 7, 2, 3, 6, 1])]
 [list([4, 4, 4, 3])]
 ...
 [list([2, 6, 3, 1, 8])]]

So when I try to run dtaidistance's distance_matrix, I get the following error:

d_m = dtw.distance_matrix(m)

TypeError: unsupported operand type(s) for -: 'list' and 'list'

However, when I build a list of lists by copying and pasting several rows from the array produced by any of the methods above, the code works. That is not feasible in the long run, since the data has over 30k rows. Am I doing something wrong in the conversion from the pandas df to a numpy array? I used

print(type(m)) 

and it reports that m is a numpy array. I already know that I cannot subtract a list from a list, hence the error.

EDIT:
Output of move_df.head(10).to_dict():

{'movement': {0: [4, 3, 6, 2], 
  1: [5, 2, 3, 6, 2], 
  2: [4, 7, 2, 3, 6, 1], 
  3: [4, 4, 4, 3], 
  4: [3, 6, 2, 3, 3], 
  5: [6, 2, 1], 
  6: [1, 1, 1, 1],
  7: [7, 2, 3, 1, 1],
  8: [7, 2, 3, 2, 1],
  9: [6, 2, 3, 1]}}

Solution

  • (one of the dtaidistance authors here)

    The dtaidistance package expects one of three formats: a list of lists, a list of arrays (NumPy or standard-library array), or a 2D NumPy array.

    In your case you could do:

    series = move_df['movement'].to_list()
    dtw.distance_matrix(series)
    

    which then works on a list of lists.
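
To see why the conversions in the question produce a column of list objects while selecting the column first does not, here is a minimal sketch on the first four rows from the question. It only inspects shapes and types, so it does not call dtaidistance; the variable names are illustrative.

```python
import pandas as pd

# First four rows of the question's data
move_df = pd.DataFrame(
    {"movement": [[4, 3, 6, 2], [5, 2, 3, 6, 2], [4, 7, 2, 3, 6, 1], [4, 4, 4, 3]]}
)

# Converting the whole DataFrame keeps the column axis, so each row is a
# one-element array that wraps a Python list (dtype object).
m = move_df.to_numpy()
print(m.shape, m.dtype)

# Subtracting two such rows ends up subtracting list from list.
try:
    m[0] - m[1]
except TypeError as exc:
    print("TypeError:", exc)

# Selecting the column first gives a flat list of lists instead.
series = move_df["movement"].to_list()
print(type(series).__name__, type(series[0]).__name__)
```

The flat `series` is the shape of input that `dtw.distance_matrix(series)` is called with above.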

    To use the fast C implementation, an array is required (either a NumPy array or a standard-library array). If you want to keep the different lengths, you can do:

    series = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double)).to_list()
    dtw.distance_matrix_fast(series)
    

    Note that it may make sense to apply this conversion to move_df itself, so that you do it only once and do not have to keep track of two nearly identical data structures. After that, the to_list call is sufficient. Thus:

    move_df['movement'] = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double))
    series = move_df['movement'].to_list()
    dtw.distance_matrix_fast(series)
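
The standard-library alternative mentioned above can be sketched as follows. This assumes the `'d'` (C double) typecode is the right counterpart of `dtype=np.double` in the NumPy version; the sample rows are copied from the question and `movements` is an illustrative name.

```python
from array import array

# Sample rows taken from the question's 'movement' column
movements = [[4, 3, 6, 2], [5, 2, 3, 6, 2], [6, 2, 1]]

# 'd' is the typecode for C doubles; integers are converted to floats.
series = [array("d", row) for row in movements]

print(series[0])  # array('d', [4.0, 3.0, 6.0, 2.0])
```

The resulting `series` should be usable in place of the NumPy-based list in `dtw.distance_matrix_fast(series)`, provided the package's C extension is installed.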
    

    If you want to use a 2D NumPy matrix, you need to truncate or pad all series to the same length, as explained in other answers (for DTW, padding is more common because it does not lose information).
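
A minimal sketch of the padding route, assuming right-padding is acceptable for your data. The helper `pad_to_matrix` and its `pad_value` parameter are hypothetical, not part of dtaidistance, and repeating the last element by default is only one possible choice of padding.

```python
import numpy as np

def pad_to_matrix(sequences, pad_value=None):
    """Right-pad non-empty ragged sequences to a common length.

    With pad_value=None each sequence is padded by repeating its last
    element (an illustrative choice); otherwise pad_value is used.
    """
    max_len = max(len(s) for s in sequences)
    matrix = np.empty((len(sequences), max_len), dtype=np.double)
    for i, s in enumerate(sequences):
        fill = s[-1] if pad_value is None else pad_value
        matrix[i, :len(s)] = s      # original values
        matrix[i, len(s):] = fill   # padding on the right
    return matrix

series = [[4, 3, 6, 2], [5, 2, 3, 6, 2], [6, 2, 1]]
matrix = pad_to_matrix(series)
print(matrix.shape)  # (3, 5)
```

The result is a regular 2D array of doubles, which is the third input format. Whether repeated values or a constant such as 0 distorts the distances less depends on the data.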

    P.S. This assumes you want univariate DTW; the ndim subpackage for multivariate time series expects a different data structure.