Tags: python, numpy, tensorflow, keras, ragged-tensors

TensorFlow: create a tf.ragged.constant from a large dataset


I have a DataFrame of 50,000 users, each with a different number of rows:

                      id            feature_1  ...           feature10  feature11
0  1587712104294-4384584            -0.661835  ...           -1.768028   -0.38924
1  1587712104294-4384584            -0.661835  ...           -1.709090   -0.38924
---- User 2 starts here ----
2  1587712104294-1234584            -0.661835  ...           -1.708693   -0.38924
3  1587712104294-1234584            -0.661835  ...           -1.627594   -0.38924
4  1587712104294-1234584            -0.653476  ...           -1.329767   -0.38924

I'm using the following code to create a tf.ragged.constant:

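import numpy as np
import tensorflow as tf

x_np_values = data.values
# Take all columns except the id column and use the id to group the arrays.
# np.unique returns first-occurrence indices in sorted-id order, so sort them before splitting.
split_at = np.sort(np.unique(x_np_values[:, 0], return_index=True)[1])[1:]
X = np.split(x_np_values[:, 1:], split_at)
X = tf.ragged.constant(X)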

The code drops the id column and creates a ragged constant with one group of rows per user. However, this only works on a small subset of the data; on the entire dataset it takes ages and sometimes crashes my machine.

What is the proper way to group by id and create a ragged constant from the remaining columns?


Solution

  • I found this method to be much faster at creating the ragged tensor:

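    def get_ragged_constants(data):
        # 'GROUP_ID' is the id column (called 'id' in the question's DataFrame).
        # sort=False keeps the groups in order of first appearance, matching the row order
        # (this assumes each group's rows are contiguous).
        row_lengths = data.groupby('GROUP_ID', sort=False).size().to_numpy()
        # Drop the id column so that values holds only the feature columns.
        values = data.drop(columns='GROUP_ID').to_numpy()
        return tf.RaggedTensor.from_row_lengths(values=values, row_lengths=row_lengths)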
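
    To make the idea concrete, here is a minimal, self-contained sketch on a toy DataFrame. It assumes the id column is named `id`, as in the question, and that all rows of a user are adjacent. The helper name `to_ragged`, the toy data, and the `float32` cast are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

def to_ragged(data, id_col="id"):
    # Hypothetical helper: builds one ragged row per user, in order of first appearance.
    # Assumes all rows belonging to a user are adjacent in the DataFrame.
    row_lengths = data.groupby(id_col, sort=False).size().to_numpy()
    # Keep only the feature columns, as one flat 2-D numeric array.
    values = data.drop(columns=id_col).to_numpy(dtype=np.float32)
    return tf.RaggedTensor.from_row_lengths(values=values, row_lengths=row_lengths)

# Toy data: user "b" has 2 rows, user "a" has 3 rows.
df = pd.DataFrame({
    "id": ["b", "b", "a", "a", "a"],
    "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "feature_2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

ragged = to_ragged(df)
print(ragged.row_lengths())  # per-user row counts
print(ragged[1])             # the three feature rows of user "a"
```

    The speed-up most likely comes from handing TensorFlow a single flat array plus the row lengths, rather than a nested Python list that tf.ragged.constant has to walk element by element. If a user's rows are not adjacent, sorting first (for example with `data.sort_values("id", kind="stable")`) should restore the assumption.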