python pandas numpy csv

Can't get grouped data into numpy array


I have a CSV file like this:

Ngày(Date),Số(Number)
07/03/2025,8
07/03/2025,9
...
06/03/2025,6
06/03/2025,10
06/03/2025,18
06/03/2025,14
...

(Each day has 27 numbers)

I want to predict the list of 27 numbers for the next day using an LSTM. I keep getting an error at this step:

data_matrix = np.array(grouped_data.loc[:, "Số"].tolist())

with

KeyError: 'Số'

(which means 'Number')

Here is my code:

import numpy as np
import pandas as pd

df = pd.read_csv("C:/Users/Admin/lonum_fixed.csv", encoding="utf-8", sep=",")
df.columns = df.columns.str.strip()

grouped_data = df.groupby("Ngày")[["Số"]].apply(lambda x: list(map(int, x["Số"].values))).reset_index()
grouped_data["Số"] = grouped_data["Số"].apply(lambda x: eval(x) if isinstance(x, str) else x)

data_matrix = np.array(grouped_data.loc[:, "Số"].tolist())

Solution

  • First: read_csv() already converts the values to integers, so there is no need for map(int, ...). And apply(... list ...) already creates lists, so there is no need for eval().


    The problem is that groupby().apply() returned an unnamed result, so after reset_index() the DataFrame has a column named 0 instead of "Số". The KeyError is therefore raised by grouped_data["Số"].apply(...), not by grouped_data.loc[:, "Số"].
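
A minimal sketch of that behaviour, using a small in-memory frame (the sample values are made up for illustration): the function passed to apply() returns a plain list, so the result comes back without a name and "Số" is no longer among the columns after reset_index().

```python
import pandas as pd

# Small stand-in for the CSV (values chosen only for illustration).
df = pd.DataFrame({
    "Ngày": ["07/03/2025", "07/03/2025", "06/03/2025", "06/03/2025"],
    "Số": [8, 9, 6, 10],
})

# Same pattern as in the question: [["Số"]] selects a DataFrame and the
# lambda returns a plain list for each group.
grouped = df.groupby("Ngày")[["Số"]].apply(lambda x: list(x["Số"])).reset_index()

# The column holding the lists has lost its name, so "Số" is missing.
columns = list(grouped.columns)
print(columns)
print("Số" in columns)
```

Printing grouped_data.columns right after the groupby step is a quick way to confirm this on the real file.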

    You can reduce code to

    grouped_data = df.groupby("Ngày")["Số"].apply(list).reset_index(name="Số")
    

    which converts each group to a list and names the resulting column "Số" again. I use ["Số"] (a Series) instead of [["Số"]] (a DataFrame).

    Because pandas keeps a column's data in a numpy.array, you can get it with

    data_matrix = grouped_data["Số"].values
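
Note that .values here is a 1-D object array whose elements are Python lists. If a 2-D numeric matrix with one row per day is wanted, a sketch like the following should work, assuming every day has the same number of values (as in the question, where each day has 27). The sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: two days with three numbers each.
df = pd.DataFrame({
    "Ngày": ["07/03/2025"] * 3 + ["06/03/2025"] * 3,
    "Số": [8, 9, 4, 6, 10, 18],
})

grouped_data = df.groupby("Ngày")["Số"].apply(list).reset_index(name="Số")

# .values keeps the lists as objects: one element per day.
as_objects = grouped_data["Số"].values

# tolist() gives a list of equal-length lists, which np.array turns
# into a regular 2-D integer array with one row per day.
data_matrix = np.array(grouped_data["Số"].tolist())

print(as_objects.shape, as_objects.dtype)
print(data_matrix.shape, data_matrix.dtype)
```

If the groups have different lengths, np.array cannot build a rectangular matrix from the nested lists (recent NumPy versions raise a ValueError), so it is worth checking the group sizes first.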
    

    Full code used for tests:

    I used io.StringIO only to create a file-like object in memory, so that anyone can copy and run the example. You can pass a filename instead.

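    import io
    
    import numpy as np
    import pandas as pd
    
    # Sample data in the same format as the CSV file
    text = '''Ngày,Số
    07/03/2025,8
    07/03/2025,9
    06/03/2025,6
    06/03/2025,10
    06/03/2025,18
    06/03/2025,14
    '''
    
    # Read from an in-memory file-like object; swap in the commented line to read the real file
    df = pd.read_csv(io.StringIO(text), encoding="utf-8", sep=",")
    #df = pd.read_csv("C:/Users/Admin/lonum_fixed.csv", encoding="utf-8", sep=",")
    # Remove stray whitespace from the column names
    df.columns = df.columns.str.strip()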
    print('----')
    print(df)
    print('----')
    print(df.dtypes)
    
    grouped_data = df.groupby("Ngày")["Số"].apply(list).reset_index(name="Số")
    print('---')
    print(grouped_data)
    print('----')
    print('type:', type(grouped_data))
    
    print('---')
    print('type:', type(grouped_data["Số"].values))
    print('----')
    print('values  :', grouped_data["Số"].values)
    print('np.array:', np.array(grouped_data["Số"]))
    
    data_matrix = grouped_data["Số"].values
    #data_matrix = np.array(grouped_data["Số"])
    
    print('----')
    print('data_matrix:', data_matrix)
    

    Result:

    ----
             Ngày  Số
    0  07/03/2025   8
    1  07/03/2025   9
    2  06/03/2025   6
    3  06/03/2025  10
    4  06/03/2025  18
    5  06/03/2025  14
    ----
    Ngày    object
    Số       int64
    dtype: object
    ---
             Ngày               Số
    0  06/03/2025  [6, 10, 18, 14]
    1  07/03/2025           [8, 9]
    ----
    type: <class 'pandas.core.frame.DataFrame'>
    ---
    type: <class 'numpy.ndarray'>
    ----
    values  : [list([6, 10, 18, 14]) list([8, 9])]
    np.array: [list([6, 10, 18, 14]) list([8, 9])]
    ----
    data_matrix: [list([6, 10, 18, 14]) list([8, 9])]