python pandas numpy svm reshape

Reshaping rows and columns, then converting to a NumPy array


The following DataFrame contains integer values. I want to reshape them into a new column in which each row combines the values from three consecutive rows of every column of the original DataFrame.

import pandas as pd
data = pd.DataFrame({'column1': [123, 456, 789, 321, 654, 987, 1234, 45678],
                     'column2': [123, 456, 789, 321, 654, 987, 1234, 45678]})
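data = data.astype(str)  # convert to strings so the values can be joined
n = len(data) // 3  # number of complete 3-row groups (not used below)
# Build a new DataFrame whose rows are space-separated strings (no commas)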
X = pd.DataFrame({
    'vector': [' '.join(data.iloc[i:i+3, :].values.flatten()) for i in range(0, len(data), 3)]
})

Output:
    vector
    0  123 123 456 456 789 789
    1  321 321 654 654 987 987
    2    1234 1234 45678 45678

This DataFrame now contains 'str' values. Is it possible to convert it back to 'int'? I want to pass it to an SVM algorithm as a NumPy array, but the 'str' objects cause an error there. I was unable to convert it back to 'int'. Is there an alternative way to do this?


Solution

  • You can get the same result more idiomatically by splitting the DataFrame into groups of 3 consecutive rows and applying a concatenating function to each group. There is no need to cast to str along the way:

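    def concat(x):
        # x is one group of rows; concatenate its rows one after another
        return pd.concat([x.T[c] for c in x.T]).to_list()


    # data.index // 3 assigns rows 0-2 to group 0, rows 3-5 to group 1, ...
    new = data.groupby(data.index // 3).apply(concat)
    print(new)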

    gives

    0    [123, 123, 456, 456, 789, 789]
    1    [321, 321, 654, 654, 987, 987]
    2        [1234, 1234, 45678, 45678]
    dtype: object
    

    The result is actually a Series rather than a DataFrame, and each of its values has the type returned by concat, which is a list in my example. If you need another type, convert accordingly, e.g. with .to_numpy() instead of .to_list().
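Since the question asks for a NumPy array to feed an SVM, the following is a minimal sketch of one way to get a 2-D integer array directly. It assumes that the incomplete last group (here only 2 rows) can simply be dropped, because a 2-D array needs rows of equal length; that choice is an assumption, not something stated in the question, and padding the short group would be an alternative. The names rows_per_group, n_groups and features are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'column1': [123, 456, 789, 321, 654, 987, 1234, 45678],
                     'column2': [123, 456, 789, 321, 654, 987, 1234, 45678]})

rows_per_group = 3
n_groups = len(data) // rows_per_group  # number of complete groups only

# Assumption: the trailing incomplete group is discarded.
# Each block of 3 rows is then flattened row by row into one feature vector.
features = (data.to_numpy()[:n_groups * rows_per_group]
                .reshape(n_groups, -1))

print(features.shape)  # (2, 6)
print(features)
```

The values never pass through str, so features keeps an integer dtype and each of its rows matches the corresponding complete row of the Series shown above.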