python dataframe classification

Perform classification in R or Python where each data frame is labelled


My problem is that, instead of each row in a data frame corresponding to a label, I have multiple data frames, each with the same columns and the same number of rows, and each data frame as a whole is labelled, say, l1, l2 or l3. You need all the data in a data frame to be able to determine its label.

For example, say I have the data frame below and it's labelled l1, and imagine I have many more labelled l1, l2 or l3. I need to create a classification model so that when I get a new data frame like this one, the model can classify it.


Time    Measure1  Measure2      
 1         10       1000  
 2         12       1245  
 3         20       1837  
 4         18       1837  

How can this be structured in R or Python?

I hope that's clear!


Solution

  • You have the correct idea: for a classification model to work, all the data for a single sample (here, one whole data frame) has to sit in a single row of the resulting dataframe. What you have in your example is some sort of cross table, but what you need is a flat table. Luckily, with pandas you can easily create the flat table by using unstack():

    >>> import pandas as pd
    >>> df = pd.DataFrame([[1, 10, 1000], [2, 12, 1245], [3, 20, 1837], [4, 18, 1837]],
    ...                   columns=['Time', 'Measure1', 'Measure2'])
    >>> s = df.set_index('Time').unstack()
    >>> s
              Time
    Measure1  1         10
              2         12
              3         20
              4         18
    Measure2  1       1000
              2       1245
              3       1837
              4       1837
    dtype: int64
    

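    The result is a pd.Series (a single column) with a MultiIndex whose levels are the measure name and the time. You can then add the label to the measurements and concatenate the Series of all samples into a single dataframe, one row per sample. The example below simply repeats the same sample three times for illustration; note that storing the string label next to the numbers turns the Series dtype into object.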

    >>> s['label'] = 'l1'
    >>> df = pd.concat([s,s,s], axis=1).T
    >>> df
         Measure1             Measure2                   label
    Time        1   2   3   4        1     2     3     4      
    0          10  12  20  18     1000  1245  1837  1837    l1
    1          10  12  20  18     1000  1245  1837  1837    l1
    2          10  12  20  18     1000  1245  1837  1837    l1
    

    Using a MultiIndex in the columns is a bit unwieldy, but you can flatten it into plain column names like this:

    >>> df.columns = ['_'.join(str(x) for x in c).strip('_') for c in df.columns]
    >>> df
      Measure1_1 Measure1_2 Measure1_3  ... Measure2_3 Measure2_4 label
    0         10         12         20  ...       1837       1837    l1
    1         10         12         20  ...       1837       1837    l1
    2         10         12         20  ...       1837       1837    l1
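
To apply the same idea to several data frames with different labels, the steps above could be wrapped in a small helper. The following is only a sketch under a few assumptions: `flatten_frame` is a hypothetical helper name, the second data frame and its label are made up for illustration, and the labels are kept in a separate Series `y` so that the feature table `X` stays numeric instead of becoming object dtype.

```python
import pandas as pd

def flatten_frame(frame, index_col="Time"):
    """Turn one (time x measure) data frame into a single flat row (a Series)."""
    s = frame.set_index(index_col).unstack()
    # Join the two index levels into flat names such as "Measure1_1".
    s.index = [f"{measure}_{t}" for measure, t in s.index]
    return s

# Made-up training data: (data frame, label) pairs.
base = pd.DataFrame({"Time": [1, 2, 3, 4],
                     "Measure1": [10, 12, 20, 18],
                     "Measure2": [1000, 1245, 1837, 1837]})
other = base.assign(Measure1=base["Measure1"] * 2)  # illustrative values only
labelled = [(base, "l1"), (other, "l2"), (base, "l1")]

# One row per data frame; the labels go into a separate Series.
rows = [flatten_frame(f) for f, _ in labelled]
X = pd.concat(rows, axis=1, ignore_index=True).T
y = pd.Series([label for _, label in labelled], name="label")

# A new, unlabelled data frame is flattened the same way and aligned
# to the training columns before it is handed to a model.
x_new = flatten_frame(base).reindex(X.columns).to_frame().T
```

`X` and `y` can then be passed to whatever classifier you prefer, for example the `fit(X, y)` and `predict(x_new)` methods of a scikit-learn estimator. This assumes every data frame has the same `Time` values, so that all rows end up with the same columns.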