My problem is that, instead of each row in a data frame corresponding to a label, I have multiple data frames, each with the same columns and number of rows, and each data frame as a whole is labelled, say, l1, l2 or l3. You need all the data in a data frame to be able to determine its label.
For example, say I have the data frame below and it's labelled l1, and imagine I have many more labelled l1, l2 or l3. I need to create a classification model so that when I get a new data frame like this, the model can classify it.
Time Measure1 Measure2
1 10 1000
2 12 1245
3 20 1837
4 18 1837
How can this be structured in R or Python?
I hope that's clear!
You have the correct idea: for a classification model to work, you need to have the data for a single sample in a single row of your resulting dataframe. What you have in your example is some sort of cross table, but what you need is a flat table. Luckily, with pandas you can easily create the flat table by using unstack():
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 10, 1000], [2, 12, 1245], [3, 20, 1837], [4, 18, 1837]],
...                   columns=['Time', 'Measure1', 'Measure2'])
>>> s = df.set_index('Time').unstack()
>>> s
          Time
Measure1  1         10
          2         12
          3         20
          4         18
Measure2  1       1000
          2       1245
          3       1837
          4       1837
dtype: int64
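The result is a pd.Series (= single column) with a MultiIndex. You can then add the label to the measurements and concatenate all data into a single dataframe. The example below repeats the same series three times as a stand-in for several different samples; in practice you would concatenate one series per data frame.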
>>> s['label'] = 'l1'
>>> df = pd.concat([s,s,s], axis=1).T
>>> df
     Measure1             Measure2                    label
Time        1   2   3   4        1     2     3     4
0          10  12  20  18     1000  1245  1837  1837    l1
1          10  12  20  18     1000  1245  1837  1837    l1
2          10  12  20  18     1000  1245  1837  1837    l1
Using a MultiIndex in the columns is a bit unwieldy, but you could replace it with flat string column names:
>>> df.columns = ['_'.join(str(x) for x in c).strip('_') for c in df.columns]
>>> df
  Measure1_1 Measure1_2 Measure1_3  ... Measure2_3 Measure2_4 label
0         10         12         20  ...       1837       1837    l1
1         10         12         20  ...       1837       1837    l1
2         10         12         20  ...       1837       1837    l1
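To apply this to many differently labelled data frames at once, the steps above could be wrapped in a small helper. This is only a sketch under some assumptions: the names flatten_frame and build_table are made up for illustration, the second data frame and its label are invented example data, and every frame is assumed to have the same Time values and measure columns.

```python
import pandas as pd


def flatten_frame(frame, label=None):
    """Turn one Time x Measure frame into a single flat row (a dict)."""
    # unstack() gives a Series indexed by (measure, time) pairs
    s = frame.set_index('Time').unstack()
    # Build flat names such as 'Measure1_1' directly, avoiding a MultiIndex
    row = {f"{measure}_{time}": value for (measure, time), value in s.items()}
    if label is not None:
        row['label'] = label
    return row


def build_table(labelled_frames):
    """labelled_frames: iterable of (DataFrame, label) pairs (hypothetical input format)."""
    return pd.DataFrame([flatten_frame(f, lab) for f, lab in labelled_frames])


# First frame is the one from the question; the second is invented for illustration
df_a = pd.DataFrame({'Time': [1, 2, 3, 4],
                     'Measure1': [10, 12, 20, 18],
                     'Measure2': [1000, 1245, 1837, 1837]})
df_b = pd.DataFrame({'Time': [1, 2, 3, 4],
                     'Measure1': [11, 15, 19, 25],
                     'Measure2': [900, 1100, 1500, 2000]})

table = build_table([(df_a, 'l1'), (df_b, 'l2')])
X = table.drop(columns='label')  # one row of features per original data frame
y = table['label']               # one label per original data frame
```

X and y then have the usual one-row-per-sample shape, so they can be handed to whatever classifier you prefer. A new, unlabelled data frame can be flattened the same way with flatten_frame(new_df) before prediction.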