Here are two examples of the many dataframes I have:
days | p1 | p2 | p3 |
---|---|---|---|
4 | 2.1 | 3.4 | 4.5 |
15 | 2.2 | 3.6 | 2.8 |
39 | 2.5 | 2.1 | 0.4 |
and this:
days | p1 | p2 | p3 |
---|---|---|---|
4 | 2.1 | 3.4 | 4.5 |
18 | 8.2 | 2.2 | 5.8 |
22 | 6.4 | 3.6 | 1.4 |
29 | 2.4 | 4.1 | 2.3 |
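For reference, this is roughly what the two examples look like when built as pandas DataFrames (a minimal sketch; `df_a` and `df_b` are just illustrative names):

```python
import pandas as pd

# First example: 3 rows
df_a = pd.DataFrame({
    "days": [4, 15, 39],
    "p1":   [2.1, 2.2, 2.5],
    "p2":   [3.4, 3.6, 2.1],
    "p3":   [4.5, 2.8, 0.4],
})

# Second example: same columns, but a different length and different days values
df_b = pd.DataFrame({
    "days": [4, 18, 22, 29],
    "p1":   [2.1, 8.2, 6.4, 2.4],
    "p2":   [3.4, 2.2, 3.6, 4.1],
    "p3":   [4.5, 5.8, 1.4, 2.3],
})
```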
I have around 1 million of these dataframes (same columns, differing lengths) and I want to select a subset of roughly 50,000 that fairly represents the full set. Basically, the subset should be representative enough that training an ML model on the 50k subset gives almost the same behaviour as training on the full 1 million.
The number of days is important: two dataframes with the same param (p) values but vastly different days columns are not equal.
My idea is to group the dataframes hierarchically, one variable per level, and then take one dataframe from each group at the bottom level:
Group Level 1 (GL1): group the dataframes by the number of rows.
Group Level 2 (GL2): within each GL1 group, cluster dataframes that have a similar days column (DBSCAN clustering?).
Group Level 3 (GL3): within each GL2 group, cluster dataframes with similar param values (DBSCAN clustering?).
Take one dataframe from each GL3 group to represent that group of dataframes (a rough sketch of this pipeline is shown below).
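To make the proposal concrete, here is a minimal sketch of the GL1 → GL2 → GL3 pipeline, assuming the dataframes are held in a Python list `dfs` and that the `eps` / `min_samples` values are placeholders you would tune for your data:

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import DBSCAN

def pick_representatives(dfs, days_eps=5.0, param_eps=1.0):
    # GL1: group dataframes by their number of rows.
    by_length = defaultdict(list)
    for i, df in enumerate(dfs):
        by_length[len(df)].append(i)

    representatives = []
    for n_rows, idxs in by_length.items():
        # GL2: cluster on the days column (vectors are equal length within a GL1 group).
        days = np.array([dfs[i]["days"].to_numpy() for i in idxs])
        gl2_labels = DBSCAN(eps=days_eps, min_samples=2).fit_predict(days)

        # Note: GL2 noise points (label -1) are treated as one pseudo-group in
        # this sketch; you may prefer to keep them as singletons instead.
        for gl2 in np.unique(gl2_labels):
            gl2_idxs = [idxs[j] for j in np.where(gl2_labels == gl2)[0]]

            # GL3: cluster on the flattened param values within the GL2 group.
            params = np.array([dfs[i][["p1", "p2", "p3"]].to_numpy().ravel()
                               for i in gl2_idxs])
            gl3_labels = DBSCAN(eps=param_eps, min_samples=2).fit_predict(params)

            # One representative per GL3 cluster; GL3 noise points are kept
            # individually so outliers are not silently dropped.
            for gl3 in np.unique(gl3_labels):
                members = [gl2_idxs[j] for j in np.where(gl3_labels == gl3)[0]]
                if gl3 == -1:
                    representatives.extend(members)
                else:
                    representatives.append(members[0])
    return representatives
```

The `eps` values at GL2 and GL3 effectively control how many final groups (and therefore representatives) you end up with, so they can be tuned until the output lands near the 50k target. Because each DBSCAN run only sees one GL1/GL2 subgroup, no single clustering call has to handle all 1 million dataframes at once.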
It may not capture the full max and min of each param, but this method seems fairly encompassing. Is this a good idea, or do you have any better ideas?
Your idea is sound; however, you could apply this approach: