python machine-learning cluster-analysis

How to group dataframes to get a subset that represents the full range of the larger set


Here are two examples from the large set of dataframes I have:

days   p1    p2    p3
4      2.1   3.4   4.5
15     2.2   3.6   2.8
39     2.5   2.1   0.4

and this:

days   p1    p2    p3
4      2.1   3.4   4.5
18     8.2   2.2   5.8
22     6.4   3.6   1.4
29     2.4   4.1   2.3

I have around 1 million of these dataframes (same columns, differing lengths) and I want to select a subset of about 50,000 that fairly represents all the different dataframes that exist. The subset should be a valid representation in the sense that training an ML model on the 50k subset should give almost the same behaviour as training it on the full 1 million.

The days column matters: two dataframes with the same param (p) values but vastly different days columns are not equal.

My idea is to group the dataframes hierarchically, by one variable at each level, then take one dataframe from each group at the bottom level.

Group Level 1 (GL1): group the dataframes by the number of rows.

Group Level 2 (GL2): For each dataframe in GL1, group dataframes that have a similar days column using clustering analysis (DBSCAN clustering?)

Group Level 3 (GL3): For each dataframe in GL2, group dataframes together with similar param values using clustering analysis (DBSCAN clustering?)

Take 1 dataframe from each GL3 group to represent that group of dataframes.
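In code, the first grouping level I have in mind looks roughly like this (a minimal sketch; the two small dataframes stand in for the real ~1 million):

```python
from collections import defaultdict

import pandas as pd

# Stand-ins for the real collection of dataframes
dfs = [
    pd.DataFrame({"days": [4, 15, 39],
                  "p1": [2.1, 2.2, 2.5],
                  "p2": [3.4, 3.6, 2.1],
                  "p3": [4.5, 2.8, 0.4]}),
    pd.DataFrame({"days": [4, 18, 22, 29],
                  "p1": [2.1, 8.2, 6.4, 2.4],
                  "p2": [3.4, 2.2, 3.6, 4.1],
                  "p3": [4.5, 5.8, 1.4, 2.3]}),
]

# GL1: bucket dataframes by their number of rows
gl1 = defaultdict(list)
for df in dfs:
    gl1[len(df)].append(df)

print(sorted(gl1.keys()))  # → [3, 4]
```

Each GL1 bucket would then be clustered further (GL2 on the days column, GL3 on the param values).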

It may not capture the full max and min of each param, but this method seems quite encompassing. Is this a good idea, or do you have any better ideas?


Solution

  • The idea is sound; here is how you could refine each step:

    1. Group Level 1 - Group by number of rows: This is straightforward and effective. It ensures that the sample includes dataframes of different lengths, which may represent different intervals or scales.
    2. Group Level 2 - Group by days using DBSCAN: DBSCAN is a good choice for the days column because it is robust to non-uniform data distributions and can find clusters of arbitrary shape.
    3. Group Level 3 - Group by parameter values using DBSCAN: Also a good choice; DBSCAN will capture clusters of similar parameter values. Feature scaling should be applied first (e.g. MinMaxScaler) so that p1, p2, and p3 have equal influence on the clustering; note that DBSCAN is sensitive to the scale of the input data.
    4. Group Level 4 - Sampling from GL3 groups: Use random sampling within each GL3 group to obtain broad representation. To avoid oversampling certain clusters, consider a diversity criterion: stratify on additional metadata, for example the average or range of days within each GL3 group.
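Putting the steps together, here is a sketch of the pipeline for a single GL1 bucket. The synthetic data, the `eps`/`min_samples` values, and the use of `StandardScaler` (rather than MinMaxScaler) are illustrative assumptions you would tune for your real data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_df(days, params):
    p = np.asarray(params)
    return pd.DataFrame({"days": days, "p1": p[:, 0], "p2": p[:, 1], "p3": p[:, 2]})

# A GL1 bucket: dataframes that all have 3 rows (synthetic example data)
bucket = [make_df([4, 15, 39], rng.uniform(0, 5, (3, 3))) for _ in range(20)]
bucket += [make_df([100, 200, 300], rng.uniform(0, 5, (3, 3))) for _ in range(20)]

# GL2: cluster on the days vectors, scaled so DBSCAN's eps is comparable per column
days_mat = np.array([df["days"].to_numpy() for df in bucket])
days_scaled = StandardScaler().fit_transform(days_mat)
gl2_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(days_scaled)

representatives = []
for gl2 in set(gl2_labels):
    idx = np.flatnonzero(gl2_labels == gl2)
    # GL3: within a days-cluster, cluster on the flattened, scaled parameter values
    param_mat = np.array(
        [bucket[i][["p1", "p2", "p3"]].to_numpy().ravel() for i in idx]
    )
    param_scaled = StandardScaler().fit_transform(param_mat)
    gl3_labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(param_scaled)
    for gl3 in set(gl3_labels):
        # Label -1 marks DBSCAN noise; for brevity it is treated here as one
        # catch-all group, but you may prefer to keep every noise point
        members = idx[gl3_labels == gl3]
        # GL4: one randomly sampled representative per GL3 group
        representatives.append(bucket[rng.choice(members)])

print(len(representatives), "representatives from", len(bucket), "dataframes")
```

With the two very different day schedules in the synthetic bucket, GL2 yields two clusters, and the final number of representatives depends on how the random parameter values cluster at GL3.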