pythonpandasdataframenumpysample

Most efficient way to conditionally sample my large df


I have a large DF (~35 million rows) and I am trying to create a new df by randomly sampling two rows from each unique cluster ID (~1.8 million unique cluster IDs)-- one row must have a label 0 and one row must have a label 1 (sometimes there is only one label, so I must check first if both labels are present within the cluster). For reference, my dataset has 3 main cols: 'embeddings', 'cluster_ID', 'label'.

I'm finding that this process takes way more time than I anticipated (more than 15 hours), and I want to know if there is a way to optimize my code.

Please NOTE: My original df only has 2 labels: 1 and 0. I should end up with more than 1.8 million rows. I would like to sample 2 rows for each cluster (one label 1 and other is label 0). Some clusters have a mixed bag of rows with both labels-- but some only have rows with label 1. If the former is the case: I want 2 rows from that cluster (one for each label)-- if the latter is the case: I only want one row from that cluster. Therefore: I should end up with between 1.8-3.6 million rows.

import pandas as pd
import random

# Create an empty list to store selected rows
selected_rows = []

# Iterate over unique cluster IDs
for cluster_id in result_df_30['cluster_ID'].unique():
    # Filter the DataFrame for the current cluster ID
    df_cluster = result_df_30[result_df_30['cluster_ID'] == cluster_id]
    
    # Filter rows with label 0 and 1
    df_label_0 = df_cluster[df_cluster['label'] == 0]
    df_label_1 = df_cluster[df_cluster['label'] == 1]
    
    # Sample rows if they exist
    if not df_label_0.empty:
        sample_label_0 = df_label_0.sample(n=1, random_state=42)
        selected_rows.append(sample_label_0)
    if not df_label_1.empty:
        sample_label_1 = df_label_1.sample(n=1, random_state=42)
        selected_rows.append(sample_label_1)

# Concatenate the selected rows into a single DataFrame
selected_rows_df = pd.concat(selected_rows)

selected_rows_df


Solution

  • Updated requirements: sample 1 row from each cluster-label combination

    sample = (
        df.sample(frac=1)
        .groupby(["cluster_ID", "label"], as_index=False)
        .first()
    )
    

    Original requirements: sample only from clusters with both 0 and 1 labels

    This requires that your dataframe be sequentially indexed (0, 1, ..., 35m). That is super easy with df = df.reset_index(drop=True).

    # Some test data
    n = 35_000_000
    df = pd.DataFrame(
        {
            "embeddings": np.random.rand(n),
            "cluster_ID": np.random.randint(0, 1_800_000, n),
            "label": np.random.randint(0, 2, n),
        }
    )
    
    tmp = (
        # since you only care about rows with label 0 or 1
        df[df["label"].isin([0, 1])]  
        # shuffle the rows. This is where the randomness comes from
        .sample(frac=1) 
        # reset the index
        .reset_index()  
        # for each Cluster ID and label, get the number of the first row
        .pivot_table(index="cluster_ID", columns="label", values="index", aggfunc="first")
    )
    
    # Filter for clusters that have both 0 and 1, then get their row numbers
    idx = tmp[tmp.notna().all(axis=1)].to_numpy("int").flatten()
    
    # and your random sample
    sample = df.iloc[idx]