Tags: python, pandas, large-data, pyarrow, huggingface-datasets

How to randomly sample a very large PyArrow dataset


I have a very large Arrow dataset (181 GB, 30M rows) from the Hugging Face framework I've been using. I want to randomly sample 100 rows with replacement, 20 times, but after looking around I cannot find a clear way to do this. I tried converting to a pd.DataFrame so that I could use df.sample(), but Python crashes every time (presumably because the dataset is too large). So I'm looking for something built into pyarrow.

    df = Dataset.from_file("embeddings_job/combined_embeddings_small/data-00000-of-00001.arrow")
    df = df.to_table().to_pandas()  # crashes at this line
    random_sample = df.sample(n=100)

Some ideas. Using random NumPy indices (I'm not sure whether this samples with replacement):

    import numpy as np
    random_indices = np.random.randint(0, len(df), size=100)

    # Take the samples from the dataset
    sampled_table = df.select(random_indices)

Another idea, using the Hugging Face shuffle():

    sample_size = 100
    # Shuffle the dataset
    shuffled_dataset = df.shuffle()
    
    # Select the first 100 rows
    sampled_dataset = shuffled_dataset.select(range(sample_size))

Is the only other way through terminal commands? Would this be correct:

    for i in {1..30}; do shuf -n 1000 -r file > sampled_$i.txt; done

After getting each chunk, the plan is to run each chunk through a random forest algorithm. What is the best way to go about this?

Also, I would like to note that any solution should make sure the original row indices are not reset in the output subset.


Solution

  • A bit late, but I just had to write a function to randomly sample a pyarrow Table. It produces the sample directly from the pyarrow Table, without converting to a pandas DataFrame.

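    import random
    from typing import Optional

    import pyarrow as pa

    def sample_table(table: pa.Table, n_sample_rows: Optional[int] = None) -> pa.Table:
        # Nothing to sample if no size is given or the table is not larger than the sample.
        if n_sample_rows is None or n_sample_rows >= table.num_rows:
            return table
        # random.sample picks distinct row indices, i.e. it samples without replacement.
        indices = random.sample(range(table.num_rows), k=n_sample_rows)

        return table.take(indices)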