python, numpy, csv, parquet, hdf5

Python large-scale data format for distributed writing, reading, and storing on AWS


I'm trying to figure out the best way to write, read, and store data on AWS using a conventional Python data format. From my Googling, I wasn't able to find a conclusive comparison of Big-O run times for append operations across the popular formats. The best thing I could find was a textual overview of the popular file formats, where the author concludes that no single format covers all use cases.

My use case is storing one embedding (a 200-dimensional vector) per sample, keyed by a sample ID, and then looking embeddings up individually by that key.

The datasets I'm processing will probably have somewhere between 2 million and 8 million rows (meaning there will be 2M to 8M keys). The file formats I'm considering right now are NPY, CSV, Parquet, and HDF5.

In my current workflow, I save individual NPY files, which is really a workaround for the distributed nature of my problem. But transferring all the data to and from AWS becomes cumbersome when I have to do recursive uploads/downloads.

Would appreciate any advice on the best approach here! Similarly, I'd also be curious what the best format would be if the data I wanted to save were a JSON object.


Solution

  • Here's what I would suggest:

    When creating your embeddings, use your current solution, and create lots of little .npy files.

    Once you've created your embeddings, and you know how many there are, pack them into two files (a packing sketch appears at the end of this answer).

    The first file is a .npy file with an array containing every embedding, of shape (N, 200). With memory mapping, you can create arrays much bigger than memory.

    The second file is a shelf (Python's shelve module). It maps each "sample_id" key to an integer, and that integer is the embedding's row index in the array.

    When you want to find an embedding, do this:

    array[shelf[sample_id]]
    

    i.e. look up the sample_id in the shelf to get the row index, then use that index to pull the embedding out of the array.

    If you memory-map the .npy file and use the shelf, you can load single embeddings without ever holding the entire dataset in memory (both steps are sketched below). However, this still requires downloading the entire file from S3 before you can use it.
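
    Here's a minimal sketch of the packing step, assuming the little .npy files sit in an embeddings/ directory and are named <sample_id>.npy with one 200-dimensional vector each (the directory layout, file naming, and float32 dtype are assumptions, not details from the question). It uses numpy.lib.format.open_memmap to build the (N, 200) array on disk and shelve for the key-to-index map:

    import glob
    import os
    import shelve

    import numpy as np

    EMBED_DIM = 200  # assumed embedding width

    # Hypothetical layout: embeddings/<sample_id>.npy, one vector per file.
    npy_paths = sorted(glob.glob("embeddings/*.npy"))
    n = len(npy_paths)

    # open_memmap writes a valid .npy header, so the packed file can later be
    # reopened with np.load(..., mmap_mode="r") without loading it into RAM.
    packed = np.lib.format.open_memmap(
        "embeddings_packed.npy", mode="w+", dtype=np.float32, shape=(n, EMBED_DIM)
    )

    with shelve.open("sample_index") as index:
        for i, path in enumerate(npy_paths):
            sample_id = os.path.splitext(os.path.basename(path))[0]
            packed[i] = np.load(path)  # copy one embedding into row i
            index[sample_id] = i       # shelf: sample_id -> row index

    packed.flush()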
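
    And a sketch of the lookup path: np.load with mmap_mode="r" maps the packed file instead of reading it, so only the rows you index are actually paged in. The file names and the "some_sample_id" key are the same assumptions as above:

    import shelve

    import numpy as np

    # Map the packed array lazily; nothing is read until rows are indexed.
    embeddings = np.load("embeddings_packed.npy", mmap_mode="r")

    with shelve.open("sample_index") as index:
        row = index["some_sample_id"]       # hypothetical key
        vector = np.array(embeddings[row])  # copies just one 200-dim row into RAM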