Tags: google-cloud-platform, filesystems, gcsfuse, distributed-filesystem, google-genomics

Access random line in large file on Google Cloud Storage


I'm trying to read a random line out of a large file stored in a public cloud storage bucket.

My understanding is that I can't do this with gsutil, and I've looked into Cloud Storage FUSE but am not sure it fits my use case: https://cloud.google.com/storage/docs/gcs-fuse

There are many files, each ~50GB, for a total of several terabytes. If possible I would like to avoid downloading them. They are all plain-text files -- you can see them here: https://console.cloud.google.com/storage/browser/genomics-public-data/linkage-disequilibrium/1000-genomes-phase-3/ldCutoff0.4_window1MB

It would be great if I could simply get a filesystem handle via FUSE so I could feed the data directly into other scripts -- but I am okay with rewriting them to read line by line if that is what is necessary. The key thing is: under no circumstances should the interface download the entire file.


Solution

  • The Range header allows you to download a specific byte range from within a file using the XML API.
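
    As a rough sketch, such a range request can be made against a public object with nothing but Python's standard library (the object path below is a placeholder -- substitute one of the files from the question's bucket):

    ```python
    import urllib.request

    GCS_XML_ENDPOINT = "https://storage.googleapis.com"  # XML API base URL

    def build_range_request(bucket, obj, start, end):
        """Build a GET request for bytes [start, end] (inclusive) of a
        public GCS object, using the XML API's Range header support."""
        url = f"{GCS_XML_ENDPOINT}/{bucket}/{obj}"
        return urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end}"})

    def fetch_byte_range(bucket, obj, start, end):
        """Download only the requested byte range (needs network access)."""
        req = build_range_request(bucket, obj, start, end)
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    # Example (hypothetical object name -- pick a real one from the bucket):
    # chunk = fetch_byte_range("genomics-public-data",
    #                          "linkage-disequilibrium/some-file.txt",
    #                          0, 1023)
    ```

    Only the requested bytes cross the network; the server answers with `206 Partial Content` rather than the whole object.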

    There is no direct way to retrieve a specific line, as GCS doesn't know where in the file any given line begins/ends. Tools to find a specific line generally read a whole file in order to count line-breaks to find the desired line.

    If the file has line numbers in it, you could do a binary search for the desired line: request a small chunk, check the line number it lands on, and then try a different location based on the result until you find the desired line.
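
    A minimal sketch of that binary search, assuming each line begins with an ascending number followed by a tab, and that `fetch(start, end)` is any callable returning an inclusive byte range (for example, a wrapper around the Range requests above) -- the function names here are illustrative, not from any library:

    ```python
    def read_line_after(fetch, size, pos, chunk=1024):
        """Return (start, line) for the first complete line beginning at
        or after byte `pos`, reading only small chunks via `fetch`."""
        if pos == 0:
            start = 0
        else:
            off = pos
            while True:  # scan forward for the newline ending the current line
                buf = fetch(off, min(off + chunk - 1, size - 1))
                i = buf.find(b"\n")
                if i >= 0:
                    start = off + i + 1
                    break
                off += len(buf)
                if off >= size:
                    return size, b""  # `pos` is inside the final line
        data, off = b"", start
        while off < size:  # accumulate until this line's own newline (or EOF)
            buf = fetch(off, min(off + chunk - 1, size - 1))
            data += buf
            j = data.find(b"\n")
            if j >= 0:
                return start, data[:j]
            off += len(buf)
        return start, data

    def find_numbered_line(fetch, size, target):
        """Binary-search byte offsets for the line whose leading
        '<number><tab>' field equals `target` (numbers assumed ascending)."""
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            start, line = read_line_after(fetch, size, mid)
            if not line:          # ran off the end: answer is before mid
                hi = mid
                continue
            n = int(line.split(b"\t", 1)[0])
            if n == target:
                return line.decode()
            if n < target:
                lo = start + len(line) + 1  # resume just past this line
            else:
                hi = mid
        raise KeyError(target)
    ```

    Each probe costs one or two small range requests, so locating a line in a ~50GB file takes on the order of log2(file size) round trips rather than a full download.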

    If the file doesn't have line numbers, you could do some pre-processing to make this possible. Before the initial upload, scan the file and record the byte offset of every Nth line. To get the desired line later, look up the nearest offset in that index and make a range request for the relevant section.
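
    A sketch of that two-step scheme -- the helper names are made up for illustration, and `fetch(start, end)` again stands in for any inclusive byte-range downloader. The index is small (one offset per N lines) and can be stored alongside the file:

    ```python
    import io

    def build_line_index(stream, every=1000):
        """One-time pre-processing pass (run before upload): record the
        byte offset of every `every`-th line, plus a final end sentinel."""
        index, offset = [0], 0
        for i, line in enumerate(iter(stream.readline, b"")):
            offset += len(line)
            if (i + 1) % every == 0:
                index.append(offset)
        index.append(offset)  # end-of-data sentinel closes the last block
        return index

    def get_line(fetch, index, lineno, every=1000):
        """Fetch 0-based line `lineno` with a single range request,
        using the sparse offset index instead of downloading the file."""
        block = lineno // every
        start, end = index[block], index[block + 1]
        data = fetch(start, end - 1)  # inclusive byte range for the block
        return data.split(b"\n")[lineno - block * every].decode()
    ```

    Each lookup then downloads at most N lines' worth of bytes -- tune `every` to trade index size against per-request transfer.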