
Concat Avro files in Google Cloud Storage


I have some large .avro files in Google Cloud Storage and I want to concatenate all of them into a single file.

I found this command:

java -jar avro-tools.jar concat

However, because my files live under a Google Cloud Storage path (gs://files.avro), I can't concatenate them with avro-tools directly. Any suggestions on how to solve this?


Solution

  • You can use the gsutil compose command. For example:

    gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
    

    Note: For extremely large files and/or very low per-machine bandwidth, you may want to split the file and upload it from multiple machines, and later compose these parts of the file manually.

    In my case, I tested it with two files: foo.txt contains the word Hello and bar.txt contains the word World. I ran this command:

    gsutil compose gs://bucket/foo.txt gs://bucket/bar.txt gs://bucket/baz.txt
    

    baz.txt then contains:

    Hello
    World
    

    Note: GCS does not support inter-bucket composing, so the source objects and the composite must all be in the same bucket.

    If you encounter an exception related to integrity checks, run gsutil help crcmod for instructions on how to fix it.
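
One caveat for the Avro case in the question: compose joins objects byte for byte, while every Avro container file starts with its own header (schema and sync marker), so a composite of several .avro objects may not be readable as a single Avro file. A minimal sketch of an alternative, assuming there is enough local disk space for the files: download the objects, merge them with the avro-tools concat command from the question, and upload the result. The function name, the paths in the usage line, and the assumption that avro-tools.jar sits in the current directory are illustrative and not from the original post; concat also expects the inputs to share the same schema and codec.

```shell
# Sketch only: merge Avro objects from GCS with avro-tools, then upload.
# Assumptions: gsutil and java are on PATH, avro-tools.jar is in the
# current directory, and both arguments are placeholder GCS paths.
concat_avro_from_gcs() {
  src_pattern=$1   # e.g. 'gs://bucket/data/*.avro' (hypothetical)
  dest_object=$2   # e.g. gs://bucket/data/merged.avro (hypothetical)

  # Work in a throwaway directory so nothing is left behind.
  workdir=$(mktemp -d) || return 1
  mkdir "$workdir/in" || { rm -rf "$workdir"; return 1; }

  # Download the sources, merge them locally, upload the merged file.
  # The chain stops at the first step that fails.
  gsutil -m cp "$src_pattern" "$workdir/in/" &&
    java -jar avro-tools.jar concat "$workdir"/in/*.avro "$workdir/merged.avro" &&
    gsutil cp "$workdir/merged.avro" "$dest_object"
  status=$?

  # Clean up regardless of success and report the chain's exit status.
  rm -rf "$workdir"
  return "$status"
}
```

A call might look like concat_avro_from_gcs 'gs://bucket/data/*.avro' gs://bucket/data/merged.avro, with the wildcard quoted so that gsutil, not the local shell, expands it.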