google-cloud-storage, google-genomics

How to compress a list of files into a single gzip file using elasticluster, grid-engine-tools, and Google Cloud


I want to start by thanking you all for your help ahead of time, as this will help clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip archive; however, the guide only shows how to compress a list of files as individual gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If there is some extra info, please include links to sources.)

After I had set up the grid engine, I ran through the samples in the guide.

Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?

Are there any solutions in the Elasticluster Grid Engine setup for compressing multiple files into one gzip archive?

What changes can be made to grid-engine-tools to make this work?

EDIT

The reason we are considering a cluster is that we expect multiple operations to occur simultaneously, each zipping up the files for an order. These will occur systematically, so that a vendor can download a single compressed file per order.


Solution

  • In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.

    The tar utility will create an archive file, and it can compress it as well. For example:

    $ # Create a directory with a few input files
    $ mkdir myfiles
    $ echo "This is file1" > myfiles/file1.txt
    $ echo "This is file2" > myfiles/file2.txt
    
    $ # (C)reate a compressed archive
    $ tar cvfz archive.tgz myfiles/*
    a myfiles/file1.txt
    a myfiles/file2.txt
    
    $ # (V)erify the archive
    $ tar tvfz archive.tgz 
    -rw-r--r--  0 myuser mygroup      14 Jul 20 15:19 myfiles/file1.txt
    -rw-r--r--  0 myuser mygroup      14 Jul 20 15:19 myfiles/file2.txt
    

    To extract the contents use:

    $ # E(x)tract the archive contents
    $ tar xvfz archive.tgz 
    x myfiles/file1.txt
    x myfiles/file2.txt
    

    UPDATE:

    In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If the frequency with which results need to be tarred is low, and providing the tarred results is not extremely time-sensitive, then you could likely do this with a single node.
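
    For example, a minimal single-node sketch might look like the following (the bucket and order names here are placeholders, not anything from your setup):

    $ # Copy the completed order's files down from Cloud Storage
    $ mkdir order-1234
    $ gsutil -m cp gs://my-bucket/orders/order-1234/* order-1234/
    
    $ # Bundle them into a single compressed archive
    $ tar cvfz order-1234.tgz order-1234/
    
    $ # Push the archive back to Cloud Storage for the vendor to download
    $ gsutil cp order-1234.tgz gs://my-bucket/archives/order-1234.tgz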

    However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.

    Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.

    A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
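
    As a very rough sketch, launching such a per-order pipeline might look something like the following (the pipeline file name, parameter names, and bucket paths are placeholders, and the exact flags should be verified against "gcloud alpha genomics pipelines run --help" and the examples below):

    $ # Launch a single-task pipeline when an order completes; the pipeline
    $ # file supplies the tar command line to run on the VM
    $ gcloud alpha genomics pipelines run \
        --pipeline-file tar-order.yaml \
        --inputs INPUT_FILES=gs://my-bucket/orders/order-1234/* \
        --outputs OUTPUT_FILE=gs://my-bucket/archives/order-1234.tgz \
        --logging gs://my-bucket/logs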

    There is an example that does something similar here:

    https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress

    This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
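
    For example, the per-file gzip step could be replaced with a single tar invocation over whatever directory the pipeline localizes the inputs to (the directory variables below are placeholders, not names taken from the example):

    $ # Bundle the whole localized input directory into one archive
    $ # instead of gzipping each file individually
    $ tar cvfz "${OUTPUT_DIR}/order.tgz" -C "${INPUT_DIR}" .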

    Take a look at the https://github.com/googlegenomics/pipelines-api-examples GitHub repository for more information and examples.

    -Matt