google-bigquery google-genomics

How is it possible to export a Cloud Genomics variantset to BigQuery now that variantsets.export has been deprecated?


I have loaded a variantset into Cloud Genomics and am attempting to export it to BigQuery. The first approach I tried was to use a pipeline as detailed here:

https://cloud.google.com/genomics/docs/how-tos/load-variants

However, 20 minutes into the process, it failed. According to Stackdriver Error Reporting, the problem appears to be in the VCF file, though I am at a loss as to how it might be fixed:

ValueError: Invalid record in VCF file. Error: list index out of range
at next (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:476)
at read_records (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:398)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:48)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:44)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:39)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:38)
at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:167)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:609)

So I continued to search for other options. I turned to the API:

https://cloud.google.com/genomics/reference/rest/v1/variantsets/export

I made sure that my account was a BigQuery admin and an owner of the Genomics variantset. I used the following parameters:

{
  "projectId": "my-project",
  "format": "FORMAT_BIGQUERY",
  "bigqueryDataset": "my_dataset",
  "bigqueryTable": "new_table"
}

Upon submitting, I receive the following error:

{
  "error": {
    "code": 500,
    "message": "Unknown Error.",
    "status": "UNKNOWN"
  }
}

I have also tried this from the command line:

gcloud alpha genomics variantsets export variantset_id bigquery_table --bigquery-dataset=my-dataset --bigquery-project=my-project

But that gives me a 500 Unknown Error as well. I've been going back and forth on this for several hours, and the documentation is quite sparse.

Please, what could I be missing?


Solution

  • It looks like one or more lines in the VCF file are malformed and do not conform to the spec.

    We just released a preprocessor/validator tool that shows a report of all such malformed records. Please give it a try: https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/vcf_files_preprocessor.md (please run with --report_all_conflicts to ensure you get the full report).

    If it turns out that only a few records are malformed, you can either fix them manually in the VCF file or run the vcf_to_bq pipeline with --allow_malformed_records, which skips the malformed records (only logging them) and loads the rest.
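
    To get a quick feel for what a malformed record can look like before running the full preprocessor, here is a minimal Python sketch. It is not the preprocessor tool and does not reproduce its checks; the function name find_malformed_records and the three checks it performs (fewer than the 8 fixed VCF columns, a column count that differs from the #CHROM header line, a non-integer POS) are assumptions chosen for illustration. A record that is missing fields is one plausible way to end up with a "list index out of range" error, but the real cause in your file may be different.

```python
MIN_VCF_COLUMNS = 8  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO


def find_malformed_records(lines):
    """Return (line_number, reason) pairs for records that look malformed.

    Hypothetical helper for illustration only; it covers a few simple
    structural checks, not the full VCF spec.
    """
    problems = []
    expected_columns = None
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            # The header line fixes how many columns each record should have.
            expected_columns = len(line.split("\t"))
            continue
        fields = line.split("\t")
        if len(fields) < MIN_VCF_COLUMNS:
            problems.append((line_number, "only %d columns" % len(fields)))
        elif expected_columns is not None and len(fields) != expected_columns:
            problems.append(
                (line_number,
                 "expected %d columns, found %d" % (expected_columns, len(fields))))
        elif not fields[1].isdigit():
            problems.append(
                (line_number, "POS is not an integer: %r" % fields[1]))
    return problems


if __name__ == "__main__":
    # Small in-memory sample; for a real file, pass an open file object instead.
    sample = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
        "chr1\t100\t.\tA\tT\t50\tPASS\tDP=10\tGT\t0/1\n",
        "chr1\t200\t.\tG\n",
        "chr1\tabc\t.\tG\tC\t50\tPASS\tDP=10\tGT\t0/1\n",
    ]
    for line_number, reason in find_malformed_records(sample):
        print("line %d: %s" % (line_number, reason))
```

    For an uncompressed VCF on disk, something like `with open("my.vcf") as f: find_malformed_records(f)` would work the same way (the file name is a placeholder). The preprocessor report remains the authoritative check.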