I have loaded a variantset into Cloud Genomics and am attempting to export it to BigQuery. The first approach I tried was to use a pipeline as detailed here:
https://cloud.google.com/genomics/docs/how-tos/load-variants
However, 20 minutes into the process, it failed. According to Stackdriver Error Reporting, the problem appears to be in the VCF file, though I am at a loss as to how to fix it:
ValueError: Invalid record in VCF file. Error: list index out of range
at next (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:476)
at read_records (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:398)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:48)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:44)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:39)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:38)
at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:167)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:609)
So I continued to search for other options. I turned to the API:
https://cloud.google.com/genomics/reference/rest/v1/variantsets/export
I made sure that my account was a BigQuery admin and an owner of the Genomics variantset. I used the following parameters:
{
"projectId": "my-project",
"format": "FORMAT_BIGQUERY",
"bigqueryDataset": "my_dataset",
"bigqueryTable": "new_table"
}
Upon submitting, I receive the following error:
{
"error": {
"code": 500,
"message": "Unknown Error.",
"status": "UNKNOWN"
}
}
I have also tried this from the command line:
gcloud alpha genomics variantsets export variantset_id bigquery_table --bigquery-dataset=my-dataset --bigquery-project=my-project
But that gives me a 500 Unknown Error as well. I've been going back and forth on this for several hours, and the documentation is quite sparse.
Please, what could I be missing?
It looks like one or more lines in the VCF file are malformed and do not conform to the spec.
We just released a preprocessor/validator tool that reports all such malformed records. Please give it a try: https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/vcf_files_preprocessor.md (run it with --report_all_conflicts to ensure you get the full report).
If it turns out that only a few records are malformed, you can either fix them manually in the VCF file or run the vcf_to_bq pipeline with --allow_malformed_records, which skips the malformed records (it only logs them) and loads the rest.
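As a rough first pass before running the full validator, a short script can flag records whose shape is obviously wrong. The sketch below is not the gcp-variant-transforms preprocessor and does not reproduce its checks. It assumes that a "list index out of range" error may come from a record with too few tab-separated columns, which might not be the actual cause here. The function names find_suspect_records and scan_file are hypothetical.

```python
import gzip

# The VCF spec requires these eight fixed columns on every data line:
# CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
FIXED_COLUMNS = 8


def find_suspect_records(lines):
    """Yield (line_number, reason) for data lines whose shape looks wrong.

    This only checks column counts and the POS field; it is a heuristic,
    not a full VCF validator.
    """
    expected = None  # column count taken from the #CHROM header, once seen
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            yield number, "empty line"
            continue
        if line.startswith("##"):
            continue  # meta-information line
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        if line.startswith("#"):
            continue  # any other comment-style line
        fields = line.split("\t")
        if len(fields) < FIXED_COLUMNS:
            yield number, "only %d tab-separated fields" % len(fields)
        elif expected is not None and len(fields) != expected:
            yield number, "%d fields, header declares %d" % (len(fields), expected)
        elif not fields[1].isdigit():
            yield number, "POS is not an integer: %r" % fields[1]


def scan_file(path):
    """Print every suspect record in a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for number, reason in find_suspect_records(handle):
            print("line %d: %s" % (number, reason))
```

Calling scan_file("my_variants.vcf") (a placeholder file name) prints one line per suspect record. If nothing is reported, the malformed records are probably of a kind this check does not cover, and the preprocessor report is the better guide.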