I've designed a data transformation in Dataprep and am now attempting to run it by using the template in Dataflow. My flow has several inputs and outputs - the Dataflow template provides them as a JSON object with a key/value pair for each input and location. They look like this (line breaks added for easy reading):
{
"location1": "project:bq_dataset.bq_table1",
#...
"location10": "project:bq_dataset.bq_table10",
"location17": "project:bq_dataset.bq_table17"
}
I have 17 inputs (mostly lookups) and 2 outputs (one CSV, one BigQuery). I'm passing these to the gcloud CLI like this:
gcloud dataflow jobs run job-201807301630 \
--gcs-location=gs://bucketname/dataprep/dataprep_template \
--parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}
But I'm getting an error:
ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv
From the error message, it looks like the inputs and outputs are being merged: because I have two outputs, each consecutive pair of inputs is matched up with the two outputs:
input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...
I've tried quoting the input/output objects (single and double, plus removing the quotes in the object), wrapping them in [], using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?
I finally found a solution for this via a huge process of trial and error. There are several steps involved.
--parameters
The --parameters argument is a dictionary-type argument. There are details on these in a document you can read by typing gcloud topic escaping in the CLI, but in short it means you'll need an = between --parameters and the arguments, and then the format is key=value pairs with the value enclosed in quote marks ("):
--parameters=inputLocations="object",outputLocations="object"
Then, the quotes inside the objects need escaping to avoid ending the value prematurely, so
{"location1":"gs://bucket/whatever"...
becomes
{\"location1\":\"gs://bucket/whatever\"...
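A quick way to check the escaping is to print the value in the shell before handing it to gcloud. This is only a sketch, assuming a POSIX-style shell such as bash; the bucket path is a placeholder. The shell strips the outer quotes and the backslashes, so the receiving program should see plain JSON.

```shell
# The outer double quotes delimit the value; \" keeps each inner quote literal.
value="{\"location1\":\"gs://bucket/whatever\"}"

# Print what a program such as gcloud would receive after the shell
# has removed the outer quotes and the backslashes.
printf '%s\n' "$value"
```

If the printed line is not valid JSON, the escaping has gone wrong somewhere.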
Next, the CLI gets confused because while the key=value pairs are separated by a comma, the values also have commas in the objects. So you can define a different separator by putting it between carets (^) at the start of the argument and between the key=value pairs:
--parameters=^*^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"
I used * because ; didn't work, probably because the shell treats an unquoted ; as the end of the command.
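As a rough sketch of the separator trick, the value can be built up in shell variables and printed for inspection first. This assumes a POSIX-style shell; the second table name and the variable names are made up for illustration.

```shell
# Hypothetical table and bucket names, for illustration only.
inputs="{\"location1\":\"project:bq_dataset.bq_table1\",\"location2\":\"project:bq_dataset.bq_table2\"}"
outputs="{\"location1\":\"gs://bucketname/output.csv\"}"

# ^*^ asks gcloud to split the key=value pairs on * instead of on commas,
# so the commas inside the JSON objects are left alone.
params="^*^inputLocations=${inputs}*outputLocations=${outputs}"

# Print the finished flag to check it before using it.
printf '%s\n' "--parameters=${params}"
```

Keeping the whole value inside double quotes also stops the shell from treating * as a filename wildcard.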
Note also that the gcloud topic escaping info says:
In cmd.exe and PowerShell on Windows, ^ is a special character and you must escape it by repeating it. In the following examples, every time you see ^, replace it with ^^^^.
customGcsTempLocation
After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again:
...}*customGcsTempLocation="gs://bucket/whatever"
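Putting the pieces together, the whole command might look roughly like the sketch below. It assumes a POSIX-style shell and cuts the flow down to one input and one output; the job name, template path and table names come from the question, while the temp path and variable names are placeholders. It runs as a dry run that prints the arguments rather than submitting anything.

```shell
# Values taken from the question; the temp location is a placeholder.
job_name="job-201807301630"
template="gs://bucketname/dataprep/dataprep_template"
inputs="{\"location1\":\"project:bq_dataset.bq_table1\"}"
outputs="{\"location1\":\"gs://bucketname/output.csv\"}"
temp_location="gs://bucket/whatever"

# All three key=value pairs, separated by the custom * delimiter.
params="^*^inputLocations=${inputs}*outputLocations=${outputs}*customGcsTempLocation=${temp_location}"

# Dry run: print each argument on its own line instead of calling gcloud.
# Remove `printf '%s\n'` to submit the job for real.
printf '%s\n' gcloud dataflow jobs run "$job_name" \
  "--gcs-location=${template}" \
  "--parameters=${params}"
```

On Windows the ^ characters would still need the extra escaping mentioned in the gcloud topic escaping note.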
Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.