The code reads a CSV of 628,360 rows from GCS, applies a transformation to the resulting DataFrame with the withColumn method, and writes the output to a partitioned BigQuery table, roughly as sketched below.
Despite this simple workflow, the job took 19h 42min to complete. What can I do to make it faster?
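This is the overall shape of the job; the bucket, file path, table name, and partition column below are placeholders rather than the real values:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-bigquery").getOrCreate()

# 1. Read the ~628,360-row CSV from GCS
df = spark.read.option("header", "true").csv("gs://my-bucket/input/data.csv")

# 2. Column-level transformations with withColumn (the deidentifier UDF is shown further down)
transformed_df = df

# 3. Write to a partitioned BigQuery table with the spark-bigquery connector
(
    transformed_df.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for the indirect write
    .option("partitionField", "partition_col")       # the table is partitioned on this column
    .mode("append")
    .save("my-project.my_dataset.my_table")
)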
I am using an autoscaling policy, and I know the cluster is not scaling up because there is no YARN pending memory, as you can see in the following screenshot.
The configuration of the cluster is the following:
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT_ID_PROCESSING \
--region $REGION \
--image-version 2.0-ubuntu18 \
--num-masters 1 \
--master-machine-type n2d-standard-2 \
--master-boot-disk-size 100GB \
--confidential-compute \
--num-workers 4 \
--worker-machine-type n2d-standard-2 \
--worker-boot-disk-size 100GB \
--secondary-worker-boot-disk-size 100GB \
--autoscaling-policy $AUTOSCALING_POLICY \
--secondary-worker-type=non-preemptible \
--subnet $SUBNET \
--no-address \
--shielded-integrity-monitoring \
--shielded-secure-boot \
--shielded-vtpm \
--labels label \
--gce-pd-kms-key $KMS_KEY \
--service-account $SERVICE_ACCOUNT \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--zone "" \
--max-idle 3600s
As discussed in Google Dataproc Pyspark - BigQuery connector is super slow, I ran the same job without the deidentifier transformation shown below:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap deidentify() in a UDF that skips null and empty values
udf_deidentifier = udf(
    lambda x: deidentify(
        content=x,
        project_id_processing=args.project_id_processing,
    )
    if x is not None and x != ""
    else None,
    StringType(),
)

# Apply the UDF to both columns
deidentified_df = transformed_df.withColumn(
    colName="col1", col=udf_deidentifier("col1")
).withColumn(colName="col2", col=udf_deidentifier("col2"))
It took 23 seconds to process a file with approximately 20,000 rows. I conclude that this transformation is what was delaying the job, but I still don't know whether I should keep applying it with the withColumn method.
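Would something like the following pandas UDF be the right direction instead of the plain udf? This is only an untested sketch: deidentify() and args.project_id_processing are the same objects as in the code above, and rows would reach Python in Arrow batches rather than one at a time.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType


@pandas_udf(StringType())
def deidentify_batch(content: pd.Series) -> pd.Series:
    # deidentify() is still called once per value; the batching only reduces the
    # per-row serialization overhead between Spark and Python.
    return content.map(
        lambda x: deidentify(content=x, project_id_processing=args.project_id_processing)
        if x is not None and x != ""
        else None
    )


deidentified_df = transformed_df.withColumn(
    "col1", deidentify_batch("col1")
).withColumn("col2", deidentify_batch("col2"))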