Tags: azure, pyspark, serialization, databricks, azure-data-lake-gen2

Is there a way to write the contents of images stored in a Spark DataFrame out to files in parallel with PySpark?


I have a Spark DataFrame in which every row contains two items: a file name (with an extension, for instance .jpg) and the content of that file in bytes. I would like to write a process that takes each row of the DataFrame, converts the bytes back into a .jpg image, and stores it in an ADLS container.

Everything needs to run inside a Databricks cluster, so I use PySpark to create the DataFrame, and I would like to use it to write those files to the destination in parallel.

However, I run into trouble when I use the azure-storage library to write those files from inside a map function. The function consume_row uses the library to create each file and write its content, and it is called like this:

results_rdd = rdd.map(lambda row: consume_row(row, ...))

It returns the following error:

PicklingError: Could not serialize object: TypeError: cannot pickle '_thread._local' object
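
For context, this is roughly the pattern that triggers the error. A minimal sketch, assuming the token comes from an azure-identity credential object and the storage client is built on the driver; the credential type, account URL, container name, and column names are all hypothetical:

```python
# Hypothetical reproduction of the failing pattern (not the actual code).
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Credential objects like this keep thread-local state internally,
# which is exactly what pickle refuses to serialize.
credential = DefaultAzureCredential()
service_client = BlobServiceClient(
    "https://<account>.blob.core.windows.net", credential=credential
)

def consume_row(row, service_client, container):
    # Write the row's bytes as a blob named after the row's file name.
    blob_client = service_client.get_blob_client(container=container, blob=row["file_name"])
    blob_client.upload_blob(row["content"], overwrite=True)
    return row["file_name"]

# The lambda closes over service_client (and, transitively, credential),
# so Spark tries to pickle them when shipping the task to the workers
# and fails with the '_thread._local' error above.
results_rdd = rdd.map(lambda row: consume_row(row, service_client, "images"))
```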

Has anyone tried to do anything similar to this?


Solution

  • The problem was inside the function consume_row. We were using a variable to store the API token which, underneath, kept the token in a thread-local Python object, and that object cannot be pickled and sent to the workers. We just needed to pass the token itself to the function, and everything works perfectly.
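
A minimal sketch of the fixed pattern, assuming the token is available as a plain string (for example a SAS token or account key); the account URL, container name, column names, and function signature here are hypothetical:

```python
# Hypothetical sketch of the fix: only plain strings cross the driver/worker
# boundary, and the blob client is created inside the function on the worker.
from azure.storage.blob import BlobClient

account_url = "https://<account>.blob.core.windows.net"
container = "images"
sas_token = "<sas-token-string>"  # a plain str, safe to pickle

def consume_row(row, account_url, container, sas_token):
    # Build the client on the worker from picklable arguments only.
    blob_client = BlobClient(
        account_url=account_url,
        container_name=container,
        blob_name=row["file_name"],
        credential=sas_token,
    )
    blob_client.upload_blob(row["content"], overwrite=True)
    return row["file_name"]

results_rdd = rdd.map(lambda row: consume_row(row, account_url, container, sas_token))
results_rdd.count()  # force evaluation so the writes actually run
```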