My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.

zipContainer/deviceA/component1/20220301.zip
The resulting unzipped files are stored in another container, with the hierarchy preserved via the sink's copy behavior option, e.g.

unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv
I enabled logging on the copy activity and provided the folder path to store the generated logs (in txt format). The logs have the following structure:
| Timestamp | Level | OperationName | OperationItem | Message |
|---|---|---|---|---|
| 2022-03-01 15:14:06.9880973 | Info | FileWrite | "deviceA/component1/2022.zip/measurements_01.csv" | "Complete writing file. File is successfully copied." |
I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:
Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")
The following exception is returned:
Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
The logs generated by the copy activity are of the append blob type, whereas read.df() can read block blobs without any issue.
Given the above scenario, how can I read these logs successfully into my R session in Databricks?
According to the following Microsoft documentation, the Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs:
https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types
When you try to read a log file of the append blob type, it fails with the error: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
So you cannot read log files of the append blob type from a blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging instead. When you run the pipeline with ADLS Gen2 as the log destination, it creates log files of the block blob type, which you can then read from Databricks without any issue (see the sketch below).
Using blob storage for logging:
Using ADLS Gen2 for logging:
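Once the logs land in ADLS Gen2 as block blobs, a minimal SparkR sketch along these lines can read them and derive the complete csv paths. The abfss:// log path and the unzipContainer prefix are placeholders to replace with your own storage account, log folder and sink container; the column names follow the log structure shown in the question.

```r
library(SparkR)

# Placeholder ADLS Gen2 location of the copy activity logs -- replace with your own account/container/folder.
log_path <- "abfss://logs@<storage-account>.dfs.core.windows.net/copy-activity-logs/"

# The log files are comma-delimited text with a header row, so the original read.df call works unchanged.
Logs <- read.df(log_path, source = "csv", header = "true", delimiter = ",")

# Keep only the rows that report a successfully written file.
written <- filter(Logs, Logs$OperationName == "FileWrite" & like(Logs$Message, "%successfully copied%"))

# OperationItem holds the path relative to the sink container,
# e.g. deviceA/component1/20220301.zip/measurements_01.csv
items <- collect(select(written, "OperationItem"))$OperationItem

# Strip any surrounding quotes and prefix the sink container to get the complete paths.
csv_paths <- paste0("unzipContainer/", gsub('"', "", items))
```

Filtering on OperationName keeps only the file-level write entries, so folder-level or error rows in the log do not end up in the list of paths.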