azure azure-databricks databricks-connect databricks-sql

Handling spaces in the abfss using COPY INTO with Azure Databricks

I am trying to use the COPY INTO statement in Databricks to ingest CSV files from Cloud Storage.

The problem is that the folder name has a space in it /AP Posted/ and when I try to refer to the path the code execution raises the below error:

Error in SQL statement: URISyntaxException: Illegal character in path at index 70: abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/

I googled the error and found articles advising to replace the space with "%20". This solution is not effective.

So, does someone knows how to solve it? Or the only solution is indeed to prevent spaces in naming folders.

This is my current Databricks SQL Code:

COPY INTO prod_gbs_gpdi.bronze_data.my_table
FROM 'abfss://gpdi-files@hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/'
FILEFORMAT = CSV
VALIDATE 500 ROWS
PATTERN = 'AP_SAPEX_KPI_001 - Posted Invoices in 2021_3.CSV'
FORMAT_OPTIONS(
    'header'='true', 
    'delimiter'=';', 
    'skipRows'='8', 
    'mergeSchema'='true',  --Whether to infer the schema across multiple files and to merge the schema of each file
    'encoding'='UTF-8', 
    'enforceSchema'='true',  --Whether to forcibly apply the specified or inferred schema to the CSV files
    'ignoreLeadingWhiteSpace'='true',
    'ignoreTrailingWhiteSpace'='true',
    'mode'='PERMISSIVE'  --Parser mode around handling malformed records
)
COPY_OPTIONS (
    'force' = 'true',  --If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
    'mergeSchema'= 'true'  --If set to true, the schema can be evolved according to the incoming data.
)

Solution

Trying to use the path where one of the folders has space, gave the same error.

enter image description here

To overcome this, you can specify the folder in PATTERN parameter as follows:

%sql
COPY INTO table1
FROM '/mnt/repro/op/'
FILEFORMAT = csv
PATTERN='has space/sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');

enter image description here

Or, giving path as path/has?space/ also works. But if there are multiple folders like has space, hasAspace, hasBspace etc., then this would not work as expected.

%sql
COPY INTO table2
FROM '/mnt/repro/op/has?space/'
FILEFORMAT = csv
PATTERN='sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');

enter image description here

Another alternative is to copy the file to dbfs using dbutils.fs.cp() and then use dbfs path to use COPY INTO.

dbutils.fs.cp('/mnt/repro/op/has space/sample1.csv','/FileStore/tables/mycsv.csv')

%sql
COPY INTO table3
FROM '/FileStore/tables/'
FILEFORMAT = csv
PATTERN='mycsv.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');

enter image description here