[SOLVED] Databricks xpath_string not allowing quotes

Databricks xpath_string not allowing quotes

In Microsoft Azure Databricks my column contains an XML string, and I want to pull out the href where the body is ltr__::url if that exists. I tried this:

SELECT xpath_string(my_column, 'string(//a[text()="ltr__::url"][1]/@href)')
FROM ...

It gives an error saying

Error loading expression 'string(//a[text()="ltr__::url"][1]/@href)'

If I use an online XPath evaluator with that xpath it parses correctly.

Solution

Error loading expression 'string(//a[text()="ltr__::url"][1]/@href)'

ERROR due to the specific XPath functions or syntax supported by Databricks.

I have tried the below approach:

from pyspark.sql.functions import regexp_extract
xml_df = spark.read.format("delta").load(source_path)
pattern = r'<a\s+href="([^"]+)">ltr__::url</a>'
xml_df = xml_df.withColumn("href", regexp_extract("xml_string", pattern, 1))
display(xml_df)

Results:

id  xml_string  href
4   <a href="http://example.io">ltr__::url</a><a href="http://example.xyz">other__::url</a> http://example.io
2   <a href="http://example.org">other__::url</a>   
5   <a href="http://example.dev">not_matching</a>   
1   <a href="http://example.com">ltr__::url</a> http://example.com
3   <a href="http://example.net">ltr__::url</a> http://example.net

In the above code reading the data from the Delta table I have defined a regex pattern to extract href when the text is "ltr__::url" & extracting href using regexp_extract