In Microsoft Azure Databricks my column contains an XML string, and I want to pull out the href where the body is ltr__::url if that exists. I tried this:
SELECT xpath_string(my_column, 'string(//a[text()="ltr__::url"][1]/@href)')
FROM ...
It gives an error saying
Error loading expression 'string(//a[text()="ltr__::url"][1]/@href)'
If I use an online XPath evaluator with that xpath it parses correctly.
Error loading expression 'string(//a[text()="ltr__::url"][1]/@href)'
ERROR due to the specific XPath
functions or syntax supported by Databricks.
I have tried the below approach:
from pyspark.sql.functions import regexp_extract
xml_df = spark.read.format("delta").load(source_path)
pattern = r'<a\s+href="([^"]+)">ltr__::url</a>'
xml_df = xml_df.withColumn("href", regexp_extract("xml_string", pattern, 1))
display(xml_df)
Results:
id xml_string href
4 <a href="http://example.io">ltr__::url</a><a href="http://example.xyz">other__::url</a> http://example.io
2 <a href="http://example.org">other__::url</a>
5 <a href="http://example.dev">not_matching</a>
1 <a href="http://example.com">ltr__::url</a> http://example.com
3 <a href="http://example.net">ltr__::url</a> http://example.net
In the above code reading the data from the Delta table
I have defined a regex pattern
to extract href when the text is "ltr__::url"
& extracting href using regexp_extract