databricksazure-databricksdatabricks-sql

Databricks xpath_string not allowing quotes


In Microsoft Azure Databricks my column contains an XML string, and I want to pull out the href where the body is ltr__::url if that exists. I tried this:

SELECT xpath_string(my_column, 'string(//a[text()="ltr__::url"][1]/@href)')
FROM ...

It gives an error saying

Error loading expression 'string(//a[text()="ltr__::url"][1]/@href)'

If I use an online XPath evaluator with that xpath it parses correctly.


Solution

  • Error loading expression 'string(//a[text()="ltr__::url"][1]/@href)'

    ERROR due to the specific XPath functions or syntax supported by Databricks.

    I have tried the below approach:

    from pyspark.sql.functions import regexp_extract
    xml_df = spark.read.format("delta").load(source_path)
    pattern = r'<a\s+href="([^"]+)">ltr__::url</a>'
    xml_df = xml_df.withColumn("href", regexp_extract("xml_string", pattern, 1))
    display(xml_df)
    

    Results:

    id  xml_string  href
    4   <a href="http://example.io">ltr__::url</a><a href="http://example.xyz">other__::url</a> http://example.io
    2   <a href="http://example.org">other__::url</a>   
    5   <a href="http://example.dev">not_matching</a>   
    1   <a href="http://example.com">ltr__::url</a> http://example.com
    3   <a href="http://example.net">ltr__::url</a> http://example.net
    

    In the above code reading the data from the Delta table I have defined a regex pattern to extract href when the text is "ltr__::url" & extracting href using regexp_extract