sqlapache-spark-sqldatabricksdatabricks-sql

Spark SQL XML Parsing


I am trying to parse XML and extract node value within XML using Spark-SQL ,

Consider below XML

<Account><ID1><other><ID>123</ID></other></ID1></Account>

Here I am trying to extract 123 using XPath but not getting expected Result.

I appreciated any help here.


Solution

  • to parse XML and extract node values in azure databricks using spark sql, you can use xpath_string function. example:

    from pyspark.sql import SparkSession
    
    # Initialize Spark session
    spark = SparkSession.builder \
        .appName("XML Parsing with Spark SQL") \
        .getOrCreate()
    
    # Sample data with XML content
    data = [
        ('<Account><ID1><other><ID>123</ID></other></ID1></Account>',),
        ('<Account><ID1><other><ID>456</ID></other></ID1></Account>',)
    ]
    
    # Define column names
    columns = ['xml_data']
    
    # Create DataFrame
    df = spark.createDataFrame(data, columns)
    
    # Show the DataFrame content
    df.show(truncate=False)
    
    # Register the DataFrame as a temp view
    df.createOrReplaceTempView("xml_table")
    
    # Execute SQL query to extract values using XPath
    result = spark.sql("""
        SELECT 
            xpath_string(xml_data, 'Account/ID1/other/ID/text()') AS extracted_id 
        FROM 
            xml_table
    """)
    
    # Show the results
    result.show(truncate=False)
    
    # Stop the Spark session
    spark.stop()
    

    output:

    extracted_id
    123
    456