azure apache-spark databricks

How can I read an XML file in Azure Databricks Spark?


I looked for information on the MSDN forums but couldn't find a good forum. While reading the Spark site I got the hint that I would have better chances here. Bottom line: I want to read from Blob storage, where there is a continuous feed of XML files, all of them small; in the end we store these files in an Azure DW. Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the library xml.etree.ElementTree, but I can't get it imported. Any help pushing me in a good direction is appreciated.


Solution

  • I found this notebook really helpful: https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb

    The author also has a YouTube video that walks through the steps.

    In summary, there are two approaches for installing the XML library used in the notebook:

    1. Install it on your Databricks cluster from the 'Library' tab.
    2. Install it by launching spark-shell from the notebook itself.
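
Once the library is attached, reading the files might look roughly like the sketch below. It assumes the library in question is spark-xml (Maven coordinate com.databricks:spark-xml_2.12:<version>), which is my reading of the linked notebook and not something stated above. The helper names, the "record" row tag, and the placeholder path are all made up for illustration.

```python
# Sketch only: assumes the spark-xml package is installed on the cluster.
# Helper names, the row tag and the path below are hypothetical.


def xml_reader_options(row_tag):
    """Build the reader options; rowTag names the XML element that becomes one row."""
    return {"rowTag": row_tag}


def read_xml(spark, path, row_tag):
    """Load XML files from `path` into a DataFrame using the spark-xml data source."""
    return (
        spark.read.format("com.databricks.spark.xml")
        .options(**xml_reader_options(row_tag))
        .load(path)
    )


# In a Databricks notebook, `spark` is already defined, so usage could be:
# df = read_xml(
#     spark,
#     "wasbs://<container>@<account>.blob.core.windows.net/<folder>/*.xml",
#     "record",
# )
# df.printSchema()
```

The row tag has to match the repeating element in your own files, and the cluster still needs credentials for the storage account before the path will resolve.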