python pandas xpath readxml

pandas.read_xml() unexpected behaviour


I am trying to understand why the code:

import pandas

xml = '''
<ROOT>
  <ELEM atr="anything">1</ELEM>
  <ELEM atr="anything">2</ELEM>
  <ELEM atr="anything">3</ELEM>
  <ELEM atr="anything">4</ELEM>
  <ELEM atr="anything">5</ELEM>
  <ELEM atr="anything">6</ELEM>
  <ELEM atr="anything">7</ELEM>
  <ELEM atr="anything">8</ELEM>
  <ELEM atr="anything">9</ELEM>
  <ELEM atr="anything">10</ELEM>
</ROOT>
'''
df = pandas.read_xml(xml, xpath='/ROOT/ELEM')
print(df.to_string())

... works as expected and prints:

        atr  ELEM
0  anything     1
1  anything     2
2  anything     3
3  anything     4
4  anything     5
5  anything     6
6  anything     7
7  anything     8
8  anything     9
9  anything    10

Yet the following code:

import pandas

xml = '''
<ROOT>
  <ELEM>1</ELEM>
  <ELEM>2</ELEM>
  <ELEM>3</ELEM>
  <ELEM>4</ELEM>
  <ELEM>5</ELEM>
  <ELEM>6</ELEM>
  <ELEM>7</ELEM>
  <ELEM>8</ELEM>
  <ELEM>9</ELEM>
  <ELEM>10</ELEM>
</ROOT>
'''
df = pandas.read_xml(xml, xpath='/ROOT/ELEM')
print(df.to_string())

results in the error:

ValueError: xpath does not return any nodes or attributes. Be sure to
specify in `xpath` the parent nodes of children and attributes to
parse. If document uses namespaces denoted with xmlns, be sure to
define namespaces and use them in xpath.

I have read the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html

And also checked my xpath here (code above is just a minimal example, actual XML I use is more complex): https://freeonlineformatter.com/xpath-validator/

In a nutshell, I need to read a list of XML child elements at a known xpath into a pandas DataFrame. The child elements have no attributes, but all of them have text values. I want to get a DataFrame with one column containing these values. What am I doing wrong?


Solution

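  • If you check the documentation, pandas expects the XML to have rows with columns: each node matched by xpath is a row, and its attributes and child elements supply the columns. In your first example, each <ELEM> is a row, and atr is the column. In your second example, the <ELEM> nodes have neither attributes nor children, so there are no columns, which is what the error message is telling you. If you had <ELEM><VAL>1</VAL></ELEM>, it should work, because VAL would be the column.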

    https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
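
    If you cannot change the XML, one possible workaround is to skip read_xml and collect the text values yourself. The snippet below is only a minimal sketch under some assumptions: it uses the standard-library xml.etree.ElementTree parser, a shortened copy of the XML from the question, and an arbitrary column name "ELEM". With more complex XML the path passed to findall would need adjusting.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened copy of the XML from the question (text-only elements, no attributes).
xml = """
<ROOT>
  <ELEM>1</ELEM>
  <ELEM>2</ELEM>
  <ELEM>3</ELEM>
</ROOT>
"""

# fromstring returns the root element, here <ROOT>.
root = ET.fromstring(xml)

# Collect the text of every <ELEM> that is a direct child of the root.
values = [elem.text for elem in root.findall("ELEM")]

# Build a one-column DataFrame; the column name "ELEM" is an arbitrary choice.
df = pd.DataFrame({"ELEM": values})

# The text values are strings, so convert them to numbers explicitly.
df["ELEM"] = pd.to_numeric(df["ELEM"])

print(df.to_string())
```

    The conversion with pd.to_numeric is optional; leave it out if the text values are not numeric.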