python xml

Reading and storing XML data in arrays


Sorry if this is a basic question, but I can't find my way through it. I have an .xml file that looks like this:


<?xml version="1.0" encoding="utf-8"?>
<tags>
  <row Id="1" TagName="bayesian" Count="1342" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="168" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="6" />
  <row Id="4" TagName="normality" Count="191" ExcerptPostId="67815" WikiPostId="67814" />
  <row Id="5" TagName="open-source" Count="13" />
  <row Id="6" TagName="distributions" Count="1880" ExcerptPostId="8046" WikiPostId="8045" />
  <row Id="9" TagName="machine-learning" Count="2564" ExcerptPostId="9066" WikiPostId="9065" />
  <row Id="10" TagName="dataset" Count="514" ExcerptPostId="20490" WikiPostId="20489" />
  <row Id="11" TagName="sample" Count="219" ExcerptPostId="28276" WikiPostId="28275" />
  <row Id="12" TagName="population" Count="120" ExcerptPostId="69287" WikiPostId="69286" />
  <row Id="15" TagName="measurement" Count="97" ExcerptPostId="66319" WikiPostId="66318" />
  <row Id="16" TagName="scales" Count="157" />

All I need to do is read this .xml file and store the data in arrays so that I can analyze it. I took the following steps:

import xml.etree.ElementTree as ET
tree = ET.parse('Tags.xml')
root = tree.getroot()

print root
<Element 'tags' at 0x10365d810>
In [37]: root.attrib
Out[37]: {}

root.getchildren
Out[38]: <bound method Element.getchildren of <Element 'tags' at 0x10365d810>>

In [39]: root.getiterator
Out[39]: <bound method Element.getiterator of <Element 'tags' at 0x10365d810>>

In [40]: root.items
Out[40]: <bound method Element.items of <Element 'tags' at 0x10365d810>>

In [41]: root.keys
Out[41]: <bound method Element.keys of <Element 'tags' at 0x10365d810>>

Somehow, I can't find the step to read the columns. Thanks for your help; I am very new to Python and XML.

Prakash


Solution

  • Iterate over root to get its children (the row elements). Each child has a dict member called attrib. In your XML, the .attrib member will contain all of the data you need, with every value stored as a string.

    import xml.etree.ElementTree as ET
    tree = ET.parse('Tags.xml')
    root = tree.getroot()
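
    # Map each TagName to that row's attribute dict.
    tags = {tag.attrib['TagName']: tag.attrib for tag in root}

    # Rows without a given attribute (e.g. WikiPostId) raise KeyError here.
    print(tags['bayesian']['WikiPostId'])
    print(tags['scales']['Count'])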
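
Since the question asks for arrays, the same idea can be sketched with one list per column. This is only an illustration under a few assumptions: the `column` helper and the inline `SAMPLE` string are hypothetical (not part of the original answer), the sample reuses three rows from the question, and missing attributes such as `WikiPostId` are filled with `None` rather than raising `KeyError`.

```python
import xml.etree.ElementTree as ET

# Inline sample in the same shape as Tags.xml (three rows copied from the question).
# For the real file, replace this with: root = ET.parse('Tags.xml').getroot()
SAMPLE = """<tags>
  <row Id="1" TagName="bayesian" Count="1342" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="3" TagName="elicitation" Count="6" />
  <row Id="16" TagName="scales" Count="157" />
</tags>"""

root = ET.fromstring(SAMPLE)


def column(rows, name, convert=str, default=None):
    """Hypothetical helper: collect one attribute from every row.

    Rows that lack the attribute get `default` instead of raising KeyError.
    """
    values = []
    for row in rows:
        raw = row.attrib.get(name)  # attribute values are always strings
        values.append(convert(raw) if raw is not None else default)
    return values


rows = list(root)  # iterating an Element yields its direct children
ids = column(rows, "Id", int)
names = column(rows, "TagName")
counts = column(rows, "Count", int)
wiki_ids = column(rows, "WikiPostId", int)

print(names)
print(counts)
print(wiki_ids)
```

Keeping one list per column means `names[i]`, `counts[i]`, and `wiki_ids[i]` all describe the same row, which makes the data easy to pass to NumPy or a plotting library later if that is where the analysis is headed.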