python, xml, wikimedia-dumps

Reading XML file tags


I want to read tag values such as <title> and <title_id> from an XML file. I can read the value of <title> successfully. Is it possible to read both <title> and <title_id> in the same loop?
Please help me, I'm new to XML.

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5/ http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">
      <siteinfo>
        <sitename>Wiki</sitename>
        <case>first-letter</case>
        <namespaces>
          <namespace key="0" case="first-letter" />
        </namespaces>
      </siteinfo>
      <page>
        <title>Sex</title>
        <title_id>31239628</title_id>
        <revision>
          <id>437708703</id>
          <timestamp>2011-07-04T13:53:52Z</timestamp>
          <text xml:space="preserve" bytes="6830">{{ Hello}}

    </text>
        </revision>
      </page>
    </mediawiki>

I'm using the following code to read all the titles from the file, and it's working fine.

import xml.etree.cElementTree as etree

tree = etree.parse('find_title.xml')
# Iterate over every <title> element in the tree and print its text
for value in tree.getiterator(tag='title'):
    print value.text

Solution

  • If you are going to be working with XML a lot, I'd suggest you familiarise yourself with XPath.

    Here's a quick snippet using my XML library of preference, lxml.

    from lxml import etree
    
    doc = etree.XML("""
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5/ http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">
      <siteinfo>
        <sitename>Wiki</sitename>
        <case>first-letter</case>
        <namespaces>
          <namespace key="0" case="first-letter" />
        </namespaces>
      </siteinfo>
      <page>
        <title>Sex</title>
        <title_id>31239628</title_id>
        <revision>
          <id>437708703</id>
          <timestamp>2011-07-04T13:53:52Z</timestamp>
          <text xml:space="preserve" bytes="6830">{{ Hello}}
          </text>
        </revision>
      </page>
    </mediawiki>
    """)
    
    def first(seq, default=None):
      # Return the first item of seq, or default if seq is empty
      for item in seq:
        return item
      return default
    
    # Map the 'mw' prefix to the dump's default namespace for use in XPath
    NSMAP = dict(mw="http://www.mediawiki.org/xml/export-0.5/")
    
    print first(doc.xpath('/mw:mediawiki/mw:page/mw:title/text()', namespaces=NSMAP))
    print first(doc.xpath('/mw:mediawiki/mw:page/mw:title_id/text()', namespaces=NSMAP))
    

    Yields:

    Sex
    31239628
    

    Update: supposing multiple <page> elements

    XPath queries mostly return node sequences (hence the first helper function above).
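
    For instance, even a query that happens to match a single node still comes back as a list. A minimal illustration, reusing the doc and NSMAP objects from the snippet above:

    titles = doc.xpath('/mw:mediawiki/mw:page/mw:title/text()', namespaces=NSMAP)
    print titles         # ['Sex'] -- a one-element list, not a bare string
    print first(titles)  # 'Sex'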

    You could use a single query that returned the values of both tags for all of the pages, but you would then have to pair them up yourself; if a subelement was missing from a page, you'd be out of step. You could write the query to ensure the subelements exist, though you might still want to know that there was a partial record. A sketch of that fragile approach follows.
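
    For illustration, here is a sketch of that single-query approach using an XPath union. It pairs the values positionally, which only works while every page carries both subelements in order:

    values = doc.xpath(
        '/mw:mediawiki/mw:page/mw:title/text()'
        ' | /mw:mediawiki/mw:page/mw:title_id/text()',
        namespaces=NSMAP)
    # Pair up (title, title_id); this silently misaligns if a tag is missing
    for title, title_id in zip(values[0::2], values[1::2]):
      print title, title_id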

    So my first answer to this would be to loop through the pages like so:

    # Iterate over each <page>, then query its children relative to that page
    for i, page in enumerate(doc.xpath('/mw:mediawiki/mw:page', namespaces=NSMAP)):
      title = first(page.xpath('./mw:title/text()', namespaces=NSMAP))
      title_id = first(page.xpath('./mw:title_id/text()', namespaces=NSMAP))
      print "Page %s: %s (%s)" % (i, title, title_id)
    

    Yielding:

    Page 0: Sex (31239628)
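
    Finally, to answer the original question with the standard-library module it already uses: yes, both tags can be read in one loop by iterating over the <page> elements rather than the <title> elements. A minimal sketch (assuming the file really declares the MediaWiki namespace shown in the sample, in which case tag names must be fully qualified):

    import xml.etree.cElementTree as etree

    # The dump's default namespace, in ElementTree's {uri}tag notation
    NS = '{http://www.mediawiki.org/xml/export-0.5/}'

    tree = etree.parse('find_title.xml')
    for page in tree.getiterator(tag=NS + 'page'):
      title = page.find(NS + 'title')
      title_id = page.find(NS + 'title_id')
      # find() returns None when a subelement is missing
      if title is not None and title_id is not None:
        print title.text, title_id.text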