
How to use a regex to find quoted attribute values in an OPML (XML) file


I am searching through an OPML file that looks something like this. I want to pull out the outline text and the xmlUrl.

  <outline text="lol">
  <outline text="Discourse on the Otter" xmlUrl="http://discourseontheotter.tumblr.com/rss" htmlUrl="http://discourseontheotter.tumblr.com/"/>
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
  </outline>

My function:

 import re
 rssName = 'outline text="(.*?)"'
 rssUrl =  'xmlUrl="(.*?)"'

 def rssSearch():
     doc = open('ttrss.txt')
     for line in doc:
        if "xmlUrl" in line:
            mName = re.search(rssName, line)
            mUrl = re.search(rssUrl, line)
            if mName is not None:
                print mName.group()
                print mUrl.group()

However, the printed output comes out as:

 outline text="fedoras of okc"
 xmlUrl="http://fedorasofokc.tumblr.com/rss"

What is the proper regular expression for rssName and rssUrl so that I get only the string between the quotes?


Solution

  • Don't use regular expressions to parse XML. The code is messy, and there are too many things that can go wrong.

    For example, what if your OPML provider happens to reformat their output like this:

    <outline text="lol">
      <outline
          htmlUrl="http://discourseontheotter.tumblr.com/"
          xmlUrl="http://discourseontheotter.tumblr.com/rss"
          text="Discourse on the Otter"
      />
      <outline
          htmlUrl="http://fedorasofokc.tumblr.com/"
          xmlUrl="http://fedorasofokc.tumblr.com/rss"
          text="fedoras of okc"
      />
    </outline>
    

    That's perfectly valid XML, and it means exactly the same thing. But both the line-oriented search and regular expressions like 'outline text="(.*?)"' will break, because each element now spans several lines and text is no longer the first attribute.
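To make the failure concrete, here is a minimal sketch that wraps the question's line-by-line search in a hypothetical helper, search_lines, and runs it on both layouts. The helper name and the shortened sample strings are assumptions for illustration; the two patterns come from the question. It uses group(1), which returns only the first capture group (the text between the quotes), whereas group() with no argument returns the whole match.

```python
import re

# Patterns taken from the question.
RSS_NAME = r'outline text="(.*?)"'
RSS_URL = r'xmlUrl="(.*?)"'


def search_lines(lines):
    """Line-oriented search from the question, returning (text, url) pairs."""
    found = []
    for line in lines:
        if "xmlUrl" in line:
            m_name = re.search(RSS_NAME, line)
            m_url = re.search(RSS_URL, line)
            if m_name is not None and m_url is not None:
                # group(1) is the first capture group: the text between the quotes.
                found.append((m_name.group(1), m_url.group(1)))
    return found


# Shortened copy of the one-line layout from the question.
ORIGINAL = '''<outline text="lol">
<outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
</outline>'''

# The same data, with attributes reordered and spread over several lines.
REFORMATTED = '''<outline text="lol">
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>'''

original_hits = search_lines(ORIGINAL.splitlines())
reformatted_hits = search_lines(REFORMATTED.splitlines())
print(original_hits)
print(reformatted_hits)
```

With the one-line layout the search finds the feed; with the reformatted layout it finds nothing, because no single line contains both 'outline text=' and 'xmlUrl'.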

    Instead, use an XML parser. Your code will be cleaner, simpler, and more reliable:

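    # xml.etree.ElementTree works on Python 2.7 and 3; cElementTree was removed in Python 3.9.
    import xml.etree.ElementTree as ET

    # Parse the whole file, then visit every <outline> element at any depth.
    root = ET.parse('ttrss.txt').getroot()
    for outline in root.iter('outline'):
        text = outline.get('text')
        xmlUrl = outline.get('xmlUrl')
        # Category outlines (such as "lol") have no xmlUrl, so skip them.
        if text and xmlUrl:
            print(text)
            print(xmlUrl)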
    

    This handles both your OPML snippet and similar OPML files I found on the web, like this political science list. It's also very simple, with nothing tricky about it. (I'm not bragging; that's just the benefit you get from using an XML parser instead of regular expressions.)
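As a quick check that the parser approach does not depend on layout, the following sketch runs the same logic on in-memory strings instead of a file. The function name feeds and the two sample constants are hypothetical; ET.fromstring is the standard-library call for parsing XML from a string. Under those assumptions, both layouts yield the same list of (text, xmlUrl) pairs.

```python
import xml.etree.ElementTree as ET


def feeds(opml_text):
    """Return (text, xmlUrl) pairs for every outline that has both attributes."""
    root = ET.fromstring(opml_text)
    pairs = []
    # iter() walks the root element and all of its descendants.
    for outline in root.iter('outline'):
        text = outline.get('text')
        xml_url = outline.get('xmlUrl')
        if text and xml_url:
            pairs.append((text, xml_url))
    return pairs


# One element per line, as in the question.
ONE_LINE = '''<outline text="lol">
<outline text="Discourse on the Otter" xmlUrl="http://discourseontheotter.tumblr.com/rss" htmlUrl="http://discourseontheotter.tumblr.com/"/>
<outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
</outline>'''

# Same data, attributes reordered and split across lines.
MULTI_LINE = '''<outline text="lol">
  <outline
      htmlUrl="http://discourseontheotter.tumblr.com/"
      xmlUrl="http://discourseontheotter.tumblr.com/rss"
      text="Discourse on the Otter"
  />
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>'''

one_line_feeds = feeds(ONE_LINE)
multi_line_feeds = feeds(MULTI_LINE)
for text, xml_url in multi_line_feeds:
    print(text)
    print(xml_url)
```

The outer "lol" outline is skipped in both cases because it has no xmlUrl attribute.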