python xml elementtree epg

Trying to search EPG XML-data


I'm trying to search an EPG (Electronic Program Guide) in XML format (XMLTV). I want to find all programmes that contain a specific text, for example which channels will show a specific football (soccer) game today. Sample data (the real data has more than 20000 elements):

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="TX" generator-info-url="http://epg.net:8000/">
<channel id="GaliTV.es">
    <display-name>GaliTV</display-name>
    <icon src="http://logo.com/logos/GaliTV.png"/>
</channel>
<programme start="20210814080000 +0200" stop="20210814085500 +0200" channel="GaliciaTV.es" >
    <title>A Catedral de Santiago e o Mestre Mateo</title>
    <desc>Serie de catedral de Santiago de Compostela.</desc>
</programme>
<programme start="20210815050000 +0200" stop="20210815055500 +0200" channel="GaliciaTV.es" >
    <title>santiago</title>
    <desc>Chili.</desc>
</programme>
</tv>

I want to display the <programme> attributes only if the title or desc element contains a specific text (case-insensitive). Using ElementTree, I tried this:

for title in root.findall("./programme/title"):
   match = re.search(r'Santiago',title.text)
   if match:
       print(title.text)

It will find a result, but:

  1. I get an error that I don't understand:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/re.py", line 146, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
  2. I don't know how to search case-insensitively; [Ss]antiago does not work.
  3. I want to return the result from the parent element (for example programme.attributes).

Solution

  • You don't need regex for that; try

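    # doc is the parsed document (the question calls it root)
    for title in doc.findall('.//programme//title'):
        # title.text is None for an empty <title/>, so guard before lower()
        if title.text and "santiago" in title.text.lower():
            print(title.text)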
    

    The output for your sample should be

    A Catedral de Santiago e o Mestre Mateo
    santiago
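
The TypeError in the question is most likely caused by a <title> element with no text: ElementTree then gives None for title.text, and re.search rejects it. If you would rather stay with regex, a minimal sketch could guard against None and pass re.IGNORECASE. The inline sample (including the third, empty-title programme) and the name matching_titles are illustrative assumptions, not part of the original post:

```python
import re
import xml.etree.ElementTree as ET

# Illustrative sample; the empty <title/> reproduces the None case.
SAMPLE = """<tv>
  <programme start="20210814080000 +0200" stop="20210814085500 +0200" channel="GaliciaTV.es">
    <title>A Catedral de Santiago e o Mestre Mateo</title>
  </programme>
  <programme start="20210815050000 +0200" stop="20210815055500 +0200" channel="GaliciaTV.es">
    <title>santiago</title>
  </programme>
  <programme start="20210815060000 +0200" stop="20210815065500 +0200" channel="GaliciaTV.es">
    <title/>
  </programme>
</tv>"""


def matching_titles(root, pattern):
    """Return the text of every <title> that matches pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for title in root.findall("./programme/title"):
        text = title.text or ""  # title.text is None for an empty element
        if regex.search(text):
            hits.append(text)
    return hits


root = ET.fromstring(SAMPLE)
for text in matching_titles(root, r"santiago"):
    print(text)
```

With re.IGNORECASE there is no need for a character class such as [Ss]antiago; the plain substring test with lower() remains the simpler option when no real pattern is involved.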
    

    EDIT:

    To get all the data from each programme try this:

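    for prog in doc.findall('.//programme'):
        # findtext with default='' avoids None when <title> is missing or empty
        title = prog.findtext('title', default='') or ''
        if "santiago" in title.lower():
            # read the attributes by name instead of relying on their order
            start, stop, channel = prog.get('start'), prog.get('stop'), prog.get('channel')
            desc = prog.findtext('desc', default='')
            print(start,stop,channel,'\n',title,'\n',desc)
            print('-----------')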
    

    Output:

    20210814080000 +0200 20210814085500 +0200 GaliciaTV.es 
     A Catedral de Santiago e o Mestre Mateo 
     Serie de catedral de Santiago de Compostela.
    -----------
    20210815050000 +0200 20210815055500 +0200 GaliciaTV.es 
     santiago 
     Chili.
    -----------
    

    I would also add that if the XML gets a little more complicated, it would probably be a good idea to switch from ElementTree to lxml, since the latter has better XPath support.
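
The question asked for a match on either title or desc, with the parent programme's attributes as the result. A rough sketch of one way to do that with plain ElementTree follows; find_programmes is a hypothetical helper name, and the inline sample simply repeats the two programmes from the question:

```python
import xml.etree.ElementTree as ET

# The two programmes from the question, inlined so the sketch is self-contained.
SAMPLE = """<tv>
  <programme start="20210814080000 +0200" stop="20210814085500 +0200" channel="GaliciaTV.es">
    <title>A Catedral de Santiago e o Mestre Mateo</title>
    <desc>Serie de catedral de Santiago de Compostela.</desc>
  </programme>
  <programme start="20210815050000 +0200" stop="20210815055500 +0200" channel="GaliciaTV.es">
    <title>santiago</title>
    <desc>Chili.</desc>
  </programme>
</tv>"""


def find_programmes(root, needle):
    """Return one dict per <programme> whose title or desc contains needle.

    The comparison is case-insensitive. Each dict holds the programme's
    attributes plus its title and desc text.
    """
    needle = needle.lower()
    results = []
    for prog in root.iter("programme"):
        # Fall back to "" so missing or empty elements never yield None.
        title = prog.findtext("title", default="") or ""
        desc = prog.findtext("desc", default="") or ""
        if needle in title.lower() or needle in desc.lower():
            results.append({**prog.attrib, "title": title, "desc": desc})
    return results


root = ET.fromstring(SAMPLE)
for hit in find_programmes(root, "Santiago"):
    print(hit["start"], hit["stop"], hit["channel"], hit["title"])
```

Returning dicts keeps the attribute names attached to their values, so the caller does not depend on the order in which the attributes appear in the file.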