
Sitemap XML parsing in Python 3.x


My XML structure is below:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>hello world 1</loc>
        <image:image>
            <image:loc>this is image loc 1</image:loc>
            <image:title>this is image title 1</image:title>
        </image:image>
        <lastmod>2019-06-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
    <url>
        <loc>hello world 2</loc>
        <image:image>
            <image:loc>this is image loc 2</image:loc>
            <image:title>this is image title 2</image:title>
        </image:image>
        <lastmod>2020-03-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
</urlset>

I want to get only:

hello world 1
hello world 2

My Python code is below:

import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()

for url in root.findall('url'):
    loc = url.find('loc').text
    print(loc)

Unfortunately, it prints nothing.

But when I change my XML to

<urlset>
    <url>
        <loc>hello world 1</loc>
        <lastmod>2019-06-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
    <url>
        <loc>hello world 2</loc>
        <lastmod>2020-03-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
</urlset>

it gives me the correct result:

hello world 1
hello world 2

What can I do to get the correct result without changing my XML? It doesn't make sense to modify a file of 10000+ lines.

TIA


Solution

  • The (inelegant) fix to your code is:

    import xml.etree.ElementTree as ET
    tree = ET.parse('test.xml')
    root = tree.getroot()
    
    # In find/findall, prefix namespaced tags with the full namespace in braces
    for url in root.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
        loc = url.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
        print(loc)
    

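    This is because you have to qualify your tag names with the namespace under which your XML is defined. The xmlns attribute on <urlset> declares a default namespace, so ElementTree stores each unprefixed tag as {http://www.sitemaps.org/schemas/sitemap/0.9}url rather than plain url, and a search for 'url' matches nothing. The details on how to use the find and findall methods with namespaces are from "Parse XML namespace with Element Tree findall".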
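
A somewhat tidier variant, offered as a sketch rather than the definitive approach: find and findall accept a second argument that maps prefixes to namespace URIs, so the full URI does not have to be repeated in every path. The prefix sm and the names SITEMAP, NS, locs and image_locs are assumptions chosen for this example. The XML is a trimmed copy of the sitemap from the question, inlined so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sitemap from the question, inlined so the sketch is self-contained.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>hello world 1</loc>
        <image:image>
            <image:loc>this is image loc 1</image:loc>
        </image:image>
    </url>
    <url>
        <loc>hello world 2</loc>
        <image:image>
            <image:loc>this is image loc 2</image:loc>
        </image:image>
    </url>
</urlset>
"""

# The prefixes are local to this script (hypothetical names);
# only the URIs have to match the ones declared in the XML.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

root = ET.fromstring(SITEMAP)

# Page URLs: <loc> directly under each <url>, both in the sitemap namespace.
locs = [url.find("sm:loc", NS).text for url in root.findall("sm:url", NS)]

# Image URLs: the same mapping resolves the image: prefix in a longer path.
image_locs = [el.text for el in root.findall("sm:url/image:image/image:loc", NS)]

for loc in locs:
    print(loc)
```

For the real file, replacing ET.fromstring(SITEMAP) with ET.parse('test.xml').getroot() should leave the rest unchanged. On Python 3.8 or later, the path '{*}url' matches a url element in any namespace, which is another way to avoid spelling out the URI.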