
Python: extract URLs from an XML sitemap that contain a certain word


I'm trying to extract every URL from a sitemap whose address contains the word foo. I've managed to extract all the URLs, but I can't figure out how to keep only the ones I want. In the example below, only the URLs for apples and pears should be returned.

<url>
<loc>
https://www.example.com/p-1224-apples-foo-09897.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.example.com/p-1433-pears-foo-00077.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.example.com/p-3411-oranges-ping-66554.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>

Solution

  • I made the XML well-formed by wrapping it in a single root element (<urls> and </urls>) and saved it as src.xml:

    <urls>
    <url>
    <loc>
    https://www.example.com/p-1224-apples-foo-09897.php
    </loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
    </url>
    <url>
    <loc>
    https://www.example.com/p-1433-pears-foo-00077.php
    </loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
    </url>
    <url>
    <loc>
    https://www.example.com/p-3411-oranges-ping-66554.php
    </loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
    </url>
    </urls>
    

    Then parse the file with xml.etree.ElementTree and keep only the URLs that contain foo:

    >>> import xml.etree.ElementTree as ET
    >>> tree = ET.parse('src.xml')
    >>> root = tree.getroot()
    >>> for loc in root.findall('url/loc'):
    ...     # loc.text includes the newlines around the URL, so strip them
    ...     link = (loc.text or '').strip()
    ...     if 'foo' in link:
    ...         print(link)
    ...
    https://www.example.com/p-1224-apples-foo-09897.php
    https://www.example.com/p-1433-pears-foo-00077.php
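
One caveat for real sitemap files: they normally declare the sitemaps.org namespace on the root <urlset> element, and then a plain findall('url') matches nothing because every tag is stored as {namespace}url. Below is a minimal sketch that sidesteps this by comparing only the local tag name. It assumes the XML is already in a string; the names urls_containing and sample are made up for this illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET


def urls_containing(xml_text, word):
    """Return the <loc> URLs in xml_text whose address contains word."""
    root = ET.fromstring(xml_text)
    matches = []
    # iter() walks every element, so the name of the root does not matter.
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name != "loc":
            continue
        # Strip the whitespace and newlines that surround the URL.
        link = (element.text or "").strip()
        if word in link:
            matches.append(link)
    return matches


# Hypothetical sample: the question's URLs inside a namespaced <urlset>.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>
https://www.example.com/p-1224-apples-foo-09897.php
</loc></url>
<url><loc>
https://www.example.com/p-1433-pears-foo-00077.php
</loc></url>
<url><loc>
https://www.example.com/p-3411-oranges-ping-66554.php
</loc></url>
</urlset>"""

for link in urls_containing(sample, "foo"):
    print(link)
```

The same function also works on the un-namespaced src.xml content above, since a tag without a namespace is left unchanged by the rsplit call.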