python python-3.x beautifulsoup rss rss-reader

Convert a string into a Beautiful Soup object


I'm pretty new to Python and to posting here, so any help would be much appreciated! I'm trying to use Beautiful Soup to dynamically parse over 30 different RSS blog feeds. Surprisingly, they are not standard. So I started by creating a list of all the potential XML tags I want to grab, which I named headers:

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

Then I grab all the tags from the first item of the RSS feed I'm trying to scrape and put them into their own list, named tags:

import requests
from bs4 import BeautifulSoup as bs
requests.packages.urllib3.disable_warnings()

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

url = 'https://www.zdnet.com/blog/security/rss.xml'
resp = requests.get(url, verify=False)
soup = bs(resp.text, features='xml')
data = soup.find_all('item')

tags = [tag.name for tag in data[0].find_all()]
print(tags)

Then I build a new list of tags, n_tags, containing the elements that appear in both lists:

n_tags = [i for i in headers if i in tags]
print(n_tags)

Then I iterate through all the items in data (all the blog posts on the page) and, for each one, through all the elements in n_tags (the tags that are relevant to that feed). Where I get stuck is that n_tags is a list of strings, not soup objects.

The manual way of parsing a feed is:

for item in data:
    print(item.title.text)
    print(item.description.text)
    print(item.pubDate.text)
    print(item.credit.text)
    print(item.link.text)

However, I want to iterate through the list of tags and insert each one into the code to get the content of that XML tag:

for item in data:
    for el in n_tags:
        content = item + "." + el + ".text"
        print(content)

This returns an error:

TypeError: unsupported operand type(s) for +: 'Tag' and 'str'

I need to turn the string from the list into a soup "Tag" object so I can concatenate them. I tried casting the Tag object to a string and then using the resulting string as an attribute of the soup object, but it didn't work. It didn't raise an error; it just returned None:

content = str(item) + "." + el + ".text"
print(soup.content)

The closest I got is:

for item in data:
    for el in n_tags:
        content = str(item) + "." + el + ".text"
        print(content)

It actually returns content, but not what I'm looking for: the ".text" doesn't seem to be applied, and the whole blog post markup is repeated for every element in the list.

I'm out of ideas. Thanks for reading, and let me know if you have any questions.


Solution

  • I'm not sure I understand your question correctly, but it seems you're trying to select text from only specific elements inside the RSS feed.

    You can try this script to do that (using a CSS selector):

    import requests
    from bs4 import BeautifulSoup as bs
    
    url = 'https://www.zdnet.com/blog/security/rss.xml'
    soup = bs(requests.get(url).content, 'html.parser')
    
    # Tag names to extract; names missing from the feed simply match nothing.
    headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']
    
    # Joining with ',' builds one CSS selector group: 'title,description,...'
    for tag in soup.select(','.join(headers)):
        print(tag.text)
    

    Prints:

    ZDNet | security RSS
    
    Tue, 05 May 2020 00:15:23 +0000
    
    ZDNet | security RSS
    
    US financial industry regulator warns of widespread phishing campaign
    FINRA warns of phishing campaign aimed at stealing members' Microsoft Office or SharePoint passwords.
    Mon, 04 May 2020 23:29:00 +0000
    
    Academics turn PC power units into speakers to leak secrets from air-gapped systems
    POWER-SUPPLaY technique uses "singing capacitor" phenomenon for data exfiltration.
    Mon, 04 May 2020 16:06:00 +0000
    
    ... and so on.
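
To address the loop from the question more directly: a tag does not have to be built from the string at all, because `find()` accepts the tag name as a string. The following is only a minimal sketch under a few assumptions: `SAMPLE_RSS` is made-up inline data standing in for a real feed, `extract_items` is a hypothetical helper name, and the `xml` parser feature requires lxml to be installed (as in the question's own code).

```python
from bs4 import BeautifulSoup

# Made-up sample feed (hypothetical data, not taken from ZDNet).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description>Summary of the first post.</description>
      <pubDate>Mon, 04 May 2020 23:29:00 +0000</pubDate>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <author>Jane Doe</author>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']


def extract_items(xml_text, wanted):
    """Return one dict per <item>, keeping only the wanted tags that exist."""
    soup = BeautifulSoup(xml_text, features='xml')  # assumes lxml is installed
    rows = []
    for item in soup.find_all('item'):
        row = {}
        for name in wanted:
            tag = item.find(name)  # look the tag up by its name string
            if tag is not None:    # skip tags this item does not have
                row[name] = tag.get_text(strip=True)
        rows.append(row)
    return rows


rows = extract_items(SAMPLE_RSS, headers)
for row in rows:
    print(row)
```

Because each item is checked separately, the intermediate n_tags list is not needed, and feeds whose items carry different tags are handled without an error. `getattr(item, name)` should behave the same way here, since attribute access such as `item.title` is shorthand for `item.find('title')`.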