python parsing web-scraping beautifulsoup

Scraping website, but want to choose an img URL from a srcset and do it nine more times


I'm trying to scrape the BBC Sounds website for **all of the** 'currently playing' images. I'm not bothered about which size to use; 400w might be a good choice.

Below is a relevant excerpt from the HTML and my current Python script. A variation on this works brilliantly for the 'now playing' text, but I haven't been able to get it to work for the image URLs, which is what I'm after. I think that's probably because a) there are so many image URLs to choose from and b) there's whitespace in there, which no doubt the parser doesn't like. Please bear in mind that the HTML below is repeated about 10 times, once for each channel; I've included just one as an example. Thank you!

import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.co.uk/sounds"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

g_data = soup.find_all("div", {"class": "sc-o-responsive-image__img sc-u-circle"})

print g_data[0].text
print g_data[1].text
print g_data[2].text
print g_data[3].text
print g_data[4].text
print g_data[5].text
print g_data[6].text
print g_data[7].text
print g_data[8].text
print g_data[9].text


<div class="gel-layout__item sc-o-island"> 
<div class="sc-c-network-item__image sc-o-island" aria-hidden="true"> 
    <div class="sc-c-rsimage sc-o-responsive-image sc-o-responsive-image--1by1 sc-u-circle"> 
<img alt="" class="sc-o-responsive-image__img sc-u-circle" 
    src="https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg" srcSet="https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
    https://ichef.bbci.co.uk/images/ic/192x192/p07fzzgr.jpg 192w,
    https://ichef.bbci.co.uk/images/ic/224x224/p07fzzgr.jpg 224w,
    https://ichef.bbci.co.uk/images/ic/288x288/p07fzzgr.jpg 288w,
    https://ichef.bbci.co.uk/images/ic/368x368/p07fzzgr.jpg 368w,
    https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
    https://ichef.bbci.co.uk/images/ic/448x448/p07fzzgr.jpg 448w,
    https://ichef.bbci.co.uk/images/ic/496x496/p07fzzgr.jpg 496w,
    https://ichef.bbci.co.uk/images/ic/512x512/p07fzzgr.jpg 512w,
    https://ichef.bbci.co.uk/images/ic/576x576/p07fzzgr.jpg 576w,
    https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w" 
    sizes="(max-width: 400px) 34vw,(max-width: 600px) 25vw,17vw"/>

Solution

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://www.bbc.co.uk/sounds")
    soup = BeautifulSoup(r.text, 'html.parser')

    # The class is on the <img> tag itself, not on a <div>,
    # and the URL lives in the src attribute, not in the tag's text.
    for item in soup.find_all("img", {'class': 'sc-o-responsive-image__img sc-u-circle'}):
        print(item.get("src"))
    

    Output:

    https://ichef.bbci.co.uk/images/ic/400x400/p05mpj80.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07dg040.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07zml97.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p0428n3t.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p01lyv4b.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p06yphh0.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p05v4t1c.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p06z9zzc.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p06x0hxb.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p06n253f.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p060m6jj.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07l4fjw.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p03710d6.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p078qrgm.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
    https://ichef.bbci.co.uk/images/ic/400x400/p03crmyc.jpg
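
If you would rather pick a specific size out of the srcset than rely on src, here is a minimal sketch of one way to do it. The helper names `parse_srcset` and `pick_from_srcset` are made up for illustration, and the sketch assumes every candidate is a "URL width" pair and that the URLs contain no commas, which holds for the excerpt in the question but not for every srcset in the wild.

```python
def parse_srcset(srcset):
    """Map each width descriptor (e.g. "400w") to its URL."""
    candidates = {}
    # Candidates are comma-separated; str.split() with no arguments
    # collapses the newlines and indentation inside the attribute value.
    for candidate in srcset.split(","):
        parts = candidate.split()
        if len(parts) == 2:  # skip anything that is not "URL descriptor"
            url, descriptor = parts
            candidates[descriptor] = url
    return candidates


def pick_from_srcset(srcset, wanted="400w"):
    """Return the URL for the wanted width, or None if it is absent."""
    return parse_srcset(srcset).get(wanted)


# srcset value shortened from the HTML excerpt in the question
example = """https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
    https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
    https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"""

print(pick_from_srcset(example))
# prints https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg
```

Inside the loop you would call `pick_from_srcset(item.get("srcset", ""))` for each image. Both html.parser and lxml lowercase attribute names, so the page's `srcSet` attribute is read as `srcset`.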