python image web-scraping beautifulsoup logos

Web scraping for PNG images


I am working on scraping company logos from their websites. I have 12 million company records, each with a domain name.

I first try to scrape the logo from the company's website; if the site forbids access, I fall back to the company's Wikipedia page.

This is my code, which I have written separately for the domain names and for the Wikipedia page.

from urllib.request import urlopen
from bs4 import BeautifulSoup
  
htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
soup = BeautifulSoup(htmldata, 'html.parser')
images = soup.find_all('img')
  
for item in images:
    print(item['src'])

The code above fetches the page for only one company and prints every image source on the Wikipedia page. However, I need to fetch only the logos and scale this to many companies.

Output from the above code looks like this:

//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/6/66/Pepsi_355ml.png/150px-Pepsi_355ml.png
//upload.wikimedia.org/wikipedia/commons/thumb/2/21/HMB_Bern_New_Bern_Caleb_Bradham.jpg/220px-HMB_Bern_New_Bern_Caleb_Bradham.jpg

However, I need to fetch only the image sources that contain company logos. Expected output:

upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png

Please help me store the link in a DataFrame along with the domain name.


Solution

  • You can keep only the image URLs that contain the substring 'logo', for example:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
      
    htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
    soup = BeautifulSoup(htmldata, 'html.parser')
    images = soup.find_all('img')
      
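    for item in images:
        # The src values are protocol-relative ('//upload...'), so prepend the scheme
        img = 'https:' + item['src']
        # Keep only URLs that mention 'logo'
        if 'logo' in img:
            print(img)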
    

    Output:

    https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
    https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_Cola_logo_1902.svg/90px-Pepsi_Cola_logo_1902.svg.png
    https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Pepsi_Cola_logo_1940.svg/90px-Pepsi_Cola_logo_1940.svg.png
    https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Pepsi_logo.svg/220px-Pepsi_logo.svg.png
    https://upload.wikimedia.org/wikipedia/commons/thumb/e/e9/Pepsi_logo_2008.svg/220px-Pepsi_logo_2008.svg.png
    https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/30px-Commons-logo.svg.png
    https://upload.wikimedia.org/wikipedia/commons/thumb/a/a6/PepsiCo_logo.svg/130px-PepsiCo_logo.svg.png
    https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png
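
The output above still contains Wikipedia's own Commons-logo icon and several historical logos, and the question also asks for one link per domain. A minimal sketch of one way to handle that follows. It is not a definitive implementation: the names pick_logo_urls, build_rows, SKIP_MARKERS and the key 'pepsi.com' are made up for illustration, and treating the first match as the company logo is only a heuristic that happens to fit the Pepsi page. The sketch works on plain lists of src strings so it can be tried without network access.

```python
from urllib.parse import urljoin

# Assumption: substrings that mark site icons rather than company logos.
SKIP_MARKERS = ("commons-logo",)


def pick_logo_urls(page_url, srcs):
    """Return absolute, de-duplicated URLs from srcs that mention 'logo'."""
    urls = []
    for src in srcs:
        if not src:  # <img> tags without a usable src
            continue
        # urljoin turns '//host/path' and relative paths into full URLs
        absolute = urljoin(page_url, src)
        lowered = absolute.lower()
        if "logo" not in lowered:
            continue
        if any(marker in lowered for marker in SKIP_MARKERS):
            continue
        if absolute not in urls:
            urls.append(absolute)
    return urls


def build_rows(pages):
    """pages maps a domain name to (page_url, list of <img> src values)."""
    rows = []
    for domain, (page_url, srcs) in pages.items():
        logos = pick_logo_urls(page_url, srcs)
        # Heuristic: treat the first match as the company logo, None if no match
        rows.append({"domain": domain, "logo_url": logos[0] if logos else None})
    return rows


# 'pepsi.com' is a made-up key; the src values are modeled on the outputs above
pages = {
    "pepsi.com": (
        "https://en.wikipedia.org/wiki/Pepsi",
        [
            "//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png",
            "//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png",
            "//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/30px-Commons-logo.svg.png",
        ],
    ),
}
rows = build_rows(pages)
print(rows)
```

In a real run, the src list for each page could come from [img.get('src') for img in soup.find_all('img')], and pandas.DataFrame(rows) would turn the rows into a table with domain and logo_url columns, assuming pandas is installed.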