I am working on scraping company logos from their websites. I have 12 million company records with their domain names.
I try to scrape the logo from the company's website first; if the website returns a forbidden response, I fall back to the company's Wikipedia page.
This is the code I have so far. I have worked on the domain names and the Wikipedia pages separately.
from urllib.request import urlopen
from bs4 import BeautifulSoup
htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
soup = BeautifulSoup(htmldata, 'html.parser')
images = soup.find_all('img')
for item in images:
    print(item['src'])
The above code fetches the page for just one company and prints every image source on its Wikipedia page. However, I need to fetch only the logos and scale this to many companies.
Output from the above code looks like this:
//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/6/66/Pepsi_355ml.png/150px-Pepsi_355ml.png
//upload.wikimedia.org/wikipedia/commons/thumb/2/21/HMB_Bern_New_Bern_Caleb_Bradham.jpg/220px-HMB_Bern_New_Bern_Caleb_Bradham.jpg
However, I need to fetch only the image sources that are company logos. Expected output:
upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
Please help me store the link in a DataFrame along with the domain name.
You can try something like the following, which keeps only the image URLs that contain 'logo':
from urllib.request import urlopen
from bs4 import BeautifulSoup
htmldata = urlopen('https://en.wikipedia.org/wiki/Pepsi')
soup = BeautifulSoup(htmldata, 'html.parser')
images = soup.find_all('img')
for item in images:
    img = 'https:' + item['src']  # Wikipedia image sources are protocol-relative ('//upload...')
    if 'logo' in img:  # keep only URLs that mention 'logo'
        print(img)
Output:
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_Cola_logo_1902.svg/90px-Pepsi_Cola_logo_1902.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Pepsi_Cola_logo_1940.svg/90px-Pepsi_Cola_logo_1940.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Pepsi_logo.svg/220px-Pepsi_logo.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e9/Pepsi_logo_2008.svg/220px-Pepsi_logo_2008.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/30px-Commons-logo.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a6/PepsiCo_logo.svg/130px-PepsiCo_logo.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png
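As the output shows, the plain 'logo' check also matches Wikipedia's own Commons-logo icon, and the question asks for one link per domain. Below is a rough sketch of one way to narrow this down, not a definitive solution. The names `SITE_ICONS`, `pick_logo_urls` and `build_row`, and the domain `pepsi.com`, are made up for illustration. It also assumes that the first remaining match is the main logo, which holds for the Pepsi page above but may not hold for every company. The `srcs` list stands in for the `item['src']` values collected by the loop above.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical exclusion list: Wikimedia's own icons that also contain "logo".
SITE_ICONS = ("commons-logo",)


def pick_logo_urls(srcs, page_url):
    """Return absolute URLs whose file name contains 'logo', skipping site icons."""
    urls = []
    for src in srcs:
        # urljoin resolves protocol-relative ('//host/path') and relative sources.
        url = urljoin(page_url, src)
        filename = urlsplit(url).path.rsplit("/", 1)[-1].lower()
        if "logo" not in filename:
            continue
        if any(icon in filename for icon in SITE_ICONS):
            continue
        if url not in urls:  # drop exact duplicates, keep page order
            urls.append(url)
    return urls


def build_row(domain, srcs, page_url):
    """Build one record per domain; assumes the first match is the main logo."""
    logos = pick_logo_urls(srcs, page_url)
    return {"domain": domain, "logo_url": logos[0] if logos else None}


# Sample sources copied from the outputs above; "pepsi.com" is a placeholder domain.
srcs = [
    "//upload.wikimedia.org/wikipedia/en/thumb/6/6c/Wiki_letter_w.svg/40px-Wiki_letter_w.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Pepsi_logo_2014.svg/140px-Pepsi_logo_2014.svg.png",
    "//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/30px-Commons-logo.svg.png",
]
rows = [build_row("pepsi.com", srcs, "https://en.wikipedia.org/wiki/Pepsi")]
print(rows)
```

If pandas is installed, `pandas.DataFrame(rows)` turns the list of records into a DataFrame with `domain` and `logo_url` columns. For 12 million domains you would append one row per company inside your own loop and write the results out in batches.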