I am trying to pull the ingredients list from the following webpage:
https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/
So the first ingredient I want to pull would be Acetylated Lanolin, and the last ingredient would be Octyl Palmitate.
Looking at the page source for this URL, I learn that the pattern for the ingredients list looks like this:
<td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>
So I wrote some code to pull the list, and it is giving me zero results. Below is the code.
import requests
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign':'top'})
When I try len(results), it gives me zero.
What am I doing wrong? Why am I not able to pull the list as intended? I am new to web scraping.
Your parsing code is fine; the request itself is what failed. If you check the status code of the response, you will see that the server returned a 403 (Forbidden):
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
print(r.status_code) # 403
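A failed request like this is easy to miss, because requests.get() does not raise on a 403 by itself; you simply get a response whose body is not the page you expected. One way to catch this early is to call raise_for_status() before parsing. The sketch below assumes a helper named fetch_html and a 10-second timeout, both of which are my own choices and not part of your code:

```python
import requests

def fetch_html(url, headers=None):
    """Hypothetical helper: return the page body, or raise on a 4xx/5xx reply."""
    # timeout=10 is an arbitrary illustrative value
    r = requests.get(url, headers=headers, timeout=10)
    # Raises requests.HTTPError for 4xx/5xx codes such as 403,
    # instead of silently returning an error page.
    r.raise_for_status()
    return r.text
```

With this in place, a blocked request stops the script with an HTTPError rather than producing an empty result list further down.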
The server appears to reject requests that do not look like they come from a browser. By default, requests identifies itself with a python-requests User-Agent, which is easy for a server to block. To make it work, send a User-Agent header similar to what a browser would send:
import requests
from bs4 import BeautifulSoup

url = 'https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/'

# Present a browser-like User-Agent instead of the python-requests default
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/56.0.2924.76 Safari/537.36')
}
r = requests.get(url, headers=headers)

# Same parsing code as in the question
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
print(len(results))
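Once the cells come back, you will probably want the ingredient names without the trailing rating in the <sup> tag. Here is a minimal sketch of one way to do that, run against a small hand-written sample instead of the live page. The function name extract_ingredients and the sample markup are assumptions for illustration: the first cell is the pattern from your question, and the second cell deliberately has no <sup> to show that case being handled.

```python
from bs4 import BeautifulSoup

# Illustrative sample only; on the real page you would pass r.text instead.
sample_html = """
<table>
  <tr>
    <td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>
    <td valign="top" width="33%">Octyl Palmitate</td>
  </tr>
</table>
"""

def extract_ingredients(html):
    """Hypothetical helper: return (name, rating) pairs from <td valign="top"> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    ingredients = []
    for cell in soup.find_all('td', attrs={'valign': 'top'}):
        sup = cell.find('sup')
        rating = None
        if sup is not None:
            rating = sup.get_text(strip=True)
            sup.extract()  # drop the rating so only the name is left in the cell
        name = cell.get_text(strip=True)
        if name:  # skip empty cells
            ingredients.append((name, rating))
    return ingredients

print(extract_ingredients(sample_html))
```

The rating is kept as a string (or None when the cell has no <sup>), so you can decide later whether to convert it to a number.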