I am trying to scrape online job listing websites for my Coursera project. I keep getting a 403 error, which, after searching for its meaning online, I learned means the website has anti-scraping protection. Does anyone know a countermeasure for this?
PS: I have tried scraping the Indeed and We Work Remotely websites and get the same error after running my code. Here's my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://weworkremotely.com/remote-jobs'

# We Work Remotely website blocks traffic from non-browsers, so we add extra parameters
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

# Send a request to the website and get the HTML content
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    # Creating empty lists to store the data
    job_titles = []
    companies = []
    locations = []
    job_links = []

    job_sections = soup.find_all('section', class_='jobs')
    for section in job_sections:
        jobs = section.find_all('li', class_='feature')  # Ensure this class matches the site's HTML
        for job in jobs:
            # Job title
            title_tag = job.find('span', class_='title')
            title = title_tag.text.strip() if title_tag else 'N/A'
            job_titles.append(title)

            # Company name
            company_tag = job.find('span', class_='company')
            company = company_tag.text.strip() if company_tag else 'N/A'
            companies.append(company)

            # Location
            location_tag = job.find('span', class_='region company')
            location = location_tag.text.strip() if location_tag else 'Remote'
            locations.append(location)

            # Job link
            job_link_tag = job.find('a', href=True)
            job_link = 'https://weworkremotely.com' + job_link_tag['href'] if job_link_tag else 'N/A'
            job_links.append(job_link)

    # Create a DataFrame using the extracted data
    job_data = pd.DataFrame({
        'Job Title': job_titles,
        'Company': companies,
        'Location': locations,
        'Job Link': job_links
    })

    # Save the data to a CSV file
    job_data.to_csv('we_work_remotely_jobs.csv', index=False)
    print("Job listings have been successfully saved to we_work_remotely_jobs.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
To address a 403 Forbidden error when web scraping:
Use a Valid User-Agent: Set a common User-Agent header to mimic a browser.
Use Proxies: Rotate IP addresses using proxies to avoid IP blocking.
Respect robots.txt: Check and follow the website's scraping rules.
Add Delays: Introduce delays between requests to mimic human behavior.
Handle JavaScript: Use tools like Selenium for websites with JavaScript-rendered content.
Source: ScrapingBee - How to Handle a 403 Forbidden Error in Web Scraping
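For the User-Agent and delay tips, here is a minimal sketch. A User-Agent alone is often not enough, since many sites also check for headers a real browser always sends (Accept, Accept-Language, etc.). The helper names (`browser_headers`, `polite_get`) and the delay bounds are illustrative choices, not part of any library.

```python
import random
import time


def browser_headers(user_agent=None):
    # Build a fuller set of browser-like headers; sending only a
    # User-Agent is a common reason a scraper still gets a 403.
    return {
        'User-Agent': user_agent or (
            'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) '
            'Gecko/20100101 Firefox/128.0'
        ),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }


def polite_get(session, url, min_delay=2.0, max_delay=5.0):
    # Sleep a random interval before each request to mimic human pacing,
    # then fetch through the provided requests.Session.
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, headers=browser_headers(), timeout=30)
```

You would create a `requests.Session()` once and call `polite_get(session, url)` inside your crawl loop; a session also keeps cookies across requests, which some sites expect from real browsers.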
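For the proxy tip, a sketch of the rotation logic is below. The proxy URLs are placeholders (real ones would come from a proxy provider); `requests` accepts the resulting dict via its `proxies=` parameter.

```python
import itertools

# Hypothetical proxy endpoints -- substitute real ones from your provider.
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy_config():
    # requests expects a dict mapping URL scheme -> proxy URL, e.g.
    # requests.get(url, proxies=next_proxy_config())
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}
```

Rotating proxies spreads requests across IP addresses so a per-IP block on one address does not stop the whole crawl.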
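For the robots.txt tip, the standard library's `urllib.robotparser` can check whether a URL is allowed. The sketch below parses robots.txt text you have already fetched (e.g. with `requests.get(site + '/robots.txt').text`) rather than fetching it itself:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    # Parse already-fetched robots.txt text and ask whether the given
    # user agent may fetch the given URL.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Checking robots.txt first (and honoring any Crawl-delay) keeps the scraper within the site's stated rules, which also reduces the chance of being blocked.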