I am trying to scrape online job listing websites for my Coursera project. I keep getting a 403 error, which, after searching for its meaning online, I learned means the website has anti-scraping protection. Does anyone know a countermeasure for this?
PS: I have tried scraping the Indeed and We Work Remotely websites and get the same error after running my code. Here's my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://weworkremotely.com/remote-jobs'

# We Work Remotely website blocks traffic from non-browsers, so we add extra parameters
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

# Send a request to the website and get the HTML content
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    # Creating empty lists to store the data
    job_titles = []
    companies = []
    locations = []
    job_links = []

    job_sections = soup.find_all('section', class_='jobs')
    for section in job_sections:
        jobs = section.find_all('li', class_='feature')  # Ensure this class matches the site's HTML
        for job in jobs:
            # Job title
            title_tag = job.find('span', class_='title')
            title = title_tag.text.strip() if title_tag else 'N/A'
            job_titles.append(title)

            # Company name
            company_tag = job.find('span', class_='company')
            company = company_tag.text.strip() if company_tag else 'N/A'
            companies.append(company)

            # Location
            location_tag = job.find('span', class_='region company')
            location = location_tag.text.strip() if location_tag else 'Remote'
            locations.append(location)

            # Job link
            job_link_tag = job.find('a', href=True)
            job_link = 'https://weworkremotely.com' + job_link_tag['href'] if job_link_tag else 'N/A'
            job_links.append(job_link)

    # Create a DataFrame using the extracted data
    job_data = pd.DataFrame({
        'Job Title': job_titles,
        'Company': companies,
        'Location': locations,
        'Job Link': job_links
    })

    # Save the data to a CSV file
    job_data.to_csv('we_work_remotely_jobs.csv', index=False)
    print("Job listings have been successfully saved to we_work_remotely_jobs.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
To address a 403 Forbidden error when web scraping:
Use a Valid User-Agent: Set a common User-Agent header to mimic a browser.
Use Proxies: Rotate IP addresses using proxies to avoid IP blocking.
Respect robots.txt: Check and follow the website's scraping rules.
Add Delays: Introduce delays between requests to mimic human behavior.
Handle JavaScript: Use tools like Selenium for websites with JavaScript-rendered content.
Source: ScrapingBee - How to Handle a 403 Forbidden Error in Web Scraping
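For the User-Agent and delay tips, here is a minimal sketch. A User-Agent alone is often not enough, since many sites also check for headers a real browser always sends (Accept, Accept-Language, etc.). The helper names (`browser_headers`, `polite_get`) and the delay bounds are illustrative choices, not part of any library.

```python
import random
import time


def browser_headers(user_agent=None):
    # Build a fuller set of browser-like headers; sending only a
    # User-Agent is a common reason a scraper still gets a 403.
    return {
        'User-Agent': user_agent or (
            'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) '
            'Gecko/20100101 Firefox/128.0'
        ),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }


def polite_get(session, url, min_delay=2.0, max_delay=5.0):
    # Sleep a random interval before each request to mimic human pacing,
    # then fetch through the provided requests.Session.
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, headers=browser_headers(), timeout=30)
```

You would create a `requests.Session()` once and call `polite_get(session, url)` inside your crawl loop; a session also keeps cookies across requests, which some sites expect from real browsers.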
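For the proxy tip, a sketch of the rotation logic is below. The proxy URLs are placeholders (real ones would come from a proxy provider); `requests` accepts the resulting dict via its `proxies=` parameter.

```python
import itertools

# Hypothetical proxy endpoints -- substitute real ones from your provider.
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy_config():
    # requests expects a dict mapping URL scheme -> proxy URL, e.g.
    # requests.get(url, proxies=next_proxy_config())
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}
```

Rotating proxies spreads requests across IP addresses so a per-IP block on one address does not stop the whole crawl.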
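For the robots.txt tip, the standard library's `urllib.robotparser` can check whether a URL is allowed. The sketch below parses robots.txt text you have already fetched (e.g. with `requests.get(site + '/robots.txt').text`) rather than fetching it itself:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    # Parse already-fetched robots.txt text and ask whether the given
    # user agent may fetch the given URL.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Checking robots.txt first (and honoring any Crawl-delay) keeps the scraper within the site's stated rules, which also reduces the chance of being blocked.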