Tags: python, selenium-webdriver, web-scraping, beautifulsoup

How to scrape all data from LinkedIn using Python in incognito mode


I am working on a Python project in which I scrape data from LinkedIn using Selenium and BeautifulSoup. My program works fine, but it gets only 25 results instead of all of them. I have gone through previous answers to this, which suggest using the page number. The problem is that more data is loaded by scrolling to the bottom of the page, and the newly loaded data does not change the page number. Also, after scrolling for some time, you get a "load more" button which has to be clicked to load more items. Here is my program:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from webdriver_manager.chrome import ChromeDriverManager

    from bs4 import BeautifulSoup as beauty
    import requests

    chrome_options = Options()
    chrome_options.add_argument("--headless=new")
    chrome_options.add_argument("--incognito")

    url_link = 'https://www.linkedin.com/jobs/search/?currentJobId=3187861296&geoId=102713980&keywords=mckinsey&location=India&refresh=true'

    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=chrome_options
    )
    if url_link.split(".")[1] == "linkedin":
        print(f"{url_link} is found and active. Scraping jobs from it")
        response = requests.get(url_link)
        soup = beauty(response.content, "html.parser")
        jobs = soup.find_all(
            "div",
            class_="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card",
        )
        for job in jobs:
            try:
                job_title = job.find(
                    "h3", class_="base-search-card__title"
                ).text.strip()
                with open("job_title.txt", "a") as f:
                    print(f"Job title is {job_title}", file=f)
                job_company = job.find(
                    "h4", class_="base-search-card__subtitle"
                ).text.strip()
                with open("job_company.txt", "a") as f:
                    print(f"Job company is {job_company}", file=f)
                job_location = job.find(
                    "span", class_="job-search-card__location"
                ).text.strip()
            except Exception as e:
                print(f"warning {e}")
                continue

Any help will be highly appreciated. Thanks.


Solution

  • Looking at your code: you used requests to make the request, and requests does not interpret JavaScript. The additional job cards LinkedIn shows as you go down the page are loaded on demand, so the right thing is to make the request through Selenium itself and let it run the page's JavaScript. You will probably also need to simulate scrolling to the bottom of the page, and click the "load more" button when it appears, to trigger the remaining batches. A sketch of that approach follows.
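
For illustration, here is a minimal sketch of that idea, kept close to the code in the question. The number of scroll rounds, the fixed 2-second waits, and the "See more jobs" button selector (`button.infinite-scroller__show-more-button`, an assumption based on the guest jobs page markup) may all need adjusting:

    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager
    from bs4 import BeautifulSoup as beauty

    chrome_options = Options()
    chrome_options.add_argument("--headless=new")
    chrome_options.add_argument("--incognito")

    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=chrome_options
    )

    url_link = "https://www.linkedin.com/jobs/search/?currentJobId=3187861296&geoId=102713980&keywords=mckinsey&location=India&refresh=true"
    driver.get(url_link)  # Selenium renders the page and runs its JavaScript

    for _ in range(20):  # scroll/click rounds; tune to how many jobs you need
        # Scroll to the bottom so the infinite scroller fetches the next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude wait for the new cards to render

        # After a few batches LinkedIn replaces infinite scroll with a
        # "See more jobs" button; click it whenever it is visible.
        # The class name below is an assumption and may change.
        buttons = driver.find_elements(
            By.CSS_SELECTOR, "button.infinite-scroller__show-more-button"
        )
        if buttons and buttons[0].is_displayed():
            buttons[0].click()
            time.sleep(2)

    # Parse the fully rendered DOM instead of the bare server response.
    soup = beauty(driver.page_source, "html.parser")
    jobs = soup.find_all("div", class_="base-search-card")
    print(f"Found {len(jobs)} job cards")
    driver.quit()

Once the loop finishes, `driver.page_source` holds the fully rendered DOM, so the existing BeautifulSoup parsing from the question will see every loaded card rather than just the first 25.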