Tags: python, selenium, web-scraping, beautifulsoup

BeautifulSoup: href link is undefined


I want to scrape a website, but whenever I reach an <a> tag the link is "job/undefined". I used a POST request to fetch data from the page.

The POST request with its post data is in this code:

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}

postData = {
  'search': 'search',
  'facets[camp_type]':'day_camp',
  'open[choices-made-content]': 'true'}

url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)

soup1 = BeautifulSoup(html_1.text, 'lxml')
a = soup1.select('div.MuiGrid-root MuiGrid-grid-xs-12 ')
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:',b)

Sample from the output:

<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
    <a href="job/undefined" style="color:#413E52;text-decoration:none">
    Network and Security engineer
    </a>
</span>
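
A quick check that the placeholder is already present in the server's raw HTML (a small diagnostic continuing from the code above), so the real hashid must be filled in client-side:

# 'job/undefined' sits in the raw response text itself, not just in the
# parsed tree, so JavaScript must inject the real id after page load
print(html_1.text.count('job/undefined'))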

Solution

  • EDIT

    Part of the content is served dynamically, so you have to fetch the job hashids via the API and then create the links yourself, or use the data from the JSON response:

    import requests
    
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
    url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
    
    # The job objects live under 'included' -> 'jobs', keyed by id
    jobs = requests.get(url, headers=headers).json()['included']['jobs']
    
    # Build each link from the job's hashid
    links = ['https://www.trustme.work/job/' + job['hashid'] for job in jobs.values()]
    print(links)
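
    If the endpoint response ever changes shape, a slightly more defensive variant of the same request avoids silent KeyErrors (a sketch; the 'included', 'jobs' and 'hashid' keys are taken from the response above, the timeout value is an assumption):

    import requests
    
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
    url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
    
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    
    # .get() fallbacks yield an empty dict instead of raising KeyError
    jobs = resp.json().get('included', {}).get('jobs', {})
    links = ['https://www.trustme.work/job/' + job['hashid']
             for job in jobs.values() if 'hashid' in job]
    print(links)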
    

    To get the links from each job post, change your CSS selector to select your elements more specifically; also, prefer static identifiers or the HTML structure over classes:

    .select('h2 a')
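
    For reference against the snippet from the question: multiple classes on one element are chained with dots, while a space in a selector means "descendant", which is why a selector like 'div.MuiGrid-root MuiGrid-grid-xs-12' matches nothing. A minimal, self-contained sketch using the sample output from above:

    from bs4 import BeautifulSoup
    
    # The <span> from the question's output sample
    html = '''
    <span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
        <a href="job/undefined" style="color:#413E52;text-decoration:none">
        Network and Security engineer
        </a>
    </span>
    '''
    
    soup = BeautifulSoup(html, 'lxml')
    # Chain the classes with dots to match one element carrying both
    a = soup.select_one('span.MuiTypography-root.MuiTypography-h2 a')
    print(a.get('href'), '|', a.get_text(strip=True))
    # job/undefined | Network and Security engineer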
    

    To get a list of all links, use a list comprehension:

    ['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
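
    Since the href values are relative (they start with 'job/'), urllib.parse.urljoin is a slightly more robust alternative to string concatenation, e.g. [urljoin(url, a.get('href')) for a in soup1.select('h2 a')]. A minimal sketch:

    from urllib.parse import urljoin
    
    # Resolves a relative href against the page URL; 'abc123' is a
    # placeholder hashid for illustration
    print(urljoin('https://www.trustme.work/en', 'job/abc123'))
    # -> https://www.trustme.work/job/abc123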
    

    Example

    from bs4 import BeautifulSoup
    import requests
    
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
    
    postData = {
      'search': 'search',
      'facets[camp_type]': 'day_camp',
      'open[choices-made-content]': 'true'}
    
    url = 'https://www.trustme.work/en'
    html_1 = requests.post(url, headers=headers, data=postData)
    
    soup1 = BeautifulSoup(html_1.text, 'lxml')
    links = ['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
    print(links)
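
    Note that the anchors in the statically served HTML still carry the 'job/undefined' placeholder (the real hashid is filled in client-side), so for working links combine this selector approach with the hashids fetched from the API in the EDIT above.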