
Python: Can't extract tbody information from website


I want to extract all links from this website: https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#/tab/general

The information I want is stored in the tbody element of the page's HTML.

Every time I try to extract the data I get no result.

from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession

url = "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#complex-searchresult"



session = HTMLSession()
r = session.get(url)
r.html.render()

soup = BeautifulSoup(r.html.html,'html.parser')

print(r.html.search("Details"))

Thank you for your help!


Solution

  • The site uses a backend API to deliver the data. If you open your browser's Developer Tools, go to Network > Fetch/XHR, and refresh the page, you'll see the data load as JSON from a request whose URL is similar to the one you posted.

    You can scrape that data like this; the endpoint returns JSON, which is easy to parse:

    import requests

    # The API rejects requests that lack a Referer header (400 response).
    headers = {
        'Referer': 'https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    }

    # Fetch the first two pages of 20 results each; widen the range for more.
    for page in range(2):
        url = f'https://pflegefinder.bkk-dachverband.de/api/nursing-homes?required=1&statistics=1&maxDistance=0&careType=inpatientCare&limit=20&offset={page * 20}'
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()  # fail loudly on a 400 or other HTTP error
        print(resp.json())
    

    The API checks for a "Referer" header; without it you get a 400 response.
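
Since the question asks for links and the shape of the JSON isn't shown here, the following is only a rough sketch of one way to pull them out. It walks the decoded response generically and keeps every string that looks like an absolute URL. The helper names `page_url` and `collect_links` and the `sample` payload are made up for illustration; the real response may use different keys, or carry relative paths or IDs instead of full URLs, in which case the filter would need adjusting.

```python
from urllib.parse import urlencode

API_URL = "https://pflegefinder.bkk-dachverband.de/api/nursing-homes"


def page_url(page, limit=20):
    """Build the API URL for a zero-based page number (hypothetical helper)."""
    params = {
        "required": 1,
        "statistics": 1,
        "maxDistance": 0,
        "careType": "inpatientCare",
        "limit": limit,
        "offset": page * limit,
    }
    return f"{API_URL}?{urlencode(params)}"


def collect_links(node):
    """Recursively gather every string value that starts with http(s)://."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(collect_links(value))
    elif isinstance(node, list):
        for item in node:
            links.extend(collect_links(item))
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        links.append(node)
    return links


# Made-up payload purely for illustration; the real response shape may differ.
sample = {
    "items": [
        {"name": "Example Home", "website": "https://example.org/a"},
        {"name": "Other Home", "contact": {"url": "http://example.org/b"}},
    ]
}
print(collect_links(sample))
```

To use it against the live API, you would pass the decoded JSON from `requests.get(page_url(page), headers=headers).json()` to `collect_links`, with the same headers as in the snippet above.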