I want to extract all links from this website: https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#/tab/general
The information I want is stored in the tbody of the page's HTML.
Every time I try to extract the data, I get no result.
from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession
url = "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#complex-searchresult"
session = HTMLSession()
r = session.get(url)
r.html.render()
soup = BeautifulSoup(r.html.html,'html.parser')
print(r.html.search("Details"))
Thank you for your help!
The site uses a backend API to deliver the data. If you open your browser's Developer Tools, go to Network > Fetch/XHR, and refresh the page, you'll see the data load as JSON from a request whose URL is similar to the one you posted.
You can scrape that data like this; the API returns JSON, which is easy to parse:
import requests
headers = {
    'Referer': 'https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
# Fetch the first two pages of results, 20 entries per page
for page in range(2):
    url = f'https://pflegefinder.bkk-dachverband.de/api/nursing-homes?required=1&statistics=1&maxDistance=0&careType=inpatientCare&limit=20&offset={page*20}'
    resp = requests.get(url, headers=headers).json()
    print(resp)
The API checks for a "Referer" header; without it, you get a 400 response.
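If you need more than two pages, the loop can be wrapped in small helpers. This is only a minimal sketch, assuming the endpoint, query parameters, and headers shown in the answer; the names `build_url`, `fetch_json`, and `iter_pages` are hypothetical, and because the JSON structure of the response isn't shown here, you have to choose the number of pages yourself.

```python
from urllib.parse import urlencode

API_URL = "https://pflegefinder.bkk-dachverband.de/api/nursing-homes"
HEADERS = {
    # The API answers with a 400 if the Referer header is missing.
    "Referer": (
        "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php"
        "?required=1&statistics=1&searchdata%5BmaxDistance%5D=0"
        "&searchdata%5BcareType%5D=inpatientCare"
    ),
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    ),
}


def build_url(page, limit=20):
    """Return the API URL for a zero-based page number (hypothetical helper)."""
    query = {
        "required": 1,
        "statistics": 1,
        "maxDistance": 0,
        "careType": "inpatientCare",
        "limit": limit,
        "offset": page * limit,
    }
    return f"{API_URL}?{urlencode(query)}"


def fetch_json(url):
    """Fetch one page and parse it as JSON; raises on HTTP errors such as 400."""
    import requests  # imported lazily so build_url can be used without requests

    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()


def iter_pages(num_pages, limit=20, fetch=fetch_json):
    """Yield the parsed response of each page, in order."""
    for page in range(num_pages):
        yield fetch(build_url(page, limit))
```

Calling `for data in iter_pages(2): print(data)` should issue the same two requests as the loop above. The `fetch` parameter is there so the paging logic can be tried out with a stand-in function before hitting the real API.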