I am trying to scrape the link behind the "previous" button on this website. (The end goal is to enrich data for a RAG chatbot.)
https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm
The prev/next buttons are in the top right corner. The link that has to be extracted on the given example subpage would be this one:
href="https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/Prinect/measuring/measuring-3.htm"
I tried the standard way with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
# get full html section
test1 = soup.find(id="browseSeqBack")
print(test1)
# get full html section test 2
test2 = soup.find("div", class_="brs_previous").children
print(test2)
# get link directly test 3
secBackButton = soup.find(id="browseSeqBack")
href = secBackButton.attrs.get('href', None)
print(href)
However, tests 1 and 2 do not return the complete HTML section, and the direct query for the link does not work either. Test 1 returns this section:
<a class="wBSBackButton" data-attr="href:.l.brsBack" data-css="visibility: @.l.brsBack?'visible':'hidden'" data-rhwidget="Basic" id="browseSeqBack">
<span aria-hidden="true" class="rh-hide" data-html="@KEY_LNG.Prev"></span>
Thanks in advance :)
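The actual content sits inside an iframe, and that iframe has a slightly different URL: the path is Prinect/measuring/measuring-4.htm
instead of the fragment #t=Prinect%2Fmeasuring%2Fmeasuring-4.htm.
Everything after the # is a URL fragment, which is never sent to the server, so requests only receives the outer frame page. The href of browseSeqBack is presumably filled in by JavaScript afterwards (note the data-attr="href:.l.brsBack" placeholder in your output), which is why it is missing from the HTML you get back.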
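The mapping from the hash-style URL to the iframe URL can be sketched with the standard library alone. This is only an illustration: the helper name `topic_url_from_hash` is hypothetical, and it assumes the topic path is always carried in the `t` parameter of the fragment, as in the example URL from the question.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def topic_url_from_hash(url):
    """Turn '<base>/#t=<encoded path>' into '<base>/<path>' (hypothetical helper)."""
    parts = urlsplit(url)
    # The fragment looks like a query string: t=Prinect%2Fmeasuring%2F...
    # parse_qs also percent-decodes the value (%2F becomes /).
    params = parse_qs(parts.fragment)
    topic = params.get("t", [None])[0]
    if topic is None:
        # No topic in the fragment: nothing to rewrite.
        return url
    # Drop the fragment to get the base URL, then append the topic path.
    base = parts._replace(fragment="").geturl()
    return urljoin(base, topic)


hash_url = (
    "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/"
    "Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
)
print(topic_url_from_hash(hash_url))
```

The resulting URL is the one that can be passed to requests.get directly.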
You can fetch the content together with the next and previous paths like this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
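# Request the topic page directly; the "#t=" fragment is not needed
base_url = 'https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/'
path = 'Prinect/measuring/measuring-4.htm'
response = requests.get(urljoin(base_url, path))
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
# The previous/next topic paths are stored in <meta> tags in the page head
prev_path = soup.head.select_one('meta[name=brsprev]').get('value')
next_path = soup.head.select_one('meta[name=brsnext]').get('value')
# The paths are relative to the base URL, so join them to get full links
print(f'previous: {urljoin(base_url, prev_path)}')
print(f'next: {urljoin(base_url, next_path)}')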
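Since the end goal is to collect pages for a RAG chatbot, the same meta tags could be followed repeatedly to walk the whole browse sequence. The following is a minimal sketch, not a tested crawler for this site. `BrowseMetaParser`, `walk_forward`, and the `fetch` parameter are hypothetical names. It assumes every topic page carries `brsprev`/`brsnext` meta tags whose `value` is relative to the base URL (as in the snippet above), and it uses the standard library's `HTMLParser` instead of BeautifulSoup only to stay dependency-free.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BrowseMetaParser(HTMLParser):
    """Collects the value of <meta name="brsprev"> and <meta name="brsnext">."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        value = attributes.get("value")
        if name in ("brsprev", "brsnext") and value:
            self.links[name] = value


def walk_forward(base_url, start_path, fetch, limit=1000):
    """Follow brsnext links from start_path; return the visited URLs in order.

    fetch is any callable that takes a URL and returns the page HTML as str.
    Assumption: paths in the meta tags are relative to base_url.
    """
    seen = set()
    visited = []
    path = start_path
    # Stop when there is no next page, on a loop, or at the safety limit.
    while path and path not in seen and len(visited) < limit:
        seen.add(path)
        url = urljoin(base_url, path)
        visited.append(url)
        parser = BrowseMetaParser()
        parser.feed(fetch(url))
        path = parser.links.get("brsnext")
    return visited
```

With requests installed, `fetch` could be something like `lambda url: requests.get(url).text`. Passing the fetch function in keeps the loop easy to try against a few saved pages before pointing it at the live site.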