I'm trying to scrape a webpage after login.
If I use only BeautifulSoup and requests I get
Please enable JavaScript to continue using this application.
So I decided to use requests_html with the following code:
from requests_html import HTMLSession

session = HTMLSession()
session.get(url)
# Keep the login response so there is an object to render.
resp = session.post(loginUrl, data={"email": "email@gmail.com", "password": "Pass123"})
resp.html.render()
But I get the same error or:
pyppeteer.errors.PageError: net::ERR_SSL_VERSION_OR_CIPHER_MISMATCH
So I decided to use selenium, even though I would really prefer requests because the script runs faster.
When I use selenium, it works fine, but when I load selenium's page source into BeautifulSoup, I again get the
Please enable JavaScript to continue using this application.
error page.
Why? The page loads fine in the driver, and I just parse the HTML page source from selenium.
How can I fix both the requests_html and BeautifulSoup errors?
You don't really need either pyppeteer or selenium. You can log in using plain requests and get all the data you want.
The key here is to get the accessToken via the Login endpoint and then use it in subsequent requests.
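The token handoff can be sketched on its own, without any network access. This is a minimal sketch that assumes the decoded login response has the shape {"data": {"accessToken": ...}}, which is how the full script in this answer reads it. The helper names extract_token and bearer_headers are hypothetical, and the token value is made up.

```python
def extract_token(login_json: dict) -> str:
    """Pull the access token out of a decoded login response."""
    data = login_json.get("data") or {}
    token = data.get("accessToken")
    if not token:
        # A failed login would typically land here instead of a KeyError.
        raise ValueError("login response did not contain data.accessToken")
    return token


def bearer_headers(token: str) -> dict:
    """Build the Authorization header sent with later requests."""
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Stand-in for response.json() after a successful login (made-up token).
    fake_login = {"data": {"accessToken": "abc123"}}
    print(bearer_headers(extract_token(fake_login)))
```

Checking for the token explicitly gives a readable error when the credentials are wrong, rather than a bare KeyError.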
The API calls I'm making here are the meat of the page after logging in. The rest of the HTML is just eye-candy. The data coming from the API corresponds to what you see on the site.
As for the pyppeteer.errors.PageError: net::ERR_SSL_VERSION_OR_CIPHER_MISMATCH, this error is typically caused by an SSL/TLS handshake failure. The server you're trying to connect to may be using an outdated or unsupported SSL/TLS version or cipher suite.
You can read more about the error here.
TL;DR: There's not much you can do about it.
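If you want a rough idea of which side of the handshake is the problem, something like the sketch below may help. It only uses Python's standard ssl module, so it describes Python's TLS stack and not the Chromium build that pyppeteer downloads; treat the result as a hint, not a diagnosis. The function names local_tls_summary and probe_tls are hypothetical.

```python
import socket
import ssl
from typing import Optional


def local_tls_summary() -> dict:
    """Describe the TLS stack of the local Python build (no network needed)."""
    context = ssl.create_default_context()
    return {
        "openssl": ssl.OPENSSL_VERSION,
        "minimum_version": context.minimum_version.name,
        "has_tls_1_3": ssl.HAS_TLSv1_3,
    }


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> Optional[str]:
    """Attempt a handshake and return the negotiated protocol name."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()


if __name__ == "__main__":
    print(local_tls_summary())
    # probe_tls("<host>") needs network access and raises ssl.SSLError
    # if the handshake fails.
```

If probe_tls succeeds against the same host, the server can negotiate with a current TLS client, which would point toward the bundled browser as the mismatching side.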
I'd recommend using my approach (no browser, just API calls).
Benefits of the following approach:
Here's how you can get the sale data:
import requests
from dateutil.parser import parse

login_url = "https://api-it.saywow.me/it-it/api/Users/Login"
sales_url = "https://api-it.saywow.me/it-it/api/Booking/GetCanBookSaleEvents"

payload = {
    "email": "YOUR_EMAIL",
    "password": "YOUR_PASSWORD",
}


def format_date(date: str) -> str:
    return parse(date).strftime("%d %B")


def show_sales(sales_data: list) -> None:
    for sale in sales_data:
        event = sale["saleEvent"]["saleEventName"]
        address = sale["saleEvent"]["addressFull"]
        start_date = format_date(sale["saleEvent"]["startDate"])
        end_date = format_date(sale["saleEvent"]["endDate"])
        is_booked = sale["isBooked"]
        template = f"""
Event: {event}
Address: {address}
Dates: {start_date} - {end_date}
Booked: {"Yes!" if is_booked else "You can book this event!"}
"""
        print(template)


def main() -> None:
    with requests.Session() as session:
        # Log in and grab the access token from the JSON response.
        response = session.post(login_url, json=payload)
        token = response.json()["data"]["accessToken"]
        # Send the token as a Bearer header with the follow-up API call.
        sales = session.post(
            sales_url,
            headers={"Authorization": f"Bearer {token}"},
        )
        show_sales(sales.json()["data"])


if __name__ == "__main__":
    main()
If you plug in your registration email and a valid password, you should see this:
Event: HOUSE OF LUXURY
Address: Viale John Fitzgerald Kennedy 54, Napoli NA
Dates: 08 December - 17 December
Booked: You can book this event!
Event: Monot Archive Sale
Address: Via Orobia 11, Milano MI
Dates: 28 November - 06 December
Booked: You can book this event!
There's plenty more in each sales_data entry, like location, phone numbers, etc.
Here's a sample:
...
"addressName": "Via Orobia",
"addressNumber": "11",
"addressCity": "Milano",
"addressProvince": "MI",
"addressZip": "20139",
"addressCountry": "IT",
"addressLat": 45.4426322,
"addressLon": 9.2056631,
...
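As an illustration of using these extra fields, here is a minimal sketch that builds a postal-style address and a coordinate pair from them. It assumes the keys shown in the sample sit together in one dict; where that dict lives inside each sales_data entry is not visible in the sample, so the functions take it directly. The names format_address and coordinates are hypothetical.

```python
def format_address(record: dict) -> str:
    """Join the separate address fields into one readable line."""
    street = f'{record["addressName"]} {record["addressNumber"]}'
    city = f'{record["addressZip"]} {record["addressCity"]} {record["addressProvince"]}'
    return f'{street}, {city}, {record["addressCountry"]}'


def coordinates(record: dict) -> tuple:
    """Return (latitude, longitude) for mapping or distance checks."""
    return (record["addressLat"], record["addressLon"])


if __name__ == "__main__":
    # Values copied from the sample above.
    sample = {
        "addressName": "Via Orobia",
        "addressNumber": "11",
        "addressCity": "Milano",
        "addressProvince": "MI",
        "addressZip": "20139",
        "addressCountry": "IT",
        "addressLat": 45.4426322,
        "addressLon": 9.2056631,
    }
    print(format_address(sample))  # Via Orobia 11, 20139 Milano MI, IT
    print(coordinates(sample))     # (45.4426322, 9.2056631)
```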