Tags: python, selenium-webdriver, web-scraping, beautifulsoup, http-status-code-403

Web Scraping: How to scrape event details from a dynamic website with Python?


I'm trying to scrape event details (event name, date, time, and tags) from the Central Park events calendar at https://www.centralparknyc.org/calendar. The website is dynamic, and it seems that the event details are not loading while scraping.

I've attempted to use requests and BeautifulSoup to scrape the content, but I'm encountering a 403 Forbidden error, which I suspect is due to the website's bot protection measures.

Could someone guide me on how to properly scrape this dynamic content using Python? Any advice on handling dynamic content and bot detection would be greatly appreciated.

import requests
from bs4 import BeautifulSoup

url = 'https://www.centralparknyc.org/calendar'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    # Process the page
    soup = BeautifulSoup(response.content, 'html.parser')
    # ... scraping logic here ...
else:
    print(f'Failed to retrieve the webpage: {response.status_code}')

Solution

  • You can get the data directly from the API that populates the page, which removes the need for BeautifulSoup. If you open your browser's developer tools (Network tab), you will see that clicking to load more events triggers an API call that returns the JSON data you are looking for.

    We can call that endpoint ourselves and tweak the parameters to get more results per call, for example requesting page 1 with 25 elements per page.

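    import requests

    # JSON endpoint the calendar page calls when loading more events
    URL = "https://www.centralparknyc.org/calendar.json"
    params = {"page": 1, "elementsPerPage": 25}

    r = requests.get(URL, params=params, timeout=10)
    r.raise_for_status()  # raise on a 4xx/5xx response instead of failing in r.json()

    print(r.json())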
    
    

    Screenshot (browser developer tools): the API call.

    Note: Remember to read their Terms of Use.
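
To collect more than one page, the same request can be repeated with an increasing `page` value. The following is only a minimal sketch under a few assumptions that the answer does not confirm: that each response is either a bare JSON list or an object holding the list under a key (the name `"events"` is a hypothetical placeholder), and that a page past the end comes back empty. `fetch_page` and `iter_events` are illustrative names, not part of any library.

```python
URL = "https://www.centralparknyc.org/calendar.json"  # endpoint from the answer


def fetch_page(page, per_page=25, timeout=10):
    """Fetch one page of parsed JSON from the calendar endpoint."""
    import requests  # imported here so iter_events can be tried offline with a stub

    r = requests.get(
        URL,
        params={"page": page, "elementsPerPage": per_page},
        timeout=timeout,
    )
    r.raise_for_status()
    return r.json()


def iter_events(fetch=fetch_page, items_key="events", max_pages=50):
    """Yield records page by page until an empty page is returned.

    ASSUMPTION: a response is either a list of records or a dict that
    holds the list under `items_key` ("events" is a hypothetical name).
    `max_pages` is a safety cap in case the endpoint never returns an
    empty page.
    """
    for page in range(1, max_pages + 1):
        data = fetch(page)
        items = data.get(items_key, []) if isinstance(data, dict) else data
        if not items:
            break
        yield from items
```

Calling `next(iter_events())` and printing the keys of that first record is a quick way to find which fields actually hold the event name, date, time, and tags, and whether `items_key` needs adjusting.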