
Can someone help me access a link from a web page using Python?


Specifically, I'm trying to extract the link to the screenwriter's page from a Rotten Tomatoes movie page. So for example, from https://www.rottentomatoes.com/m/dangerous_animals, I'm trying to get https://www.rottentomatoes.com/celebrity/nick_lepard. You have to scroll down the page a little and it's under Movie Info.

I've used BeautifulSoup to pull some other cast links from the page, but can't figure out how to access the screenwriter link. The section with the link I'm trying to get has a bunch of divs and tags with the same IDs, so I couldn't figure out how to single it out.


Solution

  • The site uses static (server-side) rendering, so the link is already present in the HTML that requests downloads. An automated browser isn't necessary; requests and bs4 should be enough to get the data you're after.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    BASE_URL = "https://www.rottentomatoes.com/"
    
    response = requests.get(urljoin(BASE_URL, "/m/dangerous_animals"))
    response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find the "Screenwriter" text node, then go up two levels in the DOM to the <dt> tag.
    dt = soup.body.find(string="Screenwriter").parent.parent
    # Find the adjacent <dd> tag.
    dd = dt.find_next_sibling()
    # Dig down to the <rt-link> tag.
    link = dd.find("rt-link")
    
    print(urljoin(BASE_URL, link["href"]))
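
The chained lookup above raises an AttributeError if the label is missing or the markup changes. Below is a minimal sketch of the same label-then-sibling technique wrapped in a helper that returns an empty list instead. The helper name find_info_links and the SAMPLE_HTML snippet are hypothetical: the snippet is a simplified guess at the Movie Info markup (a <dt> wrapping the label, followed by a <dd> holding <rt-link> tags), not a copy of the live page, so it can be run without a network request.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.rottentomatoes.com/"

# Hypothetical, simplified stand-in for the Movie Info section (assumed structure).
SAMPLE_HTML = """
<dl>
  <div class="category-wrap">
    <dt class="key"><rt-text>Director</rt-text></dt>
    <dd><rt-link href="/celebrity/some_director">Some Director</rt-link></dd>
  </div>
  <div class="category-wrap">
    <dt class="key"><rt-text>Screenwriter</rt-text></dt>
    <dd><rt-link href="/celebrity/nick_lepard">Nick Lepard</rt-link></dd>
  </div>
</dl>
"""


def find_info_links(html, label, base_url=BASE_URL):
    """Return absolute URLs of the <rt-link> tags listed under the given label."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the label text while ignoring surrounding whitespace.
    text = soup.find(string=lambda s: s is not None and s.strip() == label)
    if text is None:
        return []
    # Walk up to the enclosing <dt>, however deeply the text is nested.
    dt = text.find_parent("dt")
    # The value cell is the next <dd> sibling of that <dt>.
    dd = dt.find_next_sibling("dd") if dt is not None else None
    if dd is None:
        return []
    return [urljoin(base_url, link["href"]) for link in dd.find_all("rt-link", href=True)]


print(find_info_links(SAMPLE_HTML, "Screenwriter"))
# ['https://www.rottentomatoes.com/celebrity/nick_lepard']
```

To use it against the real page, pass response.text in place of SAMPLE_HTML. Returning a list also covers movies that credit more than one screenwriter.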