Tags: html, http, web-scraping, url

Programmatically get the movie name from a Hulu URL?


I am using JavaScript.

Is there a programmatic way to fetch the movie name from a Hulu URL?

For example, for the URL

https://www.hulu.com/watch/78974b54-1feb-43ce-9a99-1c1e9e5fce3f

the result should be

My Favorite Girlfriend

The URL itself contains only a UUID. I tried fetching the page and inspecting the HTTP response headers and the HTML meta tags, but found nothing useful.


Solution

  • The document returned from that URL contains a script tag of type application/ld+json (JSON-LD structured data) that holds the information you need:

    <script type="application/ld+json"> 
       {"@context":"http://schema.org","@type":"Movie","name":"My Favorite Girlfriend","description":"A chef's life gets complicated when he falls for a beautiful young woman who has multiple personalities.",
    ...
    </script>
    

    You can parse this with the npm package cheerio and a little JavaScript:

    const cheerio = require('cheerio');
    
    // Fetches the page and returns the "name" field from its JSON-LD metadata,
    // or undefined if the metadata is missing or cannot be parsed.
    // Uses the global fetch available in Node.js 18+.
    const getMovieName = async (url) => {
        const htmlContent = await (await fetch(url)).text();
    
        // Load the HTML content into cheerio
        const $ = cheerio.load(htmlContent);
    
        // Find the first script element with type "application/ld+json"
        const scriptElement = $('script[type="application/ld+json"]').first();
    
        // A cheerio selection is always truthy, so check its length instead
        if (scriptElement.length === 0) {
            console.error('Script element not found');
            return undefined;
        }
    
        try {
            // Parse the JSON content; other properties such as
            // jsonData['@context'] and jsonData['@type'] are available too
            const jsonData = JSON.parse(scriptElement.html());
            return jsonData.name;
        } catch (error) {
            console.error('Error parsing JSON:', error);
            return undefined;
        }
    };
    
    getMovieName("https://www.hulu.com/watch/78974b54-1feb-43ce-9a99-1c1e9e5fce3f")
        .then((name) => console.log(name));
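
A page can carry more than one JSON-LD script tag, and a tag may hold an array instead of a single object, so reading only the first tag may not always find the movie. The sketch below is one way to handle that, under some assumptions: `findMovieName` and `sampleHtml` are hypothetical names, the sample markup is hand-written rather than fetched from Hulu, and a regular expression stands in for cheerio so that the snippet runs without dependencies. A real HTML parser is more robust on markup you do not control.

```javascript
// Hypothetical helper: returns the "name" of the first JSON-LD object
// whose @type is "Movie", or null if no such object is found.
const findMovieName = (html) => {
    // Capture the body of every <script type="application/ld+json"> tag
    const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;

    for (const match of html.matchAll(pattern)) {
        let data;
        try {
            data = JSON.parse(match[1]);
        } catch {
            continue; // Skip blocks that are not valid JSON
        }

        // A block may hold a single object or an array of objects
        const items = Array.isArray(data) ? data : [data];
        const movie = items.find(
            (item) => item && item['@type'] === 'Movie' && typeof item.name === 'string'
        );
        if (movie) {
            return movie.name;
        }
    }
    return null;
};

// Hand-written sample document (an assumption, not real Hulu output)
const sampleHtml = `
<html><head>
<script type="application/ld+json">{"@context":"http://schema.org","@type":"WebSite","name":"Example"}</script>
<script type="application/ld+json">{"@context":"http://schema.org","@type":"Movie","name":"My Favorite Girlfriend"}</script>
</head></html>`;

console.log(findMovieName(sampleHtml)); // prints "My Favorite Girlfriend"
```

To use it on a live page, pass in the text of the fetched response, for example `findMovieName(await (await fetch(url)).text())`.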