Load a SPA webpage via AJAX


I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / Backbone.js to dynamically load most of its content after rendering the initial response.

So for example, when I navigate to the following address:

https://connect.garmin.com/modern/activity/1915361012

And then enter this into the console (after the page has loaded):

var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());

Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:

[Screenshot: working console output]


However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only receive the initial response (the same content that view-source shows):

view-source:https://connect.garmin.com/modern/activity/1915361012

So if I use either of the following AJAX calls:

// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>")
$page.load(url, function(data) {
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim()    );
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

I'll still get the initial footer, but won't get any of the other page contents:

[Screenshot: broken console output]


I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:

jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    $page.find("script").each(function() {
        var scriptContent = $(this).html(); //Grab the content of this tag
        eval(scriptContent); //Execute the content
    });
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

Q: Are there any options for fully loading a webpage so that it can be scraped with JavaScript?


Solution

  • You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.

    The only way I see is to use a headless browser such as PhantomJS, Headless Chrome, or Headless Firefox.

    I wanted to try Headless Chrome so let's see what it can do with your page:

    Quick check using internal REPL

    Load that page with Headless Chrome (you'll need at least Chrome 59 on Mac/Linux or Chrome 60 on Windows), and read the page title with JavaScript from the REPL:

    % chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
    [0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
    >>> $('body').find('.page-title').text().trim() 
    {"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
    

    NB: to get the chrome command working on a Mac, I defined this alias beforehand:

    alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
    

    Using it programmatically with Node & Puppeteer

    Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.

    (Step 0 : Install Node & Yarn if you don't have them)

    In a new directory:

    yarn init
    yarn add puppeteer
    

    Create index.js with this:

    const puppeteer = require('puppeteer');
    (async() => {
        const url = 'https://connect.garmin.com/modern/activity/1915361012';
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
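        // Go to URL and wait for network activity to settle
        // (newer Puppeteer releases accept 'networkidle0' or 'networkidle2', not 'networkidle')
        await page.goto(url, {waitUntil: 'networkidle2'});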
        // Wait for the results to show up
        await page.waitForSelector('.page-title');
        // Extract the results from the page
        const text = await page.evaluate(() => {
            const title = document.querySelector('.page-title');
            return title.innerText.trim();
        });
        console.log(`Found: ${text}`);
        // close() returns a promise, so await it before the script ends
        await browser.close();
    })();
    

    Result:

    $ node index.js 
    Found: Daily Mile - Round 2 - Day 27
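
The script above can be folded into a small reusable helper. What follows is only a sketch under a few assumptions: the function name scrapeText and its parameters are made up for illustration, the Puppeteer module is passed in as an argument instead of being required inside the helper (so it can be tried with a stub), and 'networkidle2' is used as the waitUntil value. The try/finally makes sure the browser is closed even when the selector never shows up.

```javascript
// Hypothetical helper (not part of Puppeteer): returns the trimmed text of the
// first element matching `selector` on a JavaScript-rendered page.
// `puppeteerLib` is whatever require('puppeteer') returns.
async function scrapeText(puppeteerLib, url, selector, launchOptions = {}) {
    // Pass e.g. { headless: false } to drive a full, visible Chrome window
    const browser = await puppeteerLib.launch(launchOptions);
    try {
        const page = await browser.newPage();
        // Navigation counts as finished once at most 2 network
        // connections have been open for at least 500 ms
        await page.goto(url, { waitUntil: 'networkidle2' });
        // The SPA renders asynchronously, so wait for the element itself
        await page.waitForSelector(selector);
        // The callback runs inside the page; its return value comes back to Node
        return await page.$eval(selector, el => el.textContent.trim());
    } finally {
        // Shut the browser down even if one of the steps above throws
        await browser.close();
    }
}

// Usage (assumes `yarn add puppeteer` has been run in this directory):
// scrapeText(require('puppeteer'),
//            'https://connect.garmin.com/modern/activity/1915361012',
//            '.page-title')
//     .then(text => console.log(`Found: ${text}`));
```

Waiting for the selector in addition to network idle is deliberate: the network can go quiet before the client-side code has finished rendering the element you want.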