Get the raw page content with PhantomJS


Is it possible to get the raw HTML of a webpage using PhantomJS, before any JavaScript is executed?

The following script returns the HTML after all scripts have been loaded and executed.

var webPage = require('webpage');
var page = webPage.create();

page.open('http://stackoverflow.com', function (status) {
    var content = page.content;
    console.log('Content: ' + content);
    phantom.exit();
});

Is there also a way to access the initial source of the page?


Solution

  • DOMContentLoaded is the earliest event that is triggered while the page is loading, but it is already too late in your case, because inline JavaScript can be executed before DOMContentLoaded is triggered (think <script>doSomething();</script>).

    The next idea would be to run setInterval(check, 5);, where check tries to determine whether the initial HTML is fully loaded. This doesn't guarantee that no other JavaScript has already run, though, and it is impossible to detect whether the page has finished loading, because page.content always includes </body></html>.

    The obvious solution would be to disable JavaScript entirely with page.settings.javascriptEnabled = false;, but if you do that, you won't be able to access the DOM anymore. The only way to access it would be through page.content or similar properties.

    If you only need the page source, don't use PhantomJS for that. There are many tools for this, such as cURL.
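If the request still has to go through PhantomJS, the javascriptEnabled approach mentioned above could be sketched roughly as follows. This is only a sketch under some assumptions: fetchWithoutScripts is a hypothetical helper name (not part of PhantomJS), the webpage module is passed in as an argument so the helper is easy to try out, and page.content here is the parsed document serialized back to HTML, so it may not match the bytes sent by the server exactly.

```javascript
// Sketch for PhantomJS: read the page markup with page scripts disabled.
// fetchWithoutScripts is a hypothetical helper name, not a PhantomJS API.
function fetchWithoutScripts(webPage, url, callback) {
    var page = webPage.create();

    // Settings only take effect if they are set before page.open() is called.
    page.settings.javascriptEnabled = false;

    page.open(url, function (status) {
        if (status !== 'success') {
            callback(new Error('Failed to load ' + url), null);
            return;
        }
        // With scripts disabled, the DOM cannot be queried from page scripts,
        // so page.content is the only thing read here.
        callback(null, page.content);
    });
}

// Run only inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    fetchWithoutScripts(require('webpage'), 'http://stackoverflow.com', function (err, html) {
        console.log(err ? err.message : 'Content: ' + html);
        phantom.exit(err ? 1 : 0);
    });
}
```

Saved as a script file and started with the phantomjs binary, this should print the markup without the changes that the page's own scripts would make. For byte-exact source, a plain HTTP client such as cURL remains the simpler choice.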