javascript c# anglesharp

How to get the HTML as text using AngleSharp after the JavaScript has finished loading the page?


I am trying to use AngleSharp to crawl a webpage on my localhost. The page is generated dynamically with AngularJS. I load the page with AngleSharp and use the AngleSharp Scripting library to run the JavaScript. Below is my proof-of-concept code. I cannot figure out where to find the HTML of the page after JavaScript rendering is complete.

t.Result.Source.Text gives me the original page source of the webpage. Where can I find the source after the JavaScript has finished rendering? I cannot even tell whether the JavaScript ran at all.

    static void Main(string[] args)
    {
        Task<IDocument> t = StartCrawl();
        t.Wait();
        string textContent = t.Result.Source.Text;
        Console.ReadKey();

    }

    private static async Task<IDocument> StartCrawl()
    {
        var config = Configuration.Default
            .WithDefaultLoader()
            .WithCss()
            .WithJavaScript();

        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync("http://localhost:8000/#!/phones");
        return document;
    }

Viewing the source of the URL gives me the markup below. How can I run all the scripts on the page after the page has loaded? I can see 16 scripts in the document.Scripts property.

<!doctype html>
<html lang="en" ng-app="phonecatApp">
  <head>
    <meta charset="utf-8">
    <title>Google Phone Gallery</title>
    <link rel="stylesheet" href="bower_components/bootstrap/dist/css/bootstrap.css" />
    <link rel="stylesheet" href="app.css" />
    <link rel="stylesheet" href="app.animations.css" />

    <script src="bower_components/jquery/dist/jquery.js"></script>
    <script src="bower_components/angular/angular.js"></script>
    <script src="bower_components/angular-animate/angular-animate.js"></script>
    <script src="bower_components/angular-resource/angular-resource.js"></script>
    <script src="bower_components/angular-route/angular-route.js"></script>
    <script src="app.module.js"></script>
    <script src="app.config.js"></script>
    <script src="app.animations.js"></script>
    <script src="core/core.module.js"></script>
    <script src="core/checkmark/checkmark.filter.js"></script>
    <script src="core/phone/phone.module.js"></script>
    <script src="core/phone/phone.service.js"></script>
    <script src="phone-list/phone-list.module.js"></script>
    <script src="phone-list/phone-list.component.js"></script>
    <script src="phone-detail/phone-detail.module.js"></script>
    <script src="phone-detail/phone-detail.component.js"></script>
  </head>
  <body>

    <div class="view-container">
      <div ng-view class="view-frame"></div>
    </div>

  </body>
</html>


Solution

  • In AngleSharp (as in a browser) there is no notion of "source after JS did something". You can look at the originally transferred source, but I guess that is not what you want.

    If you want the string serialization of the DOM at a particular time (e.g., after a script has manipulated the DOM), just call ToHtml() on the document:

    var currentSource = document.ToHtml(); // current serialization of the DOM
    

    Note that this will represent your DOM in HTML (text) form.

    What you did gives you the original source code:

    var textContent = t.Result.Source.Text; // will always contain the original source
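
    To make the difference concrete, here is a minimal sketch based on the code from the question. It assumes the same 0.9.x-era AngleSharp and AngleSharp.Scripting packages and the same local URL; the class name Program and the helper RunAsync are only illustrative. Since the original markup has an empty ng-view element, looking at its content in the live DOM is one rough way to tell whether the scripts rendered anything.

    ```csharp
    using System;
    using System.Threading.Tasks;
    using AngleSharp;
    using AngleSharp.Dom;
    // Assumption: in the 0.9.x packages the extension methods live here.
    // If your version has no such namespace, drop this line.
    using AngleSharp.Extensions;

    static class Program
    {
        static void Main()
        {
            RunAsync().GetAwaiter().GetResult();
            Console.ReadKey();
        }

        // Illustrative helper, not part of AngleSharp.
        static async Task RunAsync()
        {
            // Same configuration as in the question.
            var config = Configuration.Default
                .WithDefaultLoader()
                .WithCss()
                .WithJavaScript();

            var context = BrowsingContext.New(config);
            IDocument document = await context.OpenAsync("http://localhost:8000/#!/phones");

            // Markup exactly as it was transferred; scripts never change this.
            string originalSource = document.Source.Text;

            // Serialization of the DOM as it is right now.
            string currentDom = document.ToHtml();

            // In the original markup the ng-view element is empty, so any
            // content found here must have been inserted by a script.
            IElement view = document.QuerySelector("[ng-view]");
            bool rendered = view != null && view.InnerHtml.Trim().Length > 0;

            Console.WriteLine("Scripts in document: " + document.Scripts.Length);
            Console.WriteLine("Original source length: " + originalSource.Length);
            Console.WriteLine("Current DOM length: " + currentDom.Length);
            Console.WriteLine("ng-view has content: " + rendered);
        }
    }
    ```

    Do not compare the two strings directly: the serialized DOM can differ from the transferred source even when no script ran, because the parser normalizes the markup. Also keep in mind that an empty ng-view right after OpenAsync does not prove the scripts failed. Angular may still be fetching templates at that point, or the scripting engine may not support everything Angular needs.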