kotlin web-scraping htmlunit headless-browser

Unable to get the `src` attribute of `<video>` with HtmlUnit


I am writing a video scraper for the Rumble website and am trying to read the `src` attribute of the video with HtmlUnit. I am using a headless browser because the element is added to the page dynamically (I am a beginner with these APIs):

    val webClient = WebClient()
    webClient.options.isThrowExceptionOnFailingStatusCode = false
    webClient.options.isThrowExceptionOnScriptError = false
    webClient.options.isJavaScriptEnabled = true

    val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
    Thread.sleep(10000)
    val document: Document = Jsoup.parse(myPage!!.asXml())
    println(document)

The problem is that the `<video>` element in the output looks like this:

    <video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>

Whereas, if you navigate to the page itself and let the JavaScript load, it should be:

    <video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg" src="blob:https://rumble.com/91372f42-30cf-46b3-8850-805ee634e2e8"></video>

Some attributes are missing, and they are crucial for my scraper. I need the `src` value so that ExoPlayer can play the video.

I am not sure, but I wondered whether this is related to the `crossorigin` attribute being set to `anonymous` in the site's JavaScript:

    <video muted playsinline hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="'+t+'"'+(a.vars.opts.cc?' crossorigin="anonymous"':"")+'>

I have experimented with the various HtmlUnit options and searched online, but I still cannot extract the attributes I need.

How can I get the element values (`src`) that my scraper needs using HtmlUnit? Is this even possible with HtmlUnit? I also suspected that the site owners added the `crossorigin="anonymous"` attribute to block scrapers, though I am not sure.

How to reproduce my issue

Open the video page (the URL used in the code above) in a GUI browser.

Use 'Inspect Element' to locate the `<video>` tag and observe that, as expected, it has a `src` attribute pointing to the mp4 file:

    <video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" src="https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&amp;b=0" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg"></video>

Now simulate this with a headless browser: add HtmlUnit and jsoup as dependencies and run the following code in IntelliJ or any other IDE.

Gradle (Kotlin DSL):

    implementation(group = "net.sourceforge.htmlunit", name = "htmlunit", version = "2.64.0")
    implementation("org.jsoup:jsoup:1.15.3")

Gradle (Groovy DSL):

    implementation group: 'net.sourceforge.htmlunit', name: 'htmlunit', version: '2.64.0'
    implementation 'org.jsoup:jsoup:1.15.3'

Then, in the main function:

    val webClient = WebClient()
    webClient.options.isThrowExceptionOnFailingStatusCode = false
    webClient.options.isThrowExceptionOnScriptError = false
    webClient.options.isJavaScriptEnabled = true

    val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
    Thread.sleep(10000)
    val document: Document = Jsoup.parse(myPage!!.asXml())
    println(".....................")
    println(document.getElementsByTag("video").first())

If it throws an exception add this:


    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.html.HtmlScript").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.WindowProxy").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache").setLevel(Level.OFF);

This simply fetches the page with the headless browser, parses the resulting HTML with jsoup, and finds the first `<video>` element.

Observe that the output does not contain the `src` attribute you saw in the GUI browser:

    <video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>


This is the main issue I am having: the `src` attribute of the `<video>` element seems to disappear in the headless browser. I am unsure why, although I suspect it is related to some sort of mp4 codec issue.


Solution

  • Correct, the JavaScript support for the video element was not sufficient for this case.

    I have made a number of fixes and improvements, and the upcoming version 2.66.0 will support this.

    By the way, there is no need to parse the page a second time with jsoup: HtmlUnit has all the methods needed to look deep inside the DOM tree of the current page.

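    // Imports for HtmlUnit 2.x (com.gargoylesoftware package)
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlVideo;

    String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";

    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
        // Keep going when a page script fails instead of aborting the load
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        HtmlPage page = webClient.getPage(url);
        // Give background JavaScript up to 10 seconds to finish
        webClient.waitForBackgroundJavaScript(10_000);

        // Query the live DOM directly; no second parse with jsoup is needed
        HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);

        System.out.println(video.getSrc());
    }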
    

    This code prints https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0, the same as the `src` attribute in the browser.

    Two JavaScript errors are still reported when running this code. They are caused by other scripts (tracking code, I guess). You can avoid them by replacing the responses for these two script locations with empty content, which also makes the code a bit faster.

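    // Imports for HtmlUnit 2.x (com.gargoylesoftware package)
    import java.io.IOException;
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.WebRequest;
    import com.gargoylesoftware.htmlunit.WebResponse;
    import com.gargoylesoftware.htmlunit.WebResponseData;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlVideo;
    import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

    String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";

    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        // ignore some js: the wrapper installs itself on the web client
        new WebConnectionWrapper(webClient) {
            @Override
            public WebResponse getResponse(WebRequest request) throws IOException {
                WebResponse response = super.getResponse(request);
                if (request.getUrl().toExternalForm().contains("sovrn_standalone_beacon.js")
                    || request.getUrl().toExternalForm().contains("r2.js")) {
                    // keep status and headers, but replace the script body with empty content
                    WebResponseData data = new WebResponseData("".getBytes(response.getContentCharset()),
                        response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
                    response = new WebResponse(data, request, response.getLoadTime());
                }
                return response;
            }
        };

        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(10_000);

        HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);

        System.out.println(video.getSrc());
    }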
    

    Thanks for this report. I will announce the new release on https://twitter.com/htmlunit.
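
A note on the URL filter used in the wrapper above: it matches by substring, so a fragment as short as `r2.js` may also match unrelated scripts. The following is a minimal, stdlib-only sketch that isolates that decision so it can be checked without HtmlUnit. The class name `ScriptBlocklist`, both method names, and the example.com URLs are hypothetical and not part of HtmlUnit; only the two fragments come from the answer.

```java
import java.net.URI;
import java.util.List;

/** Illustrative helper (hypothetical, not an HtmlUnit class). */
final class ScriptBlocklist {
    // The two fragments are taken from the answer; everything else is an assumption.
    static final List<String> BLOCKED = List.of("sovrn_standalone_beacon.js", "r2.js");

    /** Substring check, equivalent to the contains() calls in the wrapper above. */
    static boolean containsBlockedFragment(String url) {
        return BLOCKED.stream().anyMatch(url::contains);
    }

    /** Stricter variant: compares only the last path segment of the URL. */
    static boolean hasBlockedFileName(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return false; // opaque URIs such as "mailto:..." have no path
        }
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        return BLOCKED.contains(fileName);
    }
}
```

With the substring check, a URL such as `https://cdn.example.com/js/player2.js` would also be blanked, because `player2.js` ends in `r2.js`. The file-name variant only matches a script that is named exactly `r2.js`, regardless of any query string. Which behaviour is appropriate depends on the site, so treat this as a starting point rather than a recommendation.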