go goquery

Goquery loads empty document from distinctly not empty response


I've been trying to load a response into a goquery document, but it appears to be failing (though it throws no errors).

The response I'm trying to load comes from:

https://www.bbcgoodfood.com/search_api_ajax/search/recipes?sort=created&order=desc&page=4

and while it doesn't throw any errors, when I call fmt.Println(goquery.OuterHtml(doc.Contents())) I get the output:

<html><head></head><body></body></html>

Meanwhile, if I don't attempt to load it into a goquery document, and instead call

s, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(s))

I get:

<!doctype html>
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 no-touch" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9 no-touch" lang="en"> <![endif]-->
<!--[if gt IE 8]> <html class="no-js gt-ie-8 no-touch" lang="en"> <![endif]-->
<!--[if !IE]><!-->
<html class="no-js no-touch" lang="en">
<!--<![endif]-->

<head>
    <meta charset="utf-8">
    <title>Search | BBC Good Food</title>
    <!--[if IE]><![endif]-->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="prev" href="https://www.bbcgoodfood.com/search/recipes?page=3&amp;sort=created&amp;order=desc" />
    <link rel="next" href="https://www.bbcgoodfood.com/search/recipes?page=5&amp;sort=created&amp;order=desc" />
    <meta name="robots" content="noindex" />
    <style>
        .async-hide {
            opacity: 0 !important
        }
    ... etc

The basic logic of what I'm doing is as follows:

package main

import (
    "fmt"
    "net/http"
    "github.com/PuerkitoBio/goquery"
    "io/ioutil"
)

func main() {
    baseUrl := "https://www.bbcgoodfood.com/search_api_ajax/search/recipes?sort=created&order=desc&page="
    i := 4

    // Make a request
    req, _ := http.NewRequest(http.MethodGet, fmt.Sprintf("%s%d", baseUrl, i), nil)

    // Create a new HTTP client and execute the request
    client := &http.Client{}
    resp, _ := client.Do(req)

    // Print out response
    s, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(s))

    // Load into goquery doc
    doc, _ := goquery.NewDocumentFromReader(resp.Body)
    fmt.Println(goquery.OuterHtml(doc.Contents()))
}

The full response can be found here. Is there any particular reason why this won't load?


Solution

  • Go's HTML parser doesn't seem to like the HTML you're getting: the <html> tags are all inside comments, so I think the parser just never gets going.

    If you prepend <html> to the document, everything works fine from there. One way to do that is with a reader wrapper, something like the following, which writes the <html> tag first and then delegates to resp.Body.

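    import "io"
    
    // htmlTag is the prefix delivered before any bytes from the wrapped reader.
    const htmlTag = "<html>\n"
    
    // htmlAddingReader emits htmlTag first, then delegates to source.
    type htmlAddingReader struct {
        sent   int // number of htmlTag bytes already delivered
        source io.Reader
    }
    
    func (r *htmlAddingReader) Read(b []byte) (n int, err error) {
        if r.sent < len(htmlTag) {
            // copy reports how many bytes fit in b, which may be fewer than
            // the whole tag, so return that count rather than len(htmlTag).
            n = copy(b, htmlTag[r.sent:])
            r.sent += n
            return n, nil
        }
        return r.source.Read(b)
    }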
    

    To use this in your sample code, change the final section like so:

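        // Load into goquery doc.
        // resp.Body can only be read once, so drop the earlier ioutil.ReadAll
        // call; otherwise the wrapped reader has nothing left to deliver.
        wrapped := &htmlAddingReader{source: resp.Body}
        doc, _ := goquery.NewDocumentFromReader(wrapped)
        fmt.Println(goquery.OuterHtml(doc.Contents()))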