Tags: web-scraping, phantomjs, screen-scraping, google-chrome-headless, headless-browser

What's the best way to get just the user-readable word content of a page?


Take your average mainstream url like:

https://people.com/books/jay-shetty-announces-new-book-8-rules-of-love/

As a human, it's pretty easy to just copy and paste the text of the article. But is there any standard way in 2023 to extract just the text programmatically?

  1. Using curl to fetch the HTML isn't reliable, because some sites only render the text via JavaScript.

  2. Using PhantomJS (now unmaintained) or another headless browser sounds like the way to go, but then what's the modern technique for extracting just the text and ignoring the non-text?


Solution

  • Going to answer my own question and recommend chromedp in Go. With chromedp.WaitReady("body") followed by chromedp.Nodes("//p[text()] | //li[text()]", &res), all the page's JavaScript executes first, and then you can read the p and li text elements like so.

    package main
    
    import (
        "context"
        "fmt"
        "log"
    
        "github.com/chromedp/cdproto/cdp"
        "github.com/chromedp/chromedp"
    )
    
    func main() {
        url := "https://anyurl.com"
    
        ctx, cancel := chromedp.NewContext(context.Background())
        defer cancel()
    
        // run task list
        var res []*cdp.Node
        err := chromedp.Run(ctx,
            chromedp.Navigate(url),
            chromedp.WaitReady("body"),
            chromedp.Nodes("//p[text()] | //li[text()]", &res),
        )
        if err != nil {
            log.Fatal(err)
        }
    
    for _, item := range res {

        // One CDP round-trip per node; check the error rather
        // than silently discarding it.
        var innerHTML string
        if err := chromedp.Run(ctx,
            chromedp.InnerHTML(item.FullXPath(), &innerHTML),
        ); err != nil {
            log.Fatal(err)
        }

        fmt.Println(innerHTML)
    }
    }
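
If you only need the text (not the HTML), you can avoid one round-trip per node by collecting everything in a single chromedp.Evaluate call that runs in the page context. This is a sketch under the same assumptions as above (chromedp and a local Chrome install); the joinParagraphs helper is my own illustration, not part of chromedp:

    package main
    
    import (
        "context"
        "fmt"
        "log"
        "strings"
    
        "github.com/chromedp/chromedp"
    )
    
    // joinParagraphs trims each fragment and joins the non-empty
    // ones with blank lines, approximating readable article flow.
    func joinParagraphs(fragments []string) string {
        var kept []string
        for _, f := range fragments {
            if s := strings.TrimSpace(f); s != "" {
                kept = append(kept, s)
            }
        }
        return strings.Join(kept, "\n\n")
    }
    
    func main() {
        url := "https://anyurl.com"
    
        ctx, cancel := chromedp.NewContext(context.Background())
        defer cancel()
    
        // Collect the text of every <p> and <li> in one Evaluate
        // call instead of one InnerHTML round-trip per node.
        var fragments []string
        err := chromedp.Run(ctx,
            chromedp.Navigate(url),
            chromedp.WaitReady("body"),
            chromedp.Evaluate(
                `[...document.querySelectorAll("p, li")].map(e => e.innerText)`,
                &fragments,
            ),
        )
        if err != nil {
            log.Fatal(err)
        }
    
        fmt.Println(joinParagraphs(fragments))
    }

The trade-off versus the Nodes approach is that you lose the cdp.Node handles (and their XPaths), but for "just give me the words" that's usually fine.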