Take a typical mainstream URL such as:
https://people.com/books/jay-shetty-announces-new-book-8-rules-of-love/
As a human, it's easy to copy and paste the text of the article. But is there a standard way in 2023 to get just the text programmatically?
Using curl to fetch the HTML isn't enough, because some sites only render the text via JavaScript, which curl never executes.
Using PhantomJS or a headless browser sounds like the way to go, but then what's the modern technique for getting just the text and ignoring everything else?
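To make the curl limitation concrete, here is a minimal Go sketch that fetches a page without running its JavaScript. The page, the `app` container id, and the helper names (`newDemoServer`, `fetchRaw`, `containerIsEmpty`) are all made up for illustration; the local test server only stands in for a real site.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// jsOnlyPage is a hypothetical page whose article text is written into
// the #app container by a script, so it is absent from the static markup.
const jsOnlyPage = `<html><body><div id="app"></div>
<script>document.getElementById("app").textContent = "Article text";</script>
</body></html>`

// newDemoServer starts a local server that always returns jsOnlyPage.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, jsOnlyPage)
	}))
}

// fetchRaw returns the response body exactly as the server sent it,
// much like curl would. No JavaScript is executed.
func fetchRaw(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// containerIsEmpty reports whether the #app container is still empty
// in the given markup, meaning the script has not filled it in.
func containerIsEmpty(raw string) bool {
	return strings.Contains(raw, `<div id="app"></div>`)
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	raw, err := fetchRaw(srv.URL)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("container empty in raw HTML:", containerIsEmpty(raw))
}
```

The raw response still has an empty container because nothing ran the script, which is the gap a headless browser is meant to close.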
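I'm going to answer my own question and recommend chromedp in Go. If you run chromedp.WaitReady("body") followed by chromedp.Nodes("//p[text()] | //li[text()]", &res),
the page is loaded in headless Chrome, so its JavaScript runs before you query the DOM, and you can then read the p and li elements that contain text, like so. (Note that WaitReady("body") only waits for the body element, so content injected later may need a wait on a more specific selector.)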
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/cdproto/cdp"
	"github.com/chromedp/chromedp"
)

func main() {
	url := "https://anyurl.com"

	// Create a chromedp context; the browser starts on the first Run.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Load the page, wait for <body>, then collect every p and li
	// element that has a text node as a direct child.
	var res []*cdp.Node
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.Nodes("//p[text()] | //li[text()]", &res),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Print the inner HTML of each matched node.
	for _, item := range res {
		var innerHTML string
		err := chromedp.Run(ctx,
			chromedp.InnerHTML(item.FullXPath(), &innerHTML),
		)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(innerHTML)
	}
}
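One caveat: InnerHTML returns markup, so a paragraph with links or emphasis will still contain tags and entities. As a rough sketch of one way to reduce that to plain text using only the standard library, the hypothetical helper `stripTags` below drops everything between `<` and `>` and decodes entities. It is not a real HTML parser and assumes simple, well-formed inline markup; a proper parser such as golang.org/x/net/html would be more robust.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripTags removes anything between '<' and '>' and decodes HTML entities.
// It is a deliberately simple scanner, not a full HTML parser: it assumes
// well-formed inline markup with no '<' or '>' inside attribute values.
func stripTags(fragment string) string {
	var b strings.Builder
	inTag := false
	for _, r := range fragment {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	// Decode entities such as &amp; and collapse runs of whitespace.
	text := html.UnescapeString(b.String())
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	// Hypothetical innerHTML value of the kind the loop above might print.
	fragment := `Read <a href="/books">the <em>new</em> book</a> &amp; more.`
	fmt.Println(stripTags(fragment))
}
```

In the loop above, this would mean printing stripTags(innerHTML) instead of innerHTML.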