For some background, I'm new to Go (3 or 4 days in), but I'm starting to get more comfortable with it.
I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database.) For my problem, an example will be the easiest way to explain it:
<html>
<body>
<h1>
<span class="text">Go </span>
</h1>
<p>
<span class="text">totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<h1>
<span class="text">debugger </span>
</h1>
<p>
<span class="text">should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle </span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
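I'd like to:
1. Get the content of <h1..."text".
2. Prepend it to the content of <p..."text".
3. Do this only for the <p> tag that immediately follows the <h1> tag.
4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like: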
<html>
<body>
<p>
<span class="text">Go totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<p>
<span class="text">debugger should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle</span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
With the code starting off like this,
package main
import (
"fmt"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
html_code := strings.NewReader(`code_example_above`)
doc, _ := goquery.NewDocumentFromReader(html_code)
I know that I can read <h1..."text" with:
h1_tag := doc.Find("h1 .text")
I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:
doc.Find("p .text").Before("h1 .text")
^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".
Then, I found out how to get a step closer to what I want:
doc.Find("p .text").First().Before("h1 .text")
^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).
I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)
My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.
If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.
My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.
I found out something else I can do:
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
nex := s.Next().Text()
fmt.Println(s.Text(), nex, "\n\n")
})
^This prints out what I want: the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc (the *goquery.Selection that it's iterating through). Is that correct?
Or, as mattn pointed out, I could also use doc.Find("h1+p").
I'm still having trouble appending <h1..."text" to <p..."text". I'll post that as another question, because this one can be broken down into multiple questions and mattn already answered one.
I don't know exactly what you're trying to write with goquery, but maybe what you're looking for is the adjacent sibling selector:
h1+p
This returns the p tags that immediately follow an h1 tag.