xml go go-colly

Web scraping with Golang Colly: how to handle an XML path that is not found?


I am using Colly to scrape an e-commerce website. I will loop over many products.

Here is a snippet of my code that gets a sub-title:

    c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
        fmt.Println(e.Text)
    })

However, not all products have a sub-title so the above XML path does not work for all cases.

When I reach a product that does not have a sub-title, my code crashes with the following panic:

    panic: expression must evaluate to a node-set

Here is my code so far:

    c := colly.NewCollector()
    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    // Sub-title
    c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
        fmt.Println(e.Text)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://www.lazada.vn/-i1701980654-s7563711492.html")

Here is what I want (pseudocode):

    c.OnXML("/html/b.....v/h1/1234", func(e *colly.XMLElement) {
        if no error {
            fmt.Println("NO ERROR")
        } else {
            fmt.Println("GOT ERROR")
        }
    })

Solution

  • I think I figured out what went wrong in your code. Let me start from the place where the error is raised. As you can see, the error originates from the panic statement at line 473 of the parse.go file. The xpath package has a method called parseNodeTest that performs the following check:

    func (p *parser) parseNodeTest(n node, axeTyp string) (opnd node) {
        switch p.r.typ {
        case itemName:
            if p.r.canBeFunc && isNodeType(p.r) {
                var prop string
                switch p.r.name {
                case "comment", "text", "processing-instruction", "node":
                    prop = p.r.name
                }
                var name string
                p.next()
                p.skipItem(itemLParens)
                if prop == "processing-instruction" && p.r.typ != itemRParens {
                    checkItem(p.r, itemString)
                    name = p.r.strval
                    p.next()
                }
                p.skipItem(itemRParens)
                opnd = newAxisNode(axeTyp, name, "", prop, n)
            } else {
                prefix := p.r.prefix
                name := p.r.name
                p.next()
                if p.r.name == "*" {
                    name = ""
                }
                opnd = newAxisNode(axeTyp, name, prefix, "", n)
            }
        case itemStar:
            opnd = newAxisNode(axeTyp, "", "", "", n)
            p.next()
        default:
            panic("expression must evaluate to a node-set")
        }
        return opnd
    }
    

    The value of p.r.typ is itemNumber (28), so the switch falls into the default branch and panics. The methods invoked before parseNodeTest (you can see them in the call stack in your IDE) set typ to this value for the literal 1234, and that makes the XPath query invalid. To make it work, you have to remove the trailing 1234 or replace it with a valid value, such as an element name.
    Let me know if this solves your issue, thanks!
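
If you still want the "no error / got error" branching from the question, one option is to turn the panic into an ordinary error with recover. The following is only a minimal sketch using the standard library: safely and brokenQuery are hypothetical names introduced here, and brokenQuery is a stand-in that panics with the same message rather than a real Colly call. It also assumes the panic surfaces on the goroutine that makes the call, which I have not verified for every collector configuration.

```go
package main

import "fmt"

// safely runs fn and converts a panic raised inside it into an error.
// (Hypothetical helper for illustration; not part of Colly or xpath.)
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

// brokenQuery is a stand-in for a call that panics on an invalid
// XPath expression, using the message from the question.
func brokenQuery() {
	panic("expression must evaluate to a node-set")
}

func main() {
	if err := safely(brokenQuery); err != nil {
		fmt.Println("GOT ERROR:", err)
	} else {
		fmt.Println("NO ERROR")
	}
}
```

In a real program the function passed to safely would wrap the call that triggers the XPath evaluation. Note that this only hides the symptom: once the expression itself is valid, the OnXML callback is simply not invoked for pages where nothing matches, so no error handling is needed for products without a sub-title.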