xpath nokogiri scraper

XPath along with nokogiri; tutorials/examples?


I am new to XPath and it seems a bit tricky to me. Sometimes it does not work the way I think it should.

When I scrape data from a website using XPath and Nokogiri, I find it difficult if the website has a complex structure. I use FirePath to get the XPath of an element but sometimes it does not seem to work. I have to remove extra tags added by the browser, like tbody.

I really want to know if there are some good tutorials and examples of XPath and Nokogiri. I could not find much after a Google search.


Solution

  • The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.
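
    As a rough sketch of that iterative narrowing (the HTML fragment, the ids and the class names below are made up for illustration, not taken from any real site), you might start broad and tighten the accessor one step at a time:

    ```ruby
    require 'nokogiri'

    # Hypothetical page fragment, invented for this example.
    html = <<~HTML
      <html><body>
        <div id="nav"><a href="/home">Home</a></div>
        <div id="content">
          <table class="prices">
            <tr><th>Item</th><th>Price</th></tr>
            <tr><td>Apple</td><td>1.20</td></tr>
            <tr><td>Pear</td><td>0.90</td></tr>
          </table>
        </div>
      </body></html>
    HTML

    doc = Nokogiri::HTML(html)

    # Step 1: a short accessor to get into the general vicinity.
    content = doc.at('#content')

    # Step 2: narrow down to the table inside that area.
    table = content.at('table.prices')

    # Step 3: fine-tune to the rows that hold data cells (this skips the header row).
    rows = table.xpath('.//tr[td]')

    items = rows.map { |tr| tr.xpath('./td').map(&:text) }
    p items
    ```

    Checking the result after each step (in irb, for instance) makes it obvious which refinement broke the match when something comes back empty.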

    The second trick is to remember to start your XPath with //, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command line in Linux: it searches at every depth.
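
    Here is a small sketch of the difference, using an inline HTML string invented for the example. It also shows a related gotcha: when you search from a node, // still means "anywhere in the document", so use .// to stay below that node.

    ```ruby
    require 'nokogiri'

    # Made-up markup: one p directly in the outer div, one in a nested div.
    html = '<html><body><div><p>first</p><div><p>nested</p></div></div></body></html>'
    doc = Nokogiri::HTML(html)

    # An absolute path matches only that exact chain of elements from the root.
    from_root = doc.xpath('/html/body/div/p').map(&:text)

    # '/p' asks for a root element named p; the root here is html, so nothing matches.
    root_only = doc.xpath('/p').map(&:text)

    # '//p' finds p elements at any depth.
    anywhere = doc.xpath('//p').map(&:text)

    # From a node, '//p' still searches the whole document; './/p' searches below the node.
    inner = doc.at_xpath('//div/div')
    whole_doc_count = inner.xpath('//p').size
    below_node_count = inner.xpath('.//p').size

    p from_root, root_only, anywhere, whole_doc_count, below_node_count
    ```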

    Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including adding tbody, like you saw. Instead, use Ruby's OpenURI, curl or wget to retrieve the raw source, and look at it in an editor like vi or vim, page through it with less, or cat it to the screen. That way there's no chance the markup has been changed.
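
    To illustrate the tbody problem, here is a sketch with a made-up table whose raw source has no tbody. It assumes Nokogiri's default HTML parser (Nokogiri::HTML, backed by libxml2 on MRI), which leaves the table as it was sent; other parsers may behave differently.

    ```ruby
    require 'nokogiri'

    # Raw source as a server might send it: no tbody anywhere (invented example).
    raw = <<~HTML
      <html><body>
        <table id="t">
          <tr><td>one</td></tr>
          <tr><td>two</td></tr>
        </table>
      </body></html>
    HTML

    doc = Nokogiri::HTML(raw)

    # The kind of path a browser tool reports after the browser has added tbody.
    browser_path = '//table[@id="t"]/tbody/tr/td'
    # The same path written against the raw source.
    raw_path = '//table[@id="t"]/tr/td'
    # A looser path that matches whether or not a tbody is present.
    either_path = '//table[@id="t"]//tr/td'

    browser_hits = doc.xpath(browser_path).map(&:text)
    raw_hits = doc.xpath(raw_path).map(&:text)
    either_hits = doc.xpath(either_path).map(&:text)

    p browser_hits, raw_hits, either_hits
    ```

    Under that assumption the browser-style path finds nothing, while the other two find both cells. Using // between table and tr is one way to stop caring whether tbody is there.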

    Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.
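
    For example, rather than writing one long XPath that pairs every title with its year, you could grab the repeating chunks with a short XPath and let Ruby pick the pieces out of each one. This is only a sketch; the list markup and the title/year class names are hypothetical.

    ```ruby
    require 'nokogiri'

    # Invented markup for the example.
    html = <<~HTML
      <ul id="books">
        <li><span class="title">Dune</span> <span class="year">1965</span></li>
        <li><span class="title">Neuromancer</span> <span class="year">1984</span></li>
      </ul>
    HTML

    doc = Nokogiri::HTML(html)

    # One short XPath finds the chunks; Ruby iterates and does the rest.
    books = doc.xpath('//ul[@id="books"]/li').map do |li|
      {
        title: li.at_xpath('./span[@class="title"]').text,
        year: li.at_xpath('./span[@class="year"]').text.to_i
      }
    end

    p books
    ```

    Each small XPath is easy to test on its own, and a change in the page layout usually means touching only one of them.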

    Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two methods: search and at. Both take either a CSS or an XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, along with at_css and at_xpath, returns the first node that matches the accessor. In both families, the xpath variants accept only an XPath and the css variants accept only a CSS accessor.
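
    A quick sketch of the two families side by side, on a tiny inline document made up for the purpose:

    ```ruby
    require 'nokogiri'

    doc = Nokogiri::HTML('<div><p class="a">one</p><p class="b">two</p></div>')

    # The search family returns a NodeSet of every match.
    all_css = doc.css('p')
    all_xpath = doc.xpath('//p')
    all_either = doc.search('//p') # search accepts CSS or XPath

    # The at family returns only the first matching node.
    first_css = doc.at_css('p')
    first_xpath = doc.at_xpath('//p')
    first_either = doc.at('p.b') # at accepts CSS or XPath

    # When nothing matches, at returns nil and search returns an empty NodeSet.
    missing_node = doc.at('p.c')
    missing_set = doc.search('p.c')

    puts all_css.class
    puts all_css.map(&:text).inspect
    puts first_either.text
    ```

    Because at can return nil, it's worth checking the result before calling text on it when you aren't sure the element exists.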

    Once you have a node, you'll generally want to do one of two things with it: extract an attribute or get its text content. You can get an attribute using node['attribute_name'] and the text using text.
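
    On a single node that looks roughly like this (the link markup is invented for the example):

    ```ruby
    require 'nokogiri'

    doc = Nokogiri::HTML('<a href="/about/" title="About us">About <b>IANA</b></a>')
    link = doc.at('a')

    href = link['href']        # value of an attribute that is present
    missing = link['rel']      # nil, because the tag has no rel attribute
    label = link.text          # text of the node and everything nested inside it

    p href, missing, label
    ```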

    Using those methods we can search for all the links in a page and return their text and related href, using something like:

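    require 'awesome_print'
    require 'nokogiri'
    require 'open-uri'

    # Fetch and parse the page. Current Rubies need URI.open; plain open no longer accepts URLs.
    doc = Nokogiri::HTML(URI.open('http://www.example.com'))

    # Collect [href, text] for every link and pretty-print the first five.
    ap doc.search('a').map { |a| [a['href'], a.text] }[0, 5]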
    

    Which, when this was written, output:

    [
        [0] [
            [0] "/",
            [1] ""
        ],
        [1] [
            [0] "/domains/",
            [1] "Domains"
        ],
        [2] [
            [0] "/numbers/",
            [1] "Numbers"
        ],
        [3] [
            [0] "/protocols/",
            [1] "Protocols"
        ],
        [4] [
            [0] "/about/",
            [1] "About IANA"
        ]
    ]