I am new to XPath and it seems a bit tricky to me; sometimes I find it doesn't work the way I think it should.
When I scrape data from a website using XPath and Nokogiri, I find it difficult if the website has a complex structure. I use FirePath to get the XPath of an element, but sometimes it does not seem to work, and I have to remove extra tags added by the browser, like tbody.
I really want to know if there are some good tutorials and examples for XPath and Nokogiri. I could not find much in a Google search.
The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.
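For example, against a small made-up document (the markup below is purely illustrative), you might start broad and tighten the accessor one step at a time:

require 'nokogiri'

# A tiny invented document, just to show narrowing a selector in stages.
doc = Nokogiri::HTML(<<~HTML)
  <div id="content">
    <table class="prices">
      <tr><td class="item">Widget</td><td class="price">9.99</td></tr>
    </table>
  </div>
HTML

doc.search('#content')                                    # get into the neighborhood
doc.search('#content table.prices')                       # narrow to the table
doc.search('#content table.prices td.price').map(&:text)  # => ["9.99"]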
The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command line in Linux: it searches everywhere.
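A minimal sketch of the difference (the markup is invented):

require 'nokogiri'

doc = Nokogiri::HTML('<html><body><div><p>hi</p></div></body></html>')

doc.xpath('//p').size            # => 1, searches the entire document
doc.xpath('/p').size             # => 0, only matches a <p> at the document root
doc.xpath('/html/body//p').size  # => 1, absolute path down from the root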
Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including adding tbody, like you saw. Instead, use Ruby's OpenURI, or curl or wget, to retrieve the raw source, and look at it with an editor like vi or vim, or less or cat it to the screen. That way there's no chance of the file having been changed.
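For instance, something like this saves the raw source to disk so it can be inspected without any browser fixups (the URL and filename are placeholders):

require 'open-uri'

# Grab the untouched markup and write it out for inspection;
# from the shell, curl -o page.html http://www.example.com does much the same.
raw = URI.open('http://www.example.com').read
File.write('page.html', raw)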
Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.
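As a rough sketch (again with invented markup), grab the interesting chunks with one simple XPath, then let Ruby dig into each chunk:

require 'nokogiri'

doc = Nokogiri::HTML(<<~HTML)
  <table>
    <tr><td>Alice</td><td>30</td></tr>
    <tr><td>Bob</td><td>25</td></tr>
  </table>
HTML

# A short XPath finds the rows; plain Ruby pulls the cells apart.
doc.xpath('//tr').map { |tr| tr.xpath('./td').map(&:text) }
# => [["Alice", "30"], ["Bob", "25"]]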
Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two different methods: search and at. Both take either a CSS or an XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, at_css and at_xpath return the first node that matches the accessor. In all those methods, the ...xpath variants accept an XPath, and the ...css ones take a CSS accessor.
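Here's a small illustration of those methods side by side (markup invented for the example):

require 'nokogiri'

doc = Nokogiri::HTML('<div><p class="a">one</p><p class="a">two</p></div>')

doc.search('p.a').size                # => 2, a NodeSet of both <p> nodes
doc.css('p.a').map(&:text)            # => ["one", "two"]
doc.at('p.a').text                    # => "one", only the first match
doc.at_xpath('//p[@class="a"]').text  # => "one"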
Once you have a node, generally you'll want to do one of two things with it: extract an attribute or get its text/content. You can easily get an attribute using [attribute_to_get] and the text using text.
Using those methods we can search for all the links in a page and return their text and related href, using something like:
require 'awesome_print'
require 'nokogiri'
require 'open-uri'

# Parse the page, then return the href and text of the first five links.
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
ap doc.search('a').map { |a| [a['href'], a.text] }[0, 5]
Which outputs:
[
    [0] [
        [0] "/",
        [1] ""
    ],
    [1] [
        [0] "/domains/",
        [1] "Domains"
    ],
    [2] [
        [0] "/numbers/",
        [1] "Numbers"
    ],
    [3] [
        [0] "/protocols/",
        [1] "Protocols"
    ],
    [4] [
        [0] "/about/",
        [1] "About IANA"
    ]
]