I am trying to scrape URLs from a page that uses JavaScript. Instead of putting links on the page, the authors attached onClick events to a number of table rows, so that clicking a row takes you to the link.
I tried scraping the URLs using Mechanize:
agent = Mechanize.new
page = agent.get(url)
page.links_with(:href => /^https?/).each do |link|
puts link.href
end
But looking for links via an href attribute doesn't work here, because the URLs are on the page as part of the onClick event:
<tr onclick="window.open('/someurl');">
Is there a good way to use Mechanize, or some other gem, to parse the code on the page and extract the URLs embedded in the onClick event?
If there's no good out-of-the-box solution, what might be the best regex to do that? I'm a little new to regex, so not quite able to pull together something on my own yet.
You should use a parser. Regex and HTML/XML don't mix well, because regular expressions are not designed to handle the irregularities that HTML and XML documents contain. Very simple tasks might work with a pattern, but you'll quickly find that patterns are fragile and break easily when the HTML changes.
Mechanize for Ruby uses Nokogiri internally, and Nokogiri is an excellent way to get at those parameters. You can access Mechanize's internal Nokogiri document and, from it, find the <tr> tags:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://somesite.foo.com')
# Find every <tr> that has an onclick attribute and capture the quoted argument.
page.search('tr[onclick]').map{ |n| n['onclick'][/\(['"]([^)]+)['"]\)/, 1] }
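The values pulled out this way are whatever was written in the handler, so a path like /someurl comes back relative to the site. If you need absolute URLs, one option is to resolve each value against the page's address with URI.join from Ruby's standard library. This is only a sketch: the base URL is the same placeholder as above, and the sample paths are made up.

```ruby
require 'uri'

# Assumption: the placeholder site stands in for the URL the page came from.
# With Mechanize you could use page.uri here instead of a literal string.
base = 'http://somesite.foo.com/'

# Made-up sample values, shaped like what the onclick extraction returns.
extracted = ['/someurl', 'http://example.com/elsewhere']

# URI.join resolves a relative path against the base and leaves an
# already-absolute URL unchanged.
absolute = extracted.map { |path| URI.join(base, path).to_s }
p absolute
# => ["http://somesite.foo.com/someurl", "http://example.com/elsewhere"]
```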
If I use Nokogiri directly to parse this fragment:
<tr onclick="window.open('/someurl');">
I can do this:
require 'nokogiri'
page = Nokogiri::HTML(%[<tr onclick="window.open('/someurl');">])
page.search('tr[onclick]').map{ |n| n['onclick'][/\(['"]([^)]+)['"]\)/, 1] }
=> ["/someurl"]
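One caveat about the pattern itself: it assumes the call has a single quoted argument. If a handler passes more, say window.open('/someurl', '_blank'), the capture would run on to the last quote before the closing parenthesis. A stricter pattern that stops at the first closing quote might look like the sketch below. The helper name first_quoted_arg and the sample strings are hypothetical, and it works on the attribute string alone, so it needs nothing beyond plain Ruby.

```ruby
# Hypothetical helper: return the first quoted argument of the call in an
# onclick string, or nil when there is none.
def first_quoted_arg(onclick)
  # (['"]) remembers which quote opened the string; .*? is lazy, so it
  # stops at the first matching closing quote (\1) and ignores later arguments.
  onclick[/\(\s*(['"])(.*?)\1/, 2]
end

p first_quoted_arg(%q{window.open('/someurl');})             # => "/someurl"
p first_quoted_arg(%q{window.open("/a/b", "_blank");})       # => "/a/b"
p first_quoted_arg(%q{return false;})                        # => nil
```

With Nokogiri it would slot into the same map call, as first_quoted_arg(n['onclick']).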
Notice that I'm searching using a CSS selector, 'tr[onclick]', which makes it pretty easy to find a particular node. If you know JavaScript, CSS or jQuery, you'll find it pretty easy to pick up Nokogiri using its built-in support for CSS.
Also,
n['onclick'][/\(['"]([^)]+)['"]\)/, 1]
could be written alternately as:
n['onclick'][/\(([^)]+)\)/, 1][1..-2]
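The two forms differ in one respect: when an onclick value contains no parenthesised argument, the first returns nil, while the second raises NoMethodError because [1..-2] is called on nil. Here is a small side-by-side sketch on plain strings. The helper names url_from_onclick and url_from_onclick_alt are hypothetical, the samples are made up, and the nil guard in the second helper is my addition.

```ruby
# Hypothetical helper wrapping the first pattern: capture what sits between
# the quotes inside the parentheses.
def url_from_onclick(onclick)
  onclick[/\(['"]([^)]+)['"]\)/, 1]
end

# Hypothetical helper wrapping the alternate form: capture everything inside
# the parentheses, then trim the surrounding quote characters.
def url_from_onclick_alt(onclick)
  inner = onclick[/\(([^)]+)\)/, 1]
  inner && inner[1..-2]   # guard added here so a non-match returns nil
end

samples = [
  %q{window.open('/someurl');},
  %q{window.open("/other/url");},
  %q{doSomething;}               # no parenthesised argument
]

results = samples.map { |s| [url_from_onclick(s), url_from_onclick_alt(s)] }
p results
# => [["/someurl", "/someurl"], ["/other/url", "/other/url"], [nil, nil]]
```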