I've built a web scraper in Ruby using open-uri and Nokogiri. I'm pretty new to it all, but it works for a couple of the websites I need to extract data from, which have full URLs in the source. The exception is one site that uses relative URLs.
The script opens the page, builds an array of pages to open, then goes through them and extracts the data using CSS selectors (not XPath).
How do I force the script to use full URLs when the links are relative? It's been bugging me for a while and I can't seem to get it running.
In my case, I think I need to add something at the point where it pushes the URLs. Could anyone please point me in the right direction? It would be hugely appreciated! Thanks!
require 'open-uri'
require 'nokogiri'
PAGE_URL = "http://www.OMMITED.co.uk"
page = Nokogiri::HTML(open(PAGE_URL, "User-Agent" => "OMMITED"))
links = page.css("a")
links_array = Array.new
links.each{|link|
url = link['href'].nil? ? 'empty' : link['href']
if url.include? 'category' and !url.include? '/all'
links_array.push url
end
}
tl;dr: Short answer at bottom.
OK, assuming you have an instance variable called @url containing the fully qualified URL of the current page:
require 'uri'
def full_url(rel, url)
  return rel if rel.match /^[\w]*:\/\//
  uri = URI(url)
  if rel[0] == '/'
    "#{uri.scheme}://#{uri.host}#{rel}"
  else
    path = uri.path.split('/')[0..-2].select{|m| !m.empty?}.join('/')
    "#{uri.scheme}://#{uri.host}/#{path}/#{rel}"
  end
end
Then you can call:
links_array.push full_url(url, @url)
You can put the method in the same class or in a helper class somewhere. It uses the Ruby URI library to find the relevant parts of the fully qualified URL, then constructs a new one from the relative path.
If the relative path starts with '/' it should come straight after the host.
If it doesn't start with a '/' then it needs to be in the same virtual directory as the current page. Thus, if the current page is:
http://www.host.com/aaa/bbb/ccc
and the relative path is:
ddd
then the output should be:
http://www.host.com/aaa/bbb/ddd
however, if the relative path is:
/ddd
then the output should be:
http://www.host.com/ddd
The code:
uri.path.split('/')[0..-2].select{|m| !m.empty?}.join('/')
takes the path of the full URL and splits it on '/', giving an array (['', 'aaa', 'bbb', 'ccc'], where the leading empty string comes from the path's initial '/'). It then removes the last element (['', 'aaa', 'bbb']). The select removes any blank elements (['aaa', 'bbb']), then the join stitches it up again ("aaa/bbb").
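To make those steps concrete, here is a small sketch that runs each stage separately on the example URL from above. The variable names (base, parts, parent, cleaned, dir) are just for illustration and aren't part of the method.

```ruby
require 'uri'

# Example URL from the explanation above.
base = 'http://www.host.com/aaa/bbb/ccc'

path    = URI(base).path                  # "/aaa/bbb/ccc"
parts   = path.split('/')                 # leading "" comes from the initial '/'
parent  = parts[0..-2]                    # drop the last segment ("ccc")
cleaned = parent.select { |m| !m.empty? } # drop the blank element
dir     = cleaned.join('/')               # stitch the directory back together

puts parts.inspect
puts parent.inspect
puts cleaned.inspect
puts dir
```

Printing each stage like this is a quick way to see why the select is needed: without it the joined string would start with a stray '/'.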
OR
you could do it the boring way:
require 'uri'
URI.join("http://www.host.com/aaa/bbb/ccc", "/ddd").to_s
# => "http://www.host.com/ddd"
URI.join("http://www.host.com/aaa/bbb/ccc", "ddd").to_s
# => "http://www.host.com/aaa/bbb/ddd"
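As a rough sketch of what else URI.join handles for you, the snippet below tries a few more kinds of href against the same base. The absolutize wrapper and the other.example host are made up for illustration.

```ruby
require 'uri'

# Hypothetical convenience wrapper around URI.join.
def absolutize(base, href)
  URI.join(base, href).to_s
end

base = 'http://www.host.com/aaa/bbb/ccc'

up_one   = absolutize(base, '../ddd')                 # parent-directory reference
absolute = absolutize(base, 'http://other.example/x') # already a full URL
query    = absolutize(base, 'ddd?page=2')             # relative path with a query string

puts up_one    # => http://www.host.com/aaa/ddd
puts absolute  # => http://other.example/x
puts query     # => http://www.host.com/aaa/bbb/ddd?page=2
```

Note that an href which is already fully qualified comes back unchanged, so you can pass every link through the same call without checking first.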
Given your code:
links.each{|link|
url = link['href'].nil? ? 'empty' : link['href']
if url.include? 'category' and !url.include? '/all'
links_array.push url
end
}
I would re-write as:
links.each do |link|
  url = link['href'].nil? ? 'empty' : link['href']
  if url.include?('category') && !url.include?('/all')
    full_url = URI.join(PAGE_URL, url).to_s
    puts full_url
    links_array << full_url
    puts links_array.inspect
  end
end
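If you want to try the filtering and joining without hitting the site, something along these lines works on a plain array of href strings. This is only a sketch: category_links, the sample hrefs and the base URL are all invented for the example, and it assumes you'd rather skip an href that URI can't parse than have the script stop.

```ruby
require 'uri'

# Hypothetical helper: takes raw href values (possibly nil) and a base URL,
# returns fully qualified URLs for the category links only.
def category_links(hrefs, base)
  wanted = hrefs.compact.select do |href|
    href.include?('category') && !href.include?('/all')
  end

  wanted.map do |href|
    begin
      URI.join(base, href).to_s
    rescue URI::InvalidURIError
      nil # skip hrefs that are not valid URIs (e.g. containing spaces)
    end
  end.compact
end

# Made-up sample data standing in for link['href'] values.
sample_hrefs = [
  '/category/shoes',
  'category/hats',
  '/category/all',
  nil,
  '/about',
  'http://www.host.com/category/bags',
  'category/bad href'
]

result = category_links(sample_hrefs, 'http://www.host.com/shop/index.html')
puts result.inspect
```

In the real script you would build the hrefs array from links.map { |link| link['href'] } and pass PAGE_URL as the base.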
Note: Stylistically, multi-line blocks should use do/end rather than {}. Indents should be two spaces. There shouldn't be spaces just inside parentheses. The << operator is favoured over push. Always use && in conditionals rather than and, which has a far lower precedence and can cause issues. Because && binds more tightly than a method call written without parentheses, put the arguments to include? in parentheses when you combine calls with &&. See the GitHub style guide:
https://github.com/styleguide/ruby
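A tiny illustration of the precedence difference, using two made-up boolean flags. The classic place it bites is an assignment, because = binds more tightly than and but less tightly than &&.

```ruby
# Illustrative flags only; not taken from the scraper.
has_category = true
has_all      = true

# Parsed as: (with_and = has_category) and !has_all
with_and = has_category and !has_all

# Parsed as: with_amp = (has_category && !has_all)
with_amp = has_category && !has_all

puts with_and  # => true
puts with_amp  # => false
```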
The puts calls are there based on your comments, to help you figure out why your array isn't behaving as it should, given the code you posted. I'd prefer to use the debugger gem though (or byebug if you're on Ruby 2.x).