
Ruby - Opening absolute url from relative


I've built a web scraper in Ruby using open-uri and Nokogiri. I'm pretty new to it all, but it works for the couple of websites I need to extract data from that have full URLs in their source. The exception is one site that uses relative URLs.

The script opens the page, builds an array of pages to open, then goes through them and extracts the data using CSS selectors (not XPath).

How do I force the script to use full URLs when the links are relative? It's been bugging me for a while and I can't seem to get it working.

I think I need to add something at the point where it pushes the URLs. Could anyone please point me in the right direction? It would be hugely appreciated. Thanks!

require 'open-uri'
require 'nokogiri'

PAGE_URL = "http://www.OMMITED.co.uk"

page = Nokogiri::HTML(URI.open(PAGE_URL, "User-Agent" => "OMMITED"))

links = page.css("a")

links_array = Array.new

links.each{|link|
        url = link['href'].nil? ? 'empty' : link['href']
        if url.include? 'category'  and !url.include? '/all'
                links_array.push url
        end
}

Solution

  • tl;dr: Short answer at bottom.

    OK, assuming you have an instance variable called @url containing the fully qualified URL of the current page:

    require 'uri'
    
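    def full_url(rel, url)
      # Already fully qualified (has a scheme such as http://): return unchanged.
      return rel if rel.match(/^[\w]*:\/\//)
      uri = URI(url)
      if rel[0] == '/'
        # Root-relative: goes straight after the host.
        "#{uri.scheme}://#{uri.host}#{rel}"
      else
        # Page-relative: same directory as the current page.
        path = uri.path.split('/')[0..-2].select{|m| !m.empty?}.join('/')
        path = "#{path}/" unless path.empty?  # avoid "//" when the page sits at the root
        "#{uri.scheme}://#{uri.host}/#{path}#{rel}"
      end
    end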
    

    Then you can call:

    links_array.push full_url(url, @url)
    

    You can put the method in the same class or in a helper class somewhere. It uses the Ruby URI library to find the relevant parts of the fully qualified URL, then constructs a new one from the relative path.

    If the relative path starts with '/' it should come straight after the host.

    If it doesn't start with a '/' then it needs to be in the same virtual directory as the current page. Thus, if the current page is:

    http://www.host.com/aaa/bbb/ccc
    

    and the relative path is:

    ddd
    

    then the output should be:

    http://www.host.com/aaa/bbb/ddd
    

    however, if the relative path is:

    /ddd
    

    then the output should be:

    http://www.host.com/ddd
    

    The code:

    uri.path.split('/')[0..-2].select{|m| !m.empty?}.join('/')
    

    takes the path of the full URL and splits it on '/', giving an array (['', 'aaa', 'bbb', 'ccc']; the leading '/' produces the blank first element), then removes the last element (['', 'aaa', 'bbb']). The select removes any blank elements (['aaa', 'bbb']), then the join stitches it up again ("aaa/bbb").
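
    Tracing that expression one step at a time can make each intermediate value easier to see. This is only a sketch: the URL is the example one used above, and the variable names are made up for illustration.

```ruby
require 'uri'

# Example page URL, reused from the walkthrough above.
uri = URI('http://www.host.com/aaa/bbb/ccc')

segments  = uri.path.split('/')               # the leading '/' yields a blank first element
parents   = segments[0..-2]                   # drop the last element (the page itself)
cleaned   = parents.select { |m| !m.empty? }  # discard blank elements
directory = cleaned.join('/')                 # stitch the remaining segments together

p segments   # ["", "aaa", "bbb", "ccc"]
p parents    # ["", "aaa", "bbb"]
p cleaned    # ["aaa", "bbb"]
p directory  # "aaa/bbb"
```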

    OR

    you could do it the boring way:

    require 'uri'
    
    URI.join("http://www.host.com/aaa/bbb/ccc", "/ddd").to_s
    # => "http://www.host.com/ddd" 
    
    URI.join("http://www.host.com/aaa/bbb/ccc", "ddd").to_s
    # => "http://www.host.com/aaa/bbb/ddd" 
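
    URI.join also copes with a few cases that a hand-rolled helper would have to handle separately. The sketch below reuses the example base URL; the hrefs (including other.example) are invented for illustration.

```ruby
require 'uri'

base = 'http://www.host.com/aaa/bbb/ccc'  # example current page

# '..' climbs one directory before appending.
parent = URI.join(base, '../ddd').to_s
# An href that is already absolute is returned as-is.
absolute = URI.join(base, 'http://other.example/x').to_s
# Query strings on the relative link are preserved.
query = URI.join(base, 'ddd?page=2').to_s

puts parent    # http://www.host.com/aaa/ddd
puts absolute  # http://other.example/x
puts query     # http://www.host.com/aaa/bbb/ddd?page=2
```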
    

    given your code:

    links.each{|link|
        url = link['href'].nil? ? 'empty' : link['href']
        if url.include? 'category'  and !url.include? '/all'
                links_array.push url
        end
    }
    

    I would re-write as:

    links.each do |link|
      url = link['href'].nil? ? 'empty' : link['href']
      if url.include?('category') && !url.include?('/all')
        full_url = URI.join(PAGE_URL, url).to_s
        puts full_url
        links_array << full_url
        puts links_array.inspect
      end
    end
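
    If you want to try the filtering and joining without hitting the network, one option is to pull that logic into a small method that takes plain href strings. This is a sketch under assumptions: absolute_category_links is a hypothetical helper name, the hrefs are made up, and silently skipping hrefs that URI cannot parse is a design choice you may not want.

```ruby
require 'uri'

# Hypothetical helper: keeps category links and resolves each against the page URL.
def absolute_category_links(page_url, hrefs)
  hrefs.each_with_object([]) do |href, result|
    url = href.to_s  # a nil href becomes an empty string
    next unless url.include?('category') && !url.include?('/all')
    begin
      result << URI.join(page_url, url).to_s
    rescue URI::InvalidURIError
      # Skip hrefs that are not valid URIs (for example, ones containing spaces).
    end
  end
end

# Made-up hrefs standing in for the values of link['href'].
hrefs = ['/category/shoes', 'category/hats', '/category/all', nil, '/about', 'category/bad link']
links_array = absolute_category_links('http://www.host.com/shop/index.html', hrefs)
puts links_array
```

    In the scraper itself you would pass links.map { |link| link['href'] } as the second argument.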
    

    Note: Stylistically, multi-line blocks should use do/end rather than {}. Indents should be two spaces. There shouldn't be spaces just inside parentheses. The << operator is favoured over push. Always use && in conditionals rather than and, which has a far lower precedence and can cause issues. Because && binds more tightly than a method call written without parentheses, put the arguments of include? in parentheses when you use it. See the GitHub style guide:

    https://github.com/styleguide/ruby
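
    The precedence difference between and and && shows up most clearly in an assignment. A minimal sketch, with made-up variable names standing in for the two include? checks:

```ruby
has_category = true
has_all      = true

# `and` binds more loosely than `=`, so this is parsed as
# (loose = has_category) and !has_all
loose = has_category and !has_all

# `&&` binds more tightly than `=`, so this is parsed as
# tight = (has_category && !has_all)
tight = has_category && !has_all

p loose  # true
p tight  # false
```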

    The puts calls are there based on your comments, and will hopefully help you figure out why your array isn't behaving as it should, based on the code you put in there. I'd prefer to use the debugger gem though (or byebug if you're on Ruby 2.x).