ruby-on-rails web-scraping

Scraping all of the URLs of a page


I have a snippet that successfully scrapes the images from a URL. The thing is, I want to gather lots of images from various websites, and I can't keep entering the URL manually every time.

Since I'm new to scraping, how do you guys handle this? What is the best way to scrape every URL? Do I need to have the URLs in a CSV or something? Is it automatic?

My script

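  require 'nokogiri'
  require 'open-uri'
  require 'uri'

  URL = 'http://www.sitasde.com'

  # Resolve a possibly relative href against the page URL.
  def make_absolute(href, root)
    URI.join(root, href.to_s).to_s
  end

  # Download every image referenced by an <img src="..."> on the page.
  # URI.open is used because a bare open(url) no longer fetches URLs in Ruby 3.
  Nokogiri::HTML(URI.open(URL)).xpath('//img/@src').each do |src|
    uri = make_absolute(src, URL)
    File.open(File.basename(uri), 'wb') { |f| f.write(URI.open(uri).read) }
  end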

Solution

  • You need to specify a pattern.

    One way (like how Google works) is to detect all the anchor links (<a> tags) and add them to a queue (an array, for example). Once you are done scraping images on the current page, remove it from the array and move on to the next page in the array, repeating the same process (find all links, push them onto the array, save the images on the current link, remove the current link from the array). Repeat this for as long as the array has length > 0, that is, until it is empty.

    But this can cause problems, such as running out of memory on large websites. So you can also set a time limit and a memory limit, or put a limit in the code itself: restrict it to the same website, and cap the array at, say, 100 URLs. If you are doing this in parts, keep a record of the URLs you have scraped so you don't scrape them again in the future.

    I would recommend using a database to keep track of the URLs scraped.
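
The queue idea above could be sketched roughly like this. This is only a minimal sketch, not a finished crawler: the names `crawl`, `LINKS_ON_PAGE` and `max_pages` are made up for illustration, it assumes you only want pages on the same host as the start URL, and it does no error handling for pages that fail to download.

```ruby
require 'set'
require 'uri'

# Default page reader (hypothetical name): returns the raw href of every <a> tag.
# Nokogiri and open-uri are loaded only when a real page is fetched.
LINKS_ON_PAGE = lambda do |url|
  require 'nokogiri'
  require 'open-uri'
  Nokogiri::HTML(URI.open(url)).xpath('//a/@href').map(&:to_s)
end

# Breadth-first crawl starting at start_url.
# Returns the pages visited, in order: at most max_pages, all on the start host.
def crawl(start_url, max_pages: 100, links_on_page: LINKS_ON_PAGE)
  host    = URI.parse(start_url).host
  queue   = [start_url]
  seen    = Set.new(queue) # every URL ever queued, so nothing is queued twice
  visited = []

  until queue.empty? || visited.size >= max_pages
    page = queue.shift
    visited << page
    yield page if block_given? # e.g. save the images on this page

    links_on_page.call(page).each do |href|
      link = begin
        URI.join(page, href)
      rescue URI::Error
        next # skip hrefs that are not valid URIs
      end
      # Assumption: stay on the same website and follow http/https links only.
      next unless link.is_a?(URI::HTTP) && link.host == host

      link.fragment = nil # treat page#top and page as the same URL
      queue << link.to_s if seen.add?(link.to_s)
    end
  end
  visited
end
```

You would call it with a block that does the per-page work, for example `crawl('http://www.sitasde.com', max_pages: 100) { |page| puts page }`, and put the image-saving code from the question inside the block. The `seen` set only lives in memory for one run; if you scrape in parts, that is the piece you would swap for a database table.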