
ROR/Hpricot: parsing a site and searching/comparing strings with regex


I have just started with Ruby on Rails and want to create a simple website crawler that:

  1. Goes through all the Sherdog fighters' profiles.
  2. Gets the referees' names.
  3. Compares them with the names already seen (both earlier in the same crawl and in a saved file).
  4. Prints all the unique names and saves them to the file.

An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500

I am looking for elements such as <span class="sub_line">Dan Miragliotta</span>. Unfortunately, besides the referee names I need, the same class is also used for:

  1. The date.
  2. "N/A" when the referee name is not known.

I need to discard every result that is "N/A" as well as any string that contains digits. I managed the first part but could not figure out the second. After a lot of searching, experimenting and rewriting, I managed to break the whole program and do not know how to fix it properly:

require 'rubygems'
require 'hpricot'
require 'simplecrawler'

# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]

# The crawler yields a Document object for each visited page.
sc.crawl { |document|
  # Parse page title with Hpricot and print it
  hdoc = Hpricot(document.data)

  (hdoc/"td/span[@class='sub_line']").each do |span|
    if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
      # puts "Test"
    else
      puts span.inner_html
      #File.open("File_name.txt", 'a') {|f| f.puts(hdoc.span.inner_html) }
    end
  end
}

I would also appreciate ideas for the rest of the program: how do I read the existing names from the file when the program is run multiple times, and how do I compare against them so that only unique names are kept?


Edit:

After some proposed improvements, here is what I got:

require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
#require 'open-uri'

sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1

sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
  puts names
}

Unfortunately, the code still doesn't work: it prints a blank result.

If I write doc = Nokogiri::HTML(open(document.data)) instead of doc = Nokogiri::HTML(document.data), it gives me the whole page, but the parsing still doesn't work.


Solution

  • You would use array math (-) to compare them:

    Get the referees from the current page:

    current_referees = doc.search('td[4] .sub_line').map(&:inner_text).uniq - ['N/A']

    Read the old referees from the file:

    old_referees = File.read('old_referees.txt').split("\n")

    Use Array#- to compare them:

    new_referees = current_referees - old_referees

    Write the new file:

    File.open('new_referees.txt','w'){|f| f << new_referees * "\n"}
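
The steps above leave two parts of the question open: dropping the date entries (any string containing a digit) and reusing the saved names when the program runs again. The following is only a sketch under stated assumptions, using nothing beyond core Ruby. The names `clean_names` and `merge_referees` and the file name `referees.txt` are hypothetical, and the sample strings (including the date format) are made up to stand in for the scraped `sub_line` values.

```ruby
# Keep plausible referee names only: drop blanks, "N/A", and anything
# containing a digit (the dates share the same CSS class).
def clean_names(raw)
  raw.map(&:strip)
     .reject { |s| s.empty? || s == 'N/A' || s =~ /\d/ }
     .uniq
end

# Append names not yet stored in the file at `path` and return them.
# The file is treated as empty if it does not exist yet (first run).
def merge_referees(path, current)
  known = File.exist?(path) ? File.read(path).split("\n") : []
  fresh = current.uniq - known
  File.open(path, 'a') { |f| fresh.each { |name| f.puts(name) } }
  fresh
end

# Made-up sample input standing in for the scraped span contents.
scraped = ['Dan Miragliotta', 'N/A', 'Jan / 01 / 2010', 'Dan Miragliotta']
names = clean_names(scraped)
puts names

# On a real run (hypothetical file name):
#   new_names = merge_referees('referees.txt', names)
```

The digit test in the question, `Regexp.new(".*/\d\.*$")`, cannot work as written, because a double-quoted Ruby string turns `\d` into a plain `d`; a regex literal such as `/\d/` sidesteps that. Separately, the Nokogiri version in the edit selects `.sub-line`, while the markup shown in the question uses `sub_line`, which would probably explain the empty result.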