html ruby web-scraping nokogiri

How do I parse deeply nested text from Wikipedia using Nokogiri?


I'm trying to scrape and get a list of all player names from http://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters

Here are my newbie code bits:

class AllPlayersScraper
  attr_accessor :players, :names, :links

  def initialize(url)
    @players = Nokogiri::HTML(open(url))
  end

  def get_names
    @names = @players.css('table[class^="sortable"]')

    # @names = @players.css("div.span2 a").href
  end
end


require_relative './config/environment.rb'
rawfeed = "http://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters"
scraper = AllPlayersScraper.new(rawfeed)
nbalist = scraper.get_names

Here's the chunk of HTML I'm having trouble with. I'm not sure how to drill into the third <td> that I need.

<table class="sortable jquery-tablesorter" style=....>
<thead> 
   // bunch of html...
</thead>
<tbody>
    <tr>
    <td style="text-align:center;"><span style="display:none" class="sortkey">5.5 !</span><span class="sorttext"><a href="/wiki/Forward-center" title="Forward-center">F/C</a></span></td>
    <td style="text-align:center;">50</td>
    <td style="text-align:left;"><a href="/wiki/Lavoy_Allen" title="Lavoy Allen">Allen, Lavoy</a></td>
    <td><span style="display:none" class="sortkey">81 !</span><span class="sorttext">6 ft 9 in</span> (2.06&#160;m)</td>
    <td>255 lb (116&#160;kg)</td>
    <td style="text-align:center;">1989–02–04</td>
    <td><a href="/wiki/Temple_University" title="Temple University">Temple</a></td>
    </tr>

Thanks!


Solution

  • It's been a long time since I used Nokogiri, but this works:

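    require 'nokogiri'
    require 'open-uri'

    rawfeed = "http://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters"
    # Kernel#open no longer opens URLs on Ruby 3.0+; use URI.open from open-uri
    @page = Nokogiri::HTML(URI.open(rawfeed))

    # Each team's roster sits inside its own table.toccolours
    @all_teams = @page.css('table.toccolours')

    @parsed_teams = []
    @all_teams.each do |t|
      team = {}

      # Team name: bold text in the first row, minus the " roster" suffix
      team["name"] = t.css('tr')[0].css('b').text.gsub(" roster", "")

      team_players_rows = t.css('table.sortable tr')
      team["players"] = []

      # Skip the header row and iterate over players
      team_players_rows.drop(1).each do |tp|
        # The third <td> (index 2) holds the link with the player's name
        team["players"].push(tp.css('td')[2].css('a').text)
      end

      @parsed_teams << team
    end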
    

    @parsed_teams would be an array with values like:

    [{"name"=>"Boston Celtics",
      "players"=>["Bass, Brandon", "Bogans, Keith", "Bradley, Avery",
                  "Brooks, MarShon", "Crawford, Jordan", "Faverani, Vítor",
                  "Green, Jeff", "Humphries, Kris", "Lee, Courtney",
                  "Olynyk, Kelly", "Pressey, Phil", "Rondo, Rajon",
                  "Sullinger, Jared", "Wallace, Gerald"]},
     {"name"=>"Brooklyn Nets",...]