
How do I use Hpricot to search the inner_text of all elements?


I would like to use Hpricot to scan the inner_text of every element and know which element is currently being scanned. However, every approach I have tried turns into a recursion. Is there a built-in way to do this with Hpricot (or Nokogiri)? The code below only scans one level down:

require 'hpricot'

@t = []
doc = Hpricot(open("some html doc"))
(doc/"html").each do |e|
  # Only the direct children of <html> are inspected here
  e.children.each do |child|
    @t << child.to_s.strip if child.is_a?(Hpricot::Text)
  end
end

Solution

  • I'm not sure exactly why you want to collect all text nodes (there may be a more efficient solution), but this should get you started:

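    require 'nokogiri'
    doc = Nokogiri::HTML(open('doc'))
    
    doc.at_css("body").traverse do |node|
      # node.name is the tag name ("text" for text nodes)
      puts "***#{node.name}"
      # For an element, node.text includes the text of all its descendants
      puts node.text
    end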
    

    It uses Nokogiri's traverse, which visits the starting node and every node beneath it, yielding a node's children before the node itself.