I would like to use Hpricot to scan the inner_text of all elements and know which element is currently being scanned. However, every approach I have tried leads to recursion. Is there a built-in function that does this in Hpricot (or Nokogiri)? The code below only scans one level down:
    require 'hpricot'
    require 'open-uri'

    @t = []
    doc = Hpricot(open("some html doc"))
    (doc/"html").each do |e|
      # Only the direct children of <html> are inspected here
      e.children.each do |child|
        if child.is_a?(Hpricot::Text)
          @t << child.to_s.strip
        end
      end
    end
Although I'm not sure exactly why you want to collect all text nodes (perhaps there is a more efficient solution), this should get you started:
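    require 'nokogiri'

    doc = Nokogiri::HTML(open('doc'))

    # Visit every node under <body>; children are yielded before their parent
    doc.at_css("body").traverse do |node|
      puts "***#{node.name}"  # "text" for text nodes, the tag name for elements
      puts node.text          # for an element, the combined text of all its descendants
    end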
It uses Nokogiri's traverse, which visits the starting node and every node beneath it, yielding children before their parent.