ruby nokogiri screen-scraping

Using Nokogiri to find element before another element


I have a partial HTML document:

<h2>Destinations</h2>
<div>It is nice <b>anywhere</b> but here.</div>
<ul>
  <li>Florida</li>
  <li>New York</li>
</ul>
<h2>Shopping List</h2>
<ul>
  <li>Booze</li>
  <li>Bacon</li>
</ul>

For every <li> item, I want to know which category the item is in, i.e., the text of the nearest preceding <h2> tag.

This code does not work, but this is what I'm trying to do:

@page.search('li').each do |li|
  li.previous('h2').text
end

Solution

  • Nokogiri allows you to use XPath expressions to locate an element:

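    categories = []

    # doc is a parsed Nokogiri document, e.g. Nokogiri::HTML(html)
    doc.xpath("//li").each do |elem|
      # From the enclosing list, take the closest preceding <h2> sibling
      categories << elem.parent.xpath("preceding-sibling::h2").last.text
    end

    # Each heading was collected once per <li>, so drop the duplicates
    categories.uniq!
    p categories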
    

    The first part finds all "li" elements. For each one, we go up to its parent (ul or ol), then look for h2 elements that come before that parent at the same level (preceding-sibling). There can be more than one, so we take the last, i.e., the one closest to the current position.

    We need to call "uniq!" because we get an h2 for each 'li' (the 'li' is the starting point), so every heading is collected once per list item.

    Using your own HTML example, this code outputs:

    ["Destinations", "Shopping List"]