
How do I find direct children and not nested children using Rails and Nokogiri?


I’m using Rails 4.2.7 with Ruby 2.3 and Nokogiri. How do I find only the tr elements that belong directly to a table, as opposed to those in nested tables? Currently I find the rows within a table like so:

  tables = doc.css('table')
  tables.each do |table|
    rows = table.css('tr')
    # ...
  end

This not only finds direct rows of a table, e.g.

<table>
    <tbody>
        <tr>…</tr>

but it also finds the rows of tables nested inside those rows, e.g.

<table>
    <tbody>
        <tr>
            <td>
                <table>
                    <tr>This is found</tr>
                </table>
            </td>
        </tr>

How do I refine my search to only find the direct tr elements?


Solution

  • You can do it in two steps using XPath. First find the “level” of the table (i.e. how deeply it is nested in other tables, counting the table itself), then find all descendant tr elements that have the same number of table ancestors:

    tables = doc.xpath('//table')
    tables.each do |table|
      level = table.xpath('count(ancestor-or-self::table)')
      rows = table.xpath(".//tr[count(ancestor::table) = #{level}]")
      # do what you want with rows...
    end
    

    In the more general case, where you might have tr elements nested directly inside other tr elements, you could do something like this (this would be invalid HTML, but you might have XML or some other tags):

    tables.each do |table|
      # Find the first descendant tr, and determine its level. This
      # will be a "top-level" tr for this table. "level" here means how
      # many tr elements (including itself) are between it and the
      # document root.
      level = table.xpath("count(descendant::tr[1]/ancestor-or-self::tr)")
      # Now find all descendant trs that have that same level. Since
      # the table itself is at a fixed level, this means all these nodes
      # will be "top-level" rows for this table.
      rows = table.xpath(".//tr[count(ancestor-or-self::tr) = #{level}]")
      # handle rows...
    end
    

    The first step could be broken into two separate queries, which may be clearer:

    first_tr = table.at_xpath(".//tr")
    level = first_tr.xpath("count(ancestor-or-self::tr)")
    

    (This will fail if there is a table with no tr elements though, as first_tr will be nil. The combined XPath above handles that situation correctly: the count is 0 and the second query simply returns no rows.)