
XPath for colspan attribute values as browser understands them?


I have an HTML table with possibly malformed, missing, or duplicated colspan values:

<table border="1">
  <tbody>
    <tr>
      <th>A</th> <th>B</th> <th>C</th> <th>D</th> <th>E</th>
      <th>F</th> <th>G</th> <th>H</th> <th>I</th> <th>J</th>
      <th>K</th> <th>L</th> <th>M</th> <th>N</th> <th>O</th>
      <th>P</th> <th>Q</th> <th>R</th> <th>S</th> <th>T</th>
      <th>U</th> <th>V</th> <th>W</th> <th>X</th> <th>Z</th>
    </tr>
    <tr>
      <td                         >  0 </td>
      <td colspan=""              >  1 </td>
      <td colspan="0"             >  2 </td>
      <td colspan="2"             >  3 </td>
      <td colspan="-2"            >  4 </td>
      <td colspan="+ 2"           >  5 </td>
      <td colspan="+2"            >  6 </td>
      <td colspan="*2#%@!"        >  7 </td>
      <td colspan="2.7"           >  8 </td>
      <td colspan="-2.3"          >  9 </td>
      <td colspan="2e1"           > 10 </td>
      <td colspan=" 2 "           > 11 </td>
      <td colspan="2xx"           > 12 </td>
      <td colspan="2 3"           > 13 </td>
      <td colspan="2" colspan="3" > 14 </td>
      <td colspan="++2"           > 15 </td>
    </tr>
  </tbody>
</table>

I would like to get the colspan value of each td as an HTML5 browser would "understand" it. I'm still trying to figure out what the W3C specification says about this, so for now let's take the results displayed by running the above snippet as my expected output:

colspan        HTML5 value
-------        -----------
missing        1
""             1
"0"            1
"2"            2
"-2"           1
"+ 2"          1
"+2"           2
"*2#%@!"       1
"2.7"          2
"-2.3"         1
"2e1"          2
" 2 "          2
"2xx"          2
"2 3"          2
"2" & "3"      2
"++2"          1

How can I achieve it using XPath 3.1?


I came up with this XPath expression:

//td/( (1, @colspan[. castable as xs:double]) => max() => xs:integer() )

But it has a few issues, such as converting "2e1" to 20 instead of 2.


Solution

  • Consider using a regex to extract the leading digits of the value (after optional whitespace and an optional +), ignoring everything from the first non-digit character onward. A successful match yields the leading integer; anything else yields 1:

    //td/(if (matches(@colspan[1],'^\s*\+?[1-9]\d*')) 
              then replace(@colspan[1], '^\s*\+?(\d+).*$', '$1') 
              else '1')
    

    Update: Now handles the cases added in the question update, colspan="0" and colspan="+2".

    Update 2: Added [1] to @colspan per Fravadona's observation that (not-well-formed) HTML tables might have multiple @colspan attributes.
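
As a quick cross-check, the same regex can be run outside XPath against the expected values listed in the question. The following is only a sketch in Python: the function name effective_colspan and the use of None for a missing attribute are assumptions made for illustration, and the duplicated-attribute case is left out because it depends on how the parser exposes the attribute.

```python
import re

# Same pattern as in the XPath expression: optional whitespace,
# an optional '+', then digits that do not start with 0.
LEADING_INT = re.compile(r'^\s*\+?([1-9]\d*)')

def effective_colspan(value):
    """Hypothetical helper: None stands for a missing colspan attribute."""
    if value is None:
        return 1
    m = LEADING_INT.match(value)
    # A match yields the leading integer; anything else falls back to 1.
    return int(m.group(1)) if m else 1

# (raw attribute value, expected value from the question's table)
cases = [
    (None, 1), ("", 1), ("0", 1), ("2", 2), ("-2", 1), ("+ 2", 1),
    ("+2", 2), ("*2#%@!", 1), ("2.7", 2), ("-2.3", 1), ("2e1", 2),
    (" 2 ", 2), ("2xx", 2), ("2 3", 2), ("++2", 1),
]

for raw, expected in cases:
    assert effective_colspan(raw) == expected, (raw, expected)
```

Note that the pattern requires the first digit to be 1-9, so a value with leading zeros such as "02" would fall back to 1 here; the question's table does not include such a case.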