
XPath for colspan attribute values as a browser understands them?


I have an HTML table with possibly missing or malformed colspan values:

<table border="1">
  <tbody>
    <tr>
      <th>A</th> <th>B</th> <th>C</th> <th>D</th>
      <th>E</th> <th>F</th> <th>G</th> <th>H</th>
      <th>I</th> <th>J</th> <th>K</th> <th>L</th>
      <th>M</th> <th>N</th> <th>O</th> <th>P</th>
      <th>Q</th> <th>R</th> <th>S</th> <th>T</th>
      <th>U</th> <th>V</th> <th>W</th> <th>Y</th>
    </tr>
    <tr>
      <td                         >  0 </td>
      <td colspan=""              >  1 </td>
      <td colspan="0"             >  2 </td>
      <td colspan="2"             >  3 </td>
      <td colspan="-2"            >  4 </td>
      <td colspan="+ 2"           >  5 </td>
      <td colspan="+2"            >  6 </td>
      <td colspan="*2#%@!"        >  7 </td>
      <td colspan="2.7"           >  8 </td>
      <td colspan="-2.3"          >  9 </td>
      <td colspan="2e1"           > 10 </td>
      <td colspan=" 2 "           > 11 </td>
      <td colspan="2xx"           > 12 </td>
      <td colspan="2 3"           > 13 </td>
      <td colspan="2" colspan="3" > 14 </td>
    </tr>
  </tbody>
</table>

I would like to get the colspan value of each td as an HTML5 browser would "understand" it. I'm still trying to work out what the W3C specification says about this, but for now let's take the results displayed by running the above snippet as my expected output:

colspan       HTML5 value
-----------   -----------
missing       1
""            1
"0"           1
"2"           2
"-2"          1
"+ 2"         1
"+2"          2
"*2#%@!"      1
"2.7"         2
"-2.3"        1
"2e1"         2
" 2 "         2
"2xx"         2
"2 3"         2
"2" & "3"     2

How can I achieve it using XPath 3.1?


I've tried this XPath expression:

//td/( (1, @colspan[. castable as xs:double]) => max() => xs:integer() )

But it has a few issues, such as converting "2e1" to 20 instead of 2.


Solution

  • Consider using a regular expression to extract the value's leading digits (after optional whitespace and an optional + sign), ignoring everything from the first non-digit character onward. A successful match yields the leading integer; anything else yields 1:

    //td/(if (matches(@colspan[1],'^\s*\+?[1-9]\d*')) 
              then replace(@colspan[1], '^\s*\+?(\d+).*$', '$1') 
              else '1')
    

    Update: Now handles the cases added in the question update, colspan="0" and colspan="+2".

    Update 2: Added [1] to @colspan per Fravadona's observation that (not-well-formed) HTML tables might have multiple @colspan attributes.
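
As a cross-check outside XPath, the same matching rule can be sketched in Python and run against the expected-output table from the question. This is only an illustration and not part of the original answer: the function name effective_colspan is hypothetical, None stands in for a missing attribute, and the whitespace class is written out explicitly (space, tab, newline, carriage return) to mirror what \s matches in XPath regular expressions.

```python
import re

# Optional leading whitespace, optional '+', then digits that start with a
# non-zero digit. Anything after the digits is ignored.
LEADING_POSITIVE_INT = re.compile(r'[ \t\n\r]*\+?([1-9][0-9]*)')


def effective_colspan(raw):
    """Return the leading positive integer of a raw colspan value, else 1.

    `raw` is the attribute text, or None when the attribute is missing.
    """
    if raw is None:
        return 1
    # re.match anchors at the start of the string, like '^' in the XPath regex.
    match = LEADING_POSITIVE_INT.match(raw)
    return int(match.group(1)) if match else 1


# Values from the question paired with the results the asker expects.
EXPECTED = [
    (None, 1), ("", 1), ("0", 1), ("2", 2), ("-2", 1), ("+ 2", 1),
    ("+2", 2), ("*2#%@!", 1), ("2.7", 2), ("-2.3", 1), ("2e1", 2),
    (" 2 ", 2), ("2xx", 2), ("2 3", 2),
]

if __name__ == "__main__":
    for raw, expected in EXPECTED:
        actual = effective_colspan(raw)
        status = "ok" if actual == expected else "MISMATCH"
        print(f"{raw!r:>10} -> {actual} ({status})")
```

The duplicate-attribute row is not listed separately because the sketch receives a single string; with the first attribute value "2", as selected by @colspan[1], it returns 2.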