I have an HTML table with possibly malformed, missing, or duplicated colspan values:
<table border="1">
<tbody>
<tr>
<th>A</th> <th>B</th> <th>C</th> <th>D</th> <th>E</th>
<th>F</th> <th>G</th> <th>H</th> <th>I</th> <th>J</th>
<th>K</th> <th>L</th> <th>M</th> <th>N</th> <th>O</th>
<th>P</th> <th>Q</th> <th>R</th> <th>S</th> <th>T</th>
<th>U</th> <th>V</th> <th>W</th> <th>X</th> <th>Z</th>
</tr>
<tr>
<td > 0 </td>
<td colspan="" > 1 </td>
<td colspan="0" > 2 </td>
<td colspan="2" > 3 </td>
<td colspan="-2" > 4 </td>
<td colspan="+ 2" > 5 </td>
<td colspan="+2" > 6 </td>
<td colspan="*2#%@!" > 7 </td>
<td colspan="2.7" > 8 </td>
<td colspan="-2.3" > 9 </td>
<td colspan="2e1" > 10 </td>
<td colspan=" 2 " > 11 </td>
<td colspan="2xx" > 12 </td>
<td colspan="2 3" > 13 </td>
<td colspan="2" colspan="3" > 14 </td>
<td colspan="++2" > 15 </td>
</tr>
</tbody>
</table>
I would like to get the colspan value of each td as an HTML5 browser would "understand" it. I'm currently trying to figure out what the W3C specification says about it, but let's consider that the results displayed by running the above snippet are my expected output:
colspan | HTML5 value |
---|---|
missing | 1 |
"" | 1 |
"0" | 1 |
"2" | 2 |
"-2" | 1 |
"+ 2" | 1 |
"+2" | 2 |
"*2#%@!" | 1 |
"2.7" | 2 |
"-2.3" | 1 |
"2e1" | 2 |
" 2 " | 2 |
"2xx" | 2 |
"2 3" | 2 |
"2" & "3" | 2 |
"++2" | 1 |
How can I achieve it using XPath 3.1?
I came up with this XPath expression:
//td/( (1, @colspan[. castable as xs:double]) => max() => xs:integer() )
But it has a few issues, such as converting "2e1" to 20 instead of 2.
Consider using a regex to extract the leading digit characters of the value, ignoring everything from the first non-digit character onward. A successful match yields the leading integer; anything else yields 1:
//td/(if (matches(@colspan[1],'^\s*\+?[1-9]\d*'))
then replace(@colspan[1], '^\s*\+?(\d+).*$', '$1')
else '1')
Update: Now handles the cases added by the question update, colspan="0" and colspan="+2".
Update 2: Added [1] to @colspan per Fravadona's observation that (not-well-formed) HTML tables might have multiple @colspan attributes.
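As a sanity check, the same leading-integer rule can be run against the expected table from the question. The sketch below is Python rather than XPath, so it only mirrors the regex logic; the names effective_colspan and EXPECTED are hypothetical, the character class [ \t\n\r] is assumed to stand in for the XPath \s, and the function returns an integer where the XPath expression returns a string. The duplicated-attribute case is left out because it depends on how the parser exposes the first attribute, not on the regex.

```python
import re

# Optional whitespace, an optional single '+', then an integer that does not
# start with 0. [ \t\n\r] is used instead of \s because Python's \s matches
# more characters than the XPath \s (an assumption made for this sketch).
LEADING_INT = re.compile(r'^[ \t\n\r]*\+?([1-9][0-9]*)')


def effective_colspan(raw):
    """Return the colspan the regex rule yields; raw is None if the attribute is missing."""
    if raw is None:
        return 1
    match = LEADING_INT.match(raw)
    return int(match.group(1)) if match else 1


# Expected values copied from the table in the question.
EXPECTED = {
    None: 1, "": 1, "0": 1, "2": 2, "-2": 1, "+ 2": 1, "+2": 2,
    "*2#%@!": 1, "2.7": 2, "-2.3": 1, "2e1": 2, " 2 ": 2,
    "2xx": 2, "2 3": 2, "++2": 1,
}

if __name__ == "__main__":
    for raw, expected in EXPECTED.items():
        print(repr(raw), effective_colspan(raw), expected)
```

Every row should print the same number twice. Note that a value with leading zeros such as "02" yields 1 under this rule, because the pattern requires a first digit of 1 to 9; whether that matches browser behaviour is not covered by the cases in the question.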