I have the following HTML table:
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
which I want to parse into a DataFrame using pd.read_html(). The output is as follows:
| X1 | X2 |
|---|---|
| Test | Test2 |
However, I would prefer the following output (preserving HTML elements within a cell):
| X1 | X2 |
|---|---|
| Test | <span style="..."> Test2 </span> |
Is this possible with pd.read_html()?
I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.
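You could monkeypatch the lxml parser's private ._text_getter() method if you really wanted to. It is internal to pandas rather than part of the public API, so it may change between versions.
Something like: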
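from io import StringIO

import lxml.html
import pandas as pd

html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""


def custom_text_getter(self, obj):
    # Child nodes of the cell: text nodes and elements, in document order.
    nodes = obj.xpath("node()")
    if not nodes:
        return ""  # empty cell
    # Only the first child node is kept; later siblings are dropped.
    result = nodes[0]
    if isinstance(result, lxml.html.HtmlElement):
        # Serialize the element back to an HTML string instead of taking its text.
        result = lxml.html.tostring(result, encoding="unicode")
    return result


# Monkeypatch pandas' private lxml parser; this affects every later read_html() call.
pd.io.html._LxmlFrameParser._text_getter = custom_text_getter

# pandas 2.1+ deprecates passing literal HTML, so wrap the string in StringIO.
# flavor="lxml" makes sure the patched parser is the one that runs.
print(pd.read_html(StringIO(html), flavor="lxml")[0])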
This prints:
     X1                                X2
0  Test  <span style="..."> Test2 </span>
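If patching a private method feels too fragile, the manual parsing mentioned in the question can be sketched with the standard library alone. This is only a rough illustration, not a drop-in replacement for read_html(): the class name CellHTMLParser and its attributes are made up for this example, it assumes a single flat table without nested tables, colspan, or rowspan, and character entities inside cells come back decoded.

```python
from html.parser import HTMLParser

html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""


class CellHTMLParser(HTMLParser):
    """Hypothetical helper: collect the inner HTML of each <th>/<td>, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = []     # cells of the row currently being read
        self._cell = None  # fragments of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._cell = []
        elif self._cell is not None:
            # Tag nested inside a cell: keep its source text verbatim.
            self._cell.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to emit.
        if self._cell is not None:
            self._cell.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None:
                self._row.append("".join(self._cell).strip())
                self._cell = None
        elif tag in ("tr", "thead"):
            # The sample header has no <tr>, so </thead> also closes a row.
            if self._row:
                self.rows.append(self._row)
                self._row = []
        elif self._cell is not None:
            self._cell.append(f"</{tag}>")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


parser = CellHTMLParser()
parser.feed(html)
parser.close()
rows = parser.rows
print(rows)
```

Here rows[0] holds the header cells and rows[1:] the body rows, so pd.DataFrame(rows[1:], columns=rows[0]) should give the frame asked for in the question.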