[SOLVED] Pandas read_html() with table containing html elements

I have the following HTML table:

<table>
 <thead>
   <th> X1 </th>
   <th> X2 </th>
</thead>
<tbody>
   <tr>
    <td>Test</td>
    <td><span style="..."> Test2 </span> </td>
  </tr>
</tbody>
</table>

that I want to parse into a DataFrame using pd.read_html(). The output is as follows:

X1	X2
Test	Test2

However, I would prefer the following output (preserving HTML elements within a cell):

X1	X2
Test	<span style="..."> Test2 </span>

Is this possible with pd.read_html()?

I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.

Solution

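As far as I know, read_html() has no option for this, but you could override the parser's private ._text_getter() method if you really wanted to. Note that this relies on pandas internals, so it may break between versions, and the patch applies to every later read_html() call in the same process.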

Something like:

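from io import StringIO

import lxml.html
import pandas as pd

html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""

def custom_text_getter(self, obj):
    # Build the cell's inner HTML: leading text, then every child element
    # serialized as markup (tostring() also includes the child's tail text).
    # Unlike indexing node()[0], this also works for empty cells.
    parts = [obj.text or ""]
    for child in obj:
        parts.append(lxml.html.tostring(child, encoding="unicode"))
    return "".join(parts)

# Monkeypatch the private lxml-based parser class.
pd.io.html._LxmlFrameParser._text_getter = custom_text_getter

print(
    # Wrap the literal in StringIO (newer pandas deprecates raw HTML strings)
    # and force the lxml flavor so the patched parser is the one used.
    pd.read_html(StringIO(html), flavor="lxml")[0]
)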

     X1                                X2
0  Test  <span style="..."> Test2 </span>