Tags: python, html, pandas, parsing, html-parsing

Pandas read_html() with table containing html elements


I have the following HTML table:

<table>
 <thead>
   <th> X1 </th>
   <th> X2 </th>
</thead>
<tbody>
   <tr>
    <td>Test</td>
    <td><span style="..."> Test2 </span> </td>
  </tr>
</tbody>
</table>

I want to parse it into a DataFrame using pd.read_html(). The output is as follows:

X1 X2
Test Test2

However, I would prefer the following output (preserving HTML elements within a cell):

X1 X2
Test <span style="..."> Test2 </span>

Is this possible with pd.read_html()?

I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.


Solution

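  • You could override the private ._text_getter() method of pandas' lxml-based parser if you really wanted to. It is an internal detail rather than public API, so it may change between pandas versions.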

    Something like:

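    from io import StringIO

    import lxml.html
    import pandas as pd

    html = """
    <table>
    <thead>
    <th> X1 </th>
    <th> X2 </th>
    </thead>
    <tbody>
    <tr>
    <td>Test</td>
    <td><span style="..."> Test2 </span> </td>
    </tr>
    </tbody>
    </table>
    """

    def custom_text_getter(self, obj):
        # Return the cell's inner HTML (leading text plus every serialized
        # child element) instead of the tag-free text pandas extracts by default.
        # Unlike taking only the first node, this also copes with empty cells.
        children = "".join(
            lxml.html.tostring(child, encoding="unicode") for child in obj
        )
        return (obj.text or "") + children

    # Monkey-patch the private lxml parser class used by read_html().
    pd.io.html._LxmlFrameParser._text_getter = custom_text_getter

    print(
        # Wrap the markup in StringIO: recent pandas versions deprecate passing
        # a literal HTML string. flavor="lxml" selects the patched parser.
        pd.read_html(StringIO(html), flavor="lxml")[0]
    )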
    
         X1                                X2
    0  Test  <span style="..."> Test2 </span>
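
If patching a private pandas method feels too fragile, the manual-parsing route mentioned in the question can be sketched with the standard library alone. This is a minimal sketch, not a drop-in replacement for read_html(): the names CellHTMLParser and parse_table are made up for illustration, and it assumes a single table with no nested tables and no colspan/rowspan handling.

```python
from html.parser import HTMLParser


class CellHTMLParser(HTMLParser):
    """Collect table cells while keeping the markup inside each cell.

    Hypothetical helper for illustration; assumes no nested tables.
    """

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.header = []    # inner HTML of every <th>
        self.rows = []      # one list of <td> inner HTML per <tr>
        self._row = []      # cells of the row currently being read
        self._parts = None  # fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if self._parts is not None:
            # Tag nested inside a cell: keep it exactly as written.
            self._parts.append(self.get_starttag_text())
        elif tag in ("th", "td"):
            self._parts = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept once, without an end tag.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._parts is not None:
            cell = "".join(self._parts).strip()
            (self.header if tag == "th" else self._row).append(cell)
            self._parts = None
        elif self._parts is not None:
            self._parts.append(f"</{tag}>")
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._parts is not None:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._parts is not None:
            self._parts.append(f"&#{name};")


def parse_table(html):
    """Return (header, rows) with the inner HTML of each cell preserved."""
    parser = CellHTMLParser()
    parser.feed(html)
    parser.close()
    return parser.header, parser.rows


html = """
<table>
<thead><th> X1 </th><th> X2 </th></thead>
<tbody>
<tr><td>Test</td><td><span style="..."> Test2 </span> </td></tr>
</tbody>
</table>
"""

header, rows = parse_table(html)
print(header)  # ['X1', 'X2']
print(rows)    # [['Test', '<span style="..."> Test2 </span>']]
```

The result can then be handed to pandas with pd.DataFrame(rows, columns=header). The trade-off is that you give up read_html() conveniences such as header inference and spanning cells, but nothing here depends on pandas internals.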