
Python solution to convert HTML tables to readable plain text


I am looking for a way to cleanly convert HTML tables to readable plain text.

For example, given the input:

<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>

I expect the output:

Height: 200
Width: 440

I would prefer not to use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume this is doable in Python alone, with or without extra modules.

I don't need any word-wrapping or adjustable cell separator width. Having tabs as cell separators would be good enough.

Update

This is an old question for an old use case. Since pandas now provides the read_html function, my current answer would definitely be pandas-based.


Solution

  • How about using the approach from this question:

    Parse HTML table to Python list?

    However, use collections.OrderedDict() instead of a plain dictionary to preserve the column order. Once you have the dictionary, it is very easy to extract and format the text from it.

    Adapting the solution by @Colt 45:

    import xml.etree.ElementTree
    import collections
    
    s = """\
    <table>
        <tr>
            <th>Height</th>
            <th>Width</th>
            <th>Depth</th>
        </tr>
        <tr>
            <td>10</td>
            <td>12</td>
            <td>5</td>
        </tr>
        <tr>
            <td>0</td>
            <td>3</td>
            <td>678</td>
        </tr>
        <tr>
            <td>5</td>
            <td>3</td>
            <td>4</td>
        </tr>
    </table>
    """
    
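    # Parse the markup; this works only if the table is well-formed XML.
    table = xml.etree.ElementTree.XML(s)
    rows = iter(table)
    # The first row holds the column headers.
    headers = [col.text for col in next(rows)]
    for row in rows:
        values = [col.text for col in row]
        # Pair each header with the value in the same column (Python 3 syntax).
        for key, value in collections.OrderedDict(zip(headers, values)).items():
            print(key, value)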
    

    Output:

    Height 10
    Width 12
    Depth 5
    Height 0
    Width 3
    Depth 678
    Height 5
    Width 3
    Depth 4
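
The ElementTree approach above needs well-formed XML and prints header/value pairs instead of the one-line-per-row layout the question asks for. As an alternative, here is a minimal sketch using only the standard library's html.parser, which joins the cells of each row with tabs. The names TableToText and table_to_text are hypothetical helpers made up for this example, and the sketch assumes a single, non-nested table without rowspan or colspan.

```python
from html.parser import HTMLParser


class TableToText(HTMLParser):
    """Collect table cells row by row (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_text(html, sep="\t"):
    """Return one line per table row, with cells joined by sep."""
    parser = TableToText()
    parser.feed(html)
    parser.close()
    return "\n".join(sep.join(row) for row in parser.rows)


html = """
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""

print(table_to_text(html))
```

For the input from the question this prints the two rows "Height:" / "200" and "Width:" / "440", each pair separated by a tab. Passing sep=" " gives the single-space layout shown in the question instead.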