pythonpandas

Flatten data from Pandas read_html but extracting the links


I have a html table like the following:

html_text = "<table>
  <tr>
    <th>Home</th>
    <th>Score</th>
    <th>Away</th>
    <th>Report</th>
  </tr>
  <tr>
    <td>Arsenal</td>
    <td></td>
    <td>Manchester Utd</td>
    <td></td>
  </tr>
  <tr>
    <td>Everton</td>
    <td>2-0</td>
    <td>Liverpool</td>
    <td><a href="/asdasdasd/">Match Report</a></td>
  </tr>
</table>"

I'm loading it via requests and using Panda to convert to a dataset:

matches = pd.read_html(StringIO(str(html_text)), extract_links="all")[0]

This now gives me the following:

(Home, None),(Score, None),(Away, None),(Report, None)
(Arsenal, None),(NaN, None),(Manchester Utd, None),(NaN, None)
(Everton, None),(2-0, None),(Liverpool, None),(Match Report, /asdasdasd/)

What is the simplest way to flatten the dataset to (keeping only link in the "Report" column and the text elsewhere):

Home, Score, Away, Report
Arsenal, NaN, Manchester Utd, NaN
Everton, 2-0, Liverpool, /asdasdasd/

Solution

  • Since all of the columns of interest except Report are tuples where the data of interest is the first element, extract those first elements from those columns. For the Report column, extract the second element. Also, use extract_links='body' to prevent creating tuples from the headers.

    matches = pd.read_html(html_text, extract_links='body')[0]
    
    for col in matches.columns.difference(['Report']):
        matches[col] = matches[col].apply(lambda x: x[0]).replace('', None)
        
    matches['Report'] = matches['Report'].apply(lambda x: x[1])
    
          Home Score            Away       Report
    0  Arsenal  None  Manchester Utd         None
    1  Everton   2-0       Liverpool  /asdasdasd/