I have a html table like the following:
html_text = "<table>
<tr>
<th>Home</th>
<th>Score</th>
<th>Away</th>
<th>Report</th>
</tr>
<tr>
<td>Arsenal</td>
<td></td>
<td>Manchester Utd</td>
<td></td>
</tr>
<tr>
<td>Everton</td>
<td>2-0</td>
<td>Liverpool</td>
<td><a href="/asdasdasd/">Match Report</a></td>
</tr>
</table>"
I'm loading it via requests and using Panda to convert to a dataset:
matches = pd.read_html(StringIO(str(html_text)), extract_links="all")[0]
This now gives me the following:
(Home, None),(Score, None),(Away, None),(Report, None)
(Arsenal, None),(NaN, None),(Manchester Utd, None),(NaN, None)
(Everton, None),(2-0, None),(Liverpool, None),(Match Report, /asdasdasd/)
What is the simplest way to flatten the dataset to (keeping only link in the "Report" column and the text elsewhere):
Home, Score, Away, Report
Arsenal, NaN, Manchester Utd, NaN
Everton, 2-0, Liverpool, /asdasdasd/
Since all of the columns of interest except Report
are tuples where the data of interest is the first element, extract those first elements from those columns. For the Report
column, extract the second element. Also, use extract_links='body'
to prevent creating tuples from the headers.
matches = pd.read_html(html_text, extract_links='body')[0]
for col in matches.columns.difference(['Report']):
matches[col] = matches[col].apply(lambda x: x[0]).replace('', None)
matches['Report'] = matches['Report'].apply(lambda x: x[1])
Home Score Away Report
0 Arsenal None Manchester Utd None
1 Everton 2-0 Liverpool /asdasdasd/