The code runs up until the find_all calls. I'm hoping to extract the data from the online table for some data science work. Any help would be greatly appreciated.
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
# Create object page
page = requests.get(url)
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
soup
# Obtain information from tag <table>
table1 = soup.find("table", id='tbl-datatable')
table1
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
# Create a dataframe
mydata = pd.DataFrame(columns = headers)
# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
So far I have checked for syntax errors and tried other methods of extracting the information into an Excel file, but I have had no luck.
The code runs up until the find_all instances are called
Nope, it does not. It raises an error at find_all, but that does not mean it runs fine up to that point.
In my case, soup was completely empty, because requests returned response code 451, which is an error code. Adding a User-Agent header solves this issue.
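A minimal sketch of the fix (the exact browser-like User-Agent string is arbitrary); requests.Request(...).prepare() is used here only to show the header being attached, without any network traffic:

```python
import requests

# Sketch only: requests' default User-Agent identifies the script, which
# this site answers with HTTP 451; a browser-like string avoids that.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0)'}

# Prepare the request (no network call) just to inspect the sent header:
prepared = requests.Request(
    'GET', 'https://www.flightradar24.com/data/aircraft/ja11jc',
    headers=headers,
).prepare()
print(prepared.headers['User-Agent'])
```

In the real request you would pass the same dict via requests.get(url, headers=headers).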
Likewise, table1 was not found when parsing with lxml; one has to use html5lib on this particular page.
Why though? There can be many reasons for different behavior between parsers. As a rule of thumb, if lxml and html.parser don't bring the desired result, try html5lib: it is more failsafe and "correct", though not as fast.
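One concrete parser difference, shown on a hypothetical table snippet (assuming bs4 and html5lib are installed): html5lib repairs markup the way a browser would, e.g. inserting the implied <tbody>, while html.parser leaves the tree exactly as written.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>a</td></tr></table>"

# html.parser keeps the markup as-is: no <tbody> exists in the tree.
print(BeautifulSoup(html, "html.parser").find("tbody"))  # None

# html5lib builds the tree a browser would, so <tbody> is present.
print(BeautifulSoup(html, "html5lib").find("tbody") is None)  # False
```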
Now we are done with the initial error, yet there are a few more things to consider:
Your second loop will raise an error, because find_all('tr') also picks up the <tfoot> row, which is completely different from the <tbody> rows and from the resulting DataFrame.
You should either cycle through the <tbody> rows only, or escape the ValueError by adding an if clause that checks for a row/column mismatch:
if len(row) != len(mydata.columns): break
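A self-contained sketch of that guard, using a made-up table with the same shape as the flightradar24 one (a <tfoot> row whose column count differs from the <tbody> rows):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table: the <tfoot> row has fewer cells than the data rows.
html = """
<table>
  <thead><tr><th>A</th><th>B</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>2</td></tr>
    <tr><td>3</td><td>4</td></tr>
  </tbody>
  <tfoot><tr><td>footer</td></tr></tfoot>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
df = pd.DataFrame(columns=[th.text for th in table.find_all("th")])
for tr in table.find_all("tr")[1:]:
    row = [td.text for td in tr.find_all("td")]
    if len(row) != len(df.columns):  # the <tfoot> row trips this check
        break
    df.loc[len(df)] = row
print(len(df))  # 2 data rows, footer skipped
```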
Not an error, but this table has some empty column names, and cells that we don't want in the resulting DataFrame because they're rubbish. The columns can be filtered out in advance by checking the length of each column name; as for the rows, check for empty cells and for cells containing the unprintable non-breaking-space character.
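The row-level cleanup can be tried in isolation on a few hypothetical cell strings of the kind this table produces: stray whitespace, an empty string, and a non-breaking-space placeholder.

```python
import re

# Made-up raw cell texts mimicking what the scraper yields.
cells = ["  11 Dec 2022 ", "", "\xa0", "Tokunoshima\n(TKN)"]

# Drop empty and non-breaking-space cells, collapse runs of whitespace.
clean = [re.sub(r"\s+", " ", c.strip()) for c in cells
         if c != "" and "\xa0" not in c]
print(clean)  # ['11 Dec 2022', 'Tokunoshima (TKN)']
```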
That's it.
Oh, and by the way: use either four spaces or a tab for indentation; one space looks awful.
Here is a working solution and output:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
url = r'https://www.flightradar24.com/data/aircraft/ja11jc'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0)'}
r = requests.get(url, headers=headers)
soup = bs(r.content,'html5lib')
table = soup.find("table", id='tbl-datatable')
skip_column = 1
thead = table.find('thead')
headers = []
for i in thead.find_all('th')[skip_column:]:
    if len(i.text) > 1:
        headers.append(i.text)
df = pd.DataFrame(columns=headers)
tbody = table.find('tbody')
for j in tbody.find_all('tr'):
    row = j.find_all('td')
    row = [re.sub(r'\s+', ' ', i.text.strip(' ')) for i in row if i.text != '' and '\xa0' not in i.text]
    df.loc[len(df)] = row[skip_column:]
df
DATE FROM TO FLIGHT FLIGHT TIME STD ATD STA STATUS
0 11 Dec 2022 Tokunoshima (TKN) Kagoshima (KOJ) JL3798 — 10:00 — 11:10 Scheduled
1 11 Dec 2022 Amami (ASJ) Tokunoshima (TKN) JL3843 — 08:55 — 09:30 Scheduled
2 11 Dec 2022 Tokunoshima (TKN) Amami (ASJ) JL3844 — 07:45 — 08:15 Scheduled
... ... ... ... ... ... ... ... ... ...
65 04 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3465 1:05 05:25 05:26 06:40 Landed 06:31
66 04 Dec 2022 Amami (ASJ) Kagoshima (KOJ) JL3724 0:53 01:45 01:50 02:50 Landed 02:43
67 04 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3725 0:56 00:00 00:05 01:15 Landed 01:02
other methods of extracting the information into an excel file but I have had no luck so far
To save the resulting df to an Excel file (this needs an engine such as openpyxl installed), run:
df.to_excel("ja11jc.xlsx")