I have this code. I am trying to extract data from this website into pandas.
from pyquery import PyQuery as pq
import requests
import pandas as pd
url = "https://www.tsa.gov/travel/passenger-volumes"
content = requests.get(url).content
doc = pq(content)
Passengers = doc(".views-align-center").text()
Method 1:
df = pd.DataFrame([x.split(' ') for x in Passengers.split(' ')])
print(df)
Method 2:
Passengers = Passengers.replace(' ',';')
Passengers
For Method 1, is it possible to do pandas data frame unstack to get proper table structure?
Or is it better to do Method 2? How to split string periodically and load into pandas?
You can do this directly in Pandas:
import pandas as pd
import numpy as np
import requests
url = "https://www.tsa.gov/travel/passenger-volumes"
html = requests.get(url).content
df_list = pd.read_html(html) # gives a list of the DFs extracted
print(df_list[0])
which gives the DataFrame:
Date 2023 2022 2021 2020 2019
0 6/1/2023 2463873.0 2228271 1815931 391882 2623947
1 5/31/2023 2255052.0 2023231 1587910 304436 2370152
2 5/30/2023 2342489.0 2114935 1682752 267742 2247421
3 5/29/2023 2577437.0 2319237 1900170 353261 2499002
4 5/28/2023 2257766.0 2103022 1650454 352947 2555578
.. ... ... ... ... ... ...
359 6/7/2022 NaN 2052377 1560561 338382 2433189
360 6/6/2022 NaN 2279743 1828396 430414 2644981
361 6/5/2022 NaN 2387196 1984658 441255 2669860
362 6/4/2022 NaN 1981408 1681192 353016 2225952
363 6/3/2022 NaN 2332592 1879885 419675 2649808
[364 rows x 6 columns]
The NaN values in 2023 force the float dtype but you can then clean the data as required. For example:
df = df_list[0]
df['2023']= df['2023'].fillna(0).astype(int)