pythonpandaspython-requestspyquery

How to extract the table values and load into pandas data frame?


I have this code. I am trying to extract data from this website into pandas.

from pyquery import PyQuery as pq
import requests
import pandas as pd

url = "https://www.tsa.gov/travel/passenger-volumes"
content = requests.get(url).content
doc = pq(content)
Passengers = doc(".views-align-center").text()

Method 1:

df = pd.DataFrame([x.split(' ') for x in Passengers.split(' ')]) 
print(df)

Method 2:

Passengers = Passengers.replace(' ',';')
Passengers

For Method 1, is it possible to do pandas data frame unstack to get proper table structure?

Or is it better to do Method 2? How to split string periodically and load into pandas?


Solution

  • You can do this directly in Pandas:

    import pandas as pd
    import numpy as np
    import requests
    url = "https://www.tsa.gov/travel/passenger-volumes"
    html = requests.get(url).content
    df_list = pd.read_html(html)   # gives a list of the DFs extracted
    print(df_list[0])
    

    which gives the DataFrame:

              Date       2023     2022     2021    2020     2019
    0     6/1/2023  2463873.0  2228271  1815931  391882  2623947
    1    5/31/2023  2255052.0  2023231  1587910  304436  2370152
    2    5/30/2023  2342489.0  2114935  1682752  267742  2247421
    3    5/29/2023  2577437.0  2319237  1900170  353261  2499002
    4    5/28/2023  2257766.0  2103022  1650454  352947  2555578
    ..         ...        ...      ...      ...     ...      ...
    359   6/7/2022        NaN  2052377  1560561  338382  2433189
    360   6/6/2022        NaN  2279743  1828396  430414  2644981
    361   6/5/2022        NaN  2387196  1984658  441255  2669860
    362   6/4/2022        NaN  1981408  1681192  353016  2225952
    363   6/3/2022        NaN  2332592  1879885  419675  2649808
    
    [364 rows x 6 columns]
    

    The NaN values in 2023 force the float dtype but you can then clean the data as required. For example:

    df = df_list[0]
    df['2023']= df['2023'].fillna(0).astype(int)