python-3.xpandasurllib

How to make sense of this data extraction and convert it to pandas dataframe?


I have this code to extract some data online using python API.

import urllib.request
url_path = "https://www.censtatd.gov.hk/api/get.php?id=660-69001&lang=en&param=N4KABGBEDGBukC4yghSBJAcgEUWA2uKmgJYB2AJgPoCKAQgErpUDK6kANEcZOdfU1a4uxUpSoAWAKyduqXuIkA2WaKh9JAdlWiF-RswkBOKlIAMpgIymATDZ08Nxh-I0DmAdQAqLsfsEAgj4iuhpSNpa+6uJSAMz2IY4xMomuMSqpflQAUlF6VEqRmdHUUkZFcllKEpY2BTWxeRrVEvUmSoUFSo3F+e5UAGIBTeL9ASwj1EpSXQmVJbQGVAAyk1QAsmuYa-1eAPI7SwDCB3IAukQAviGQAM7wSCjyBniEapAMAIYA7lSWFAAHKgkCgADyoZEg5yuNwBAFMAE4kAD2FDwTzQtwALp8EVi8JAbGYzOZIjCiLw0UhIB0zABaJRGYkVNAAG0+ZAA5gS4ZCQJcgA"
with urllib.request.urlopen(url_path) as url:
    s = url.read()

How to make this data dump into useful pandas data frame format?


Solution

  • The returned data is a JSON document. This converts the dataset in the JSON data, but there is other metadata in data:

    import urllib.request
    import json
    import pandas as pd
    
    url_path = "https://www.censtatd.gov.hk/api/get.php?id=660-69001&lang=en&param=N4KABGBEDGBukC4yghSBJAcgEUWA2uKmgJYB2AJgPoCKAQgErpUDK6kANEcZOdfU1a4uxUpSoAWAKyduqXuIkA2WaKh9JAdlWiF-RswkBOKlIAMpgIymATDZ08Nxh-I0DmAdQAqLsfsEAgj4iuhpSNpa+6uJSAMz2IY4xMomuMSqpflQAUlF6VEqRmdHUUkZFcllKEpY2BTWxeRrVEvUmSoUFSo3F+e5UAGIBTeL9ASwj1EpSXQmVJbQGVAAyk1QAsmuYa-1eAPI7SwDCB3IAukQAviGQAM7wSCjyBniEapAMAIYA7lSWFAAHKgkCgADyoZEg5yuNwBAFMAE4kAD2FDwTzQtwALp8EVi8JAbGYzOZIjCiLw0UhIB0zABaJRGYkVNAAG0+ZAA5gS4ZCQJcgA"
    with urllib.request.urlopen(url_path) as url:
        s = url.read()
    
    data = json.loads(s)
    df = pd.DataFrame(data['dataSet'])
    print(df)
    

    I prefer the 3rd party requests module:

    import requests
    import pandas as pd
    
    url_path = "https://www.censtatd.gov.hk/api/get.php?id=660-69001&lang=en&param=N4KABGBEDGBukC4yghSBJAcgEUWA2uKmgJYB2AJgPoCKAQgErpUDK6kANEcZOdfU1a4uxUpSoAWAKyduqXuIkA2WaKh9JAdlWiF-RswkBOKlIAMpgIymATDZ08Nxh-I0DmAdQAqLsfsEAgj4iuhpSNpa+6uJSAMz2IY4xMomuMSqpflQAUlF6VEqRmdHUUkZFcllKEpY2BTWxeRrVEvUmSoUFSo3F+e5UAGIBTeL9ASwj1EpSXQmVJbQGVAAyk1QAsmuYa-1eAPI7SwDCB3IAukQAviGQAM7wSCjyBniEapAMAIYA7lSWFAAHKgkCgADyoZEg5yuNwBAFMAE4kAD2FDwTzQtwALp8EVi8JAbGYzOZIjCiLw0UhIB0zABaJRGYkVNAAG0+ZAA5gS4ZCQJcgA"
    response = requests.get(url_path)
    if response.ok:
        data = response.json()
        df = pd.DataFrame(data['dataSet'])
        print(df)
    else:
        print(response)
    

    Output:

                  IND                                       INDDesc  ... figure sd_value
    0          ind_45                           Import/export trade  ...   73.4         
    1          ind_46                                     Wholesale  ...   61.8         
    2          ind_47                                        Retail  ...     43         
    3          ind_49                          <i>Land transport<i>  ...   60.1         
    4         ind_521                       Warehousing and storage  ...   35.4         
    ...           ...                                           ...  ...    ...      ...
    2155  ind_QBRI_CO  Computer and information technology services  ...  106.1        p
    2156  ind_QBRI_FA  <i>Financial markets and asset management<i>  ...  113.6        p
    2157   ind_QBRI_L                                   Real estate  ...  130.4        p
    2158  ind_QBRI_TO   Tourism, convention and exhibition services  ...     17        p
    2159  ind_QBRI_WT                         <i>Water transport<i>  ...  162.8        p
    
    [2160 rows x 8 columns]