I have this code to extract some data online using python API.
import urllib.request
url_path = "https://www.censtatd.gov.hk/api/get.php?id=660-69001&lang=en¶m=N4KABGBEDGBukC4yghSBJAcgEUWA2uKmgJYB2AJgPoCKAQgErpUDK6kANEcZOdfU1a4uxUpSoAWAKyduqXuIkA2WaKh9JAdlWiF-RswkBOKlIAMpgIymATDZ08Nxh-I0DmAdQAqLsfsEAgj4iuhpSNpa+6uJSAMz2IY4xMomuMSqpflQAUlF6VEqRmdHUUkZFcllKEpY2BTWxeRrVEvUmSoUFSo3F+e5UAGIBTeL9ASwj1EpSXQmVJbQGVAAyk1QAsmuYa-1eAPI7SwDCB3IAukQAviGQAM7wSCjyBniEapAMAIYA7lSWFAAHKgkCgADyoZEg5yuNwBAFMAE4kAD2FDwTzQtwALp8EVi8JAbGYzOZIjCiLw0UhIB0zABaJRGYkVNAAG0+ZAA5gS4ZCQJcgA"
with urllib.request.urlopen(url_path) as url:
s = url.read()
How to make this data dump into useful pandas data frame format?
The returned data is a JSON document. This converts the dataset in the JSON data, but there is other metadata in data
:
import urllib.request
import json
import pandas as pd
url_path = "https://www.censtatd.gov.hk/api/get.php?id=660-69001&lang=en¶m=N4KABGBEDGBukC4yghSBJAcgEUWA2uKmgJYB2AJgPoCKAQgErpUDK6kANEcZOdfU1a4uxUpSoAWAKyduqXuIkA2WaKh9JAdlWiF-RswkBOKlIAMpgIymATDZ08Nxh-I0DmAdQAqLsfsEAgj4iuhpSNpa+6uJSAMz2IY4xMomuMSqpflQAUlF6VEqRmdHUUkZFcllKEpY2BTWxeRrVEvUmSoUFSo3F+e5UAGIBTeL9ASwj1EpSXQmVJbQGVAAyk1QAsmuYa-1eAPI7SwDCB3IAukQAviGQAM7wSCjyBniEapAMAIYA7lSWFAAHKgkCgADyoZEg5yuNwBAFMAE4kAD2FDwTzQtwALp8EVi8JAbGYzOZIjCiLw0UhIB0zABaJRGYkVNAAG0+ZAA5gS4ZCQJcgA"
with urllib.request.urlopen(url_path) as url:
s = url.read()
data = json.loads(s)
df = pd.DataFrame(data['dataSet'])
print(df)
I prefer the 3rd party requests
module:
import requests
import pandas as pd
url_path = "https://www.censtatd.gov.hk/api/get.php?id=660-69001&lang=en¶m=N4KABGBEDGBukC4yghSBJAcgEUWA2uKmgJYB2AJgPoCKAQgErpUDK6kANEcZOdfU1a4uxUpSoAWAKyduqXuIkA2WaKh9JAdlWiF-RswkBOKlIAMpgIymATDZ08Nxh-I0DmAdQAqLsfsEAgj4iuhpSNpa+6uJSAMz2IY4xMomuMSqpflQAUlF6VEqRmdHUUkZFcllKEpY2BTWxeRrVEvUmSoUFSo3F+e5UAGIBTeL9ASwj1EpSXQmVJbQGVAAyk1QAsmuYa-1eAPI7SwDCB3IAukQAviGQAM7wSCjyBniEapAMAIYA7lSWFAAHKgkCgADyoZEg5yuNwBAFMAE4kAD2FDwTzQtwALp8EVi8JAbGYzOZIjCiLw0UhIB0zABaJRGYkVNAAG0+ZAA5gS4ZCQJcgA"
response = requests.get(url_path)
if response.ok:
data = response.json()
df = pd.DataFrame(data['dataSet'])
print(df)
else:
print(response)
Output:
IND INDDesc ... figure sd_value
0 ind_45 Import/export trade ... 73.4
1 ind_46 Wholesale ... 61.8
2 ind_47 Retail ... 43
3 ind_49 <i>Land transport<i> ... 60.1
4 ind_521 Warehousing and storage ... 35.4
... ... ... ... ... ...
2155 ind_QBRI_CO Computer and information technology services ... 106.1 p
2156 ind_QBRI_FA <i>Financial markets and asset management<i> ... 113.6 p
2157 ind_QBRI_L Real estate ... 130.4 p
2158 ind_QBRI_TO Tourism, convention and exhibition services ... 17 p
2159 ind_QBRI_WT <i>Water transport<i> ... 162.8 p
[2160 rows x 8 columns]