
How can I scrape data efficiently and clean it?


I scraped data from a website and I am having trouble cleaning it. This is the code I scraped the data with. Is this best practice?

import json

import requests
from bs4 import BeautifulSoup

all_countries_links = []
countries = []
all_data = []
data_dict = {}
data_value = []


def main(page):
    # Collect the link and name of every country listed on the home page.
    soup = BeautifulSoup(page.content, "lxml")
    # attrs must be a dict ({"class": ...}), not a set ({"class", ...}).
    all_page = soup.find("div", {"class": "CountryList"}).find_all("a", href=True)
    for link in all_page:
        all_countries_links.append(link["href"])
        countries.append(link.text.strip())


def scrape_country(all_countries_links, countries):
    # Only the first two countries are scraped here.
    for country in all_countries_links[:2]:
        page2 = requests.get(f"https://data.un.org/{country}")
        soup = BeautifulSoup(page2.content, "lxml")
        all_page = soup.find("ul", {"class": "pure-menu-list"})
        for table in all_page.contents:
            all_data.append(table.text.strip())


page1 = requests.get("https://data.un.org/")
main(page1)
scrape_country(all_countries_links, countries)

file_path = "data.json"
with open(file_path, "w") as f:
    json.dump(all_data, f, indent=4)
print(f"Data saved to {file_path}")

This is a small example of the data after collecting it.

[
    "",
    "General Information\n\nRegion\u00a0\n\u00a0\nSouthern Asia\nPopulation\u00a0(000, 2021)\n\u00a0\n39 835a\nPop. density\u00a0(per km2, 2021)\n\u00a0\n61a\nCapital city\u00a0\n\u00a0\nKabul\nCapital city pop.\u00a0(000, 2021)\n\u00a0\n4 114.0b\nUN membership date\u00a0\n\u00a0\n19-Nov-46\nSurface area\u00a0(km2)\n\u00a0\n652 864b\nSex ratio\u00a0(m per 100 f)\n\u00a0\n105.3a\nNational currency\u00a0\n\u00a0\nAfghani (AFN)\nExchange rate\u00a0(per US$)\n\u00a0\n77.1c",   
]

I tried to separate the data with this code:

cleaned_data =[]

# for line in cleaned_data:
#     print(line.split('\n'))
# new_data = [line for line in all_data.split()]

for line in all_data[:1]:
    for line2 in line.split():
        if line2 not in ["General", "Information", "Economic", "Social", "indicators"]:
            cleaned_data.append(line2)

But I was hoping to find a better way.


Solution

  • For this type of task, I'd recommend the pandas read_html() function:

    from io import StringIO
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    country_url = "https://data.un.org/en/iso/af.html"
    
    soup = BeautifulSoup(requests.get(country_url).content, "html.parser")
    
    for table in soup.select("details table"):
        summary = table.find_previous("summary").text
        df = pd.read_html(StringIO(str(table)))[0]
        df["table_name"] = summary
        print(df)
        print("-" * 80)
    

    Prints:

    
    ...
    
    --------------------------------------------------------------------------------
                                                   Unnamed: 0         2010         2015          2021           table_name
    0       GDP: Gross domestic product (million current US$)       14 699       18 713       17 877b  Economic indicators
    1          GDP growth rate (annual %, const. 2015 prices)          5.2         -1.4            4b  Economic indicators
    2                            GDP per capita (current US$)        503.6        543.8        469.9b  Economic indicators
    3           Economy: Agriculture (% of Gross Value Added)         33.2         27.3       26.9d,b  Economic indicators
    4              Economy: Industry (% of Gross Value Added)           13         10.8     12.8e,f,b  Economic indicators
    5         Economy: Services and other activity (% of GVA)         53.8         61.9       60.4g,b  Economic indicators
    6              Employment in agricultureh (% of employed)         54.7         47.1         42.4c  Economic indicators
    7                 Employment in industryh (% of employed)         14.4           17         18.3c  Economic indicators
    8                    Employment in servicesh (% employed)         30.9         35.8         39.4c  Economic indicators
    9                       Unemploymenth (% of labour force)         11.5         11.4         11.2c  Economic indicators
    10  Labour force participation rateh (female/male pop. %)  14.9 / 78.4  18.8 / 76.2  21.8 / 74.6c  Economic indicators
    11                   CPI: Consumer Price Index (2010=100)          100          133          150b  Economic indicators
    12          Agricultural production index (2014-2016=100)           93           96          111b  Economic indicators
    13     International trade: exports (million current US$)          388          571      1 022h,c  Economic indicators
    14     International trade: imports (million current US$)        5 154        7 723      9 683h,c  Economic indicators
    15     International trade: balance (million current US$)      - 4 766      - 7 151    - 8 661h,c  Economic indicators
    16     Balance of payments, current account (million US$)         -578      - 4 193      - 3 137c  Economic indicators
    --------------------------------------------------------------------------------
                                                              Unnamed: 0          2010          2015           2021         table_name
    0                         Population growth ratei (average annual %)           2.6           3.3           2.5c  Social indicators
    1                           Urban population (% of total population)          23.7          24.8          25.8b  Social indicators
    2                   Urban population growth ratei (average annual %)           3.7             4            ...  Social indicators
    3                     Fertility rate, totali (live births per woman)           6.5           5.4           4.6c  Social indicators
    4                   Life expectancy at birthi (females/males, years)   61.0 / 58.3   63.8 / 60.9   65.8 / 62.8c  Social indicators
    5                Population age distribution (0-14/60+ years old, %)    48.2 / 3.9    44.9 / 4.0    41.2 / 4.3a  Social indicators
    6                 International migrant stockj (000/% of total pop.)   102.3 / 0.4   339.4 / 1.0   144.1 / 0.4c  Social indicators
    7                      Refugees and others of concern to UNHCR (000)      1 200.0k       1 421.4       2 802.9c  Social indicators
    8                     Infant mortality ratei (per 1 000 live births)          72.2          60.1          51.7c  Social indicators
    9                             Health: Current expenditure (% of GDP)           8.6          10.1           9.4l  Social indicators
    10                               Health: Physicians (per 1 000 pop.)           0.2           0.3           0.3m  Social indicators
    11                      Education: Government expenditure (% of GDP)           3.5           3.3         4.1h,n  Social indicators
    12          Education: Primary gross enrol. ratio (f/m per 100 pop.)  80.6 / 118.6  83.5 / 122.7  82.9 / 124.2l  Social indicators
    13        Education: Secondary gross enrol. ratio (f/m per 100 pop.)   33.3 / 66.9   36.8 / 65.9   40.0 / 70.1l  Social indicators
    14  Education: Upper secondary gross enrol. ratio (f/m per 100 pop.)   17.8 / 42.7   27.1 / 52.6   28.5 / 52.4l  Social indicators
    15                      Intentional homicide rate (per 100 000 pop.)           3.4           9.8           6.7l  Social indicators
    16                   Seats held by women in national parliaments (%)          27.3          27.7            27o  Social indicators
    --------------------------------------------------------------------------------
    
    ...
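
The values in the output above still carry footnote markers (for example 17 877b or 12.8e,f,b) and spaces as thousands separators, so they are strings rather than numbers. Below is a minimal sketch of one way to clean them. It assumes that markers are always lowercase letters, optionally comma-separated, placed directly after the last digit. The names split_footnote and to_number are hypothetical helpers, not part of pandas or BeautifulSoup.

```python
import re

# Assumption: a footnote marker is one lowercase letter, or several separated
# by commas, directly after the last digit of the value (e.g. "b", "e,f,b").
FOOTNOTE_RE = re.compile(r"^(?P<value>.*?\d)\s*(?P<notes>[a-z](?:,[a-z])*)$")

# A plain number once all whitespace has been removed, e.g. "17877" or "-4766".
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def split_footnote(cell):
    """Return (value_text, footnote_markers) for one table cell."""
    # str.split() also splits on non-breaking spaces (\u00a0).
    text = " ".join(str(cell).split())
    match = FOOTNOTE_RE.match(text)
    if match:
        return match.group("value"), match.group("notes")
    return text, ""


def to_number(value_text):
    """Convert '17 877' or '- 4 766' to a float; return None for anything else."""
    compact = re.sub(r"\s+", "", value_text)
    if NUMBER_RE.fullmatch(compact):
        return float(compact)
    return None  # paired values such as "14.9 / 78.4", or "..." for missing data


# Sample cells copied from the printed tables above.
samples = ["17 877b", "26.9d,b", "- 8 661h,c", "21.8 / 74.6c", "...", "5.2"]
for cell in samples:
    value_text, notes = split_footnote(cell)
    print(repr(cell), "->", to_number(value_text), repr(notes))
```

To apply this to one of the DataFrames, something like df["2021"].map(lambda c: to_number(split_footnote(c)[0])) should work. Paired values such as 21.8 / 74.6c come back as None and would need to be split on " / " first. Markers fused into the row labels (for example "Unemploymenth") are not handled by this sketch.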