
Web Scraping Wikipedia


I need to get only the continent (North America) from a Wikipedia page by URL. (In the code below, I will later replace the country, in this case "Guatemala", with a parameter in Power BI.) However, I am getting the whole <a> tag. How can I extract just the text?

import requests as rq
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import re

url = 'https://en.wikipedia.org/wiki/Geography_of_Guatemala'
page = rq.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
res = soup.find_all('td', class_='infobox-data')
df = pd.DataFrame(res)
df = df.to_numpy()
df = str(df[0])
print(df)
print(re.search('\">(.*?)\</a>', df).group(1))

This is the data frame:

[<td class="infobox-data"><a href="/wiki/North_America" title="North America">North America</a></td>]

and this is the output of re.search:

<a href="/wiki/North_America" title="North America">North America

Solution

  • I don't know if it's the best solution, but it works and follows a clear logic. I assume you want to change the country and get its continent.

    So I iterate over all the results of your find_all call, append only the text of each element to a new list, and take element 0 (the first one):

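    import requests as rq
    from bs4 import BeautifulSoup

    url = 'https://en.wikipedia.org/wiki/Geography_of_Guatemala'
    page = rq.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # All value cells of the infobox
    res = soup.find_all('td', class_='infobox-data')
    info_list = []
    for i in res:
        # .text keeps only the text content and drops the tags
        info_list.append(i.text)
    print(info_list[0])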
    

    Alternatively, since you only want the first value, you can use BeautifulSoup's find function on its own:

    # find returns only the first matching element (or None if there is none)
    res = soup.find('td', class_='infobox-data')
    print(res.text)
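
As for why the original re.search returned the whole <a> tag: the first '">' in the string is the one that closes <td class="infobox-data">, so the lazy group starts there and swallows the opening <a ...> tag too. Here is a minimal sketch that puts both approaches side by side. It runs offline on the cell copied from the question's output; the function names and the geography_url helper are my own, hypothetical names, not part of BeautifulSoup or the question's code.

```python
import re
from bs4 import BeautifulSoup

# Sample cell copied from the question's output, so the sketch runs offline.
SAMPLE_HTML = (
    '<table><tr><td class="infobox-data">'
    '<a href="/wiki/North_America" title="North America">North America</a>'
    '</td></tr></table>'
)


def first_infobox_value(html):
    """Return the text of the first <td class="infobox-data"> cell, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td', class_='infobox-data')
    if cell is None:
        return None
    # get_text(strip=True) drops the tags and the surrounding whitespace
    return cell.get_text(strip=True)


def geography_url(country):
    """Hypothetical helper: build the article URL from a country name."""
    return 'https://en.wikipedia.org/wiki/Geography_of_' + country.replace(' ', '_')


def link_text_with_regex(html):
    """Regex alternative: anchor the pattern on the <a> tag itself."""
    match = re.search(r'<a [^>]*>(.*?)</a>', html)
    return match.group(1) if match else None


print(first_infobox_value(SAMPLE_HTML))
print(link_text_with_regex(SAMPLE_HTML))
print(geography_url('Guatemala'))
```

For the live page you would pass rq.get(geography_url('Guatemala')).text to first_infobox_value. This assumes the continent is the first infobox-data cell, as it is for Guatemala in the question; I have not checked that this holds for other countries.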