I am trying to apply a bs4 approach to a Wikipedia page: the results are not stored in a df
Scraping Wikipedia is a very common technique, and one suitable approach can be reused for many different jobs. However, I had some issues getting the results back and storing them in a df.
As an example of a very common Wikipedia/bs4 job, we can take this one:
This page links to more than 600 results on sub-pages: url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland"
For a first experimental script I proceed like this: first I scrape the table from the Wikipedia page, and afterwards I convert it into a pandas DataFrame. To begin, install the necessary packages: make sure you have requests, beautifulsoup4, pandas, and lxml (the parser used in the code below) installed. You can install them using pip if you haven't already:
pip install requests beautifulsoup4 pandas lxml
Then the script itself:
import pandas as pd

# URL of the Wikipedia overview page
url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland"
base_url = 'https://de.wikipedia.org'

# extract_links='all' turns every cell into a (text, href) tuple
table = pd.read_html(url, extract_links='all')[1]
# Keep the plain text where a cell has no link, otherwise build the absolute URL
table = table.apply(lambda col: [v[0] if v[1] is None else f'{base_url}{v[1]}' for v in col])
links = list(table.iloc[:, 0])

for link in links:
    print('\n', link)
    try:
        # Read the first table found on the linked page
        df = pd.read_html(link)[0]
        print(df)
    except Exception as e:
        print(e)
Look at what I get back: only two records instead of hundreds. By the way, I guess the best way would be to collect everything in a df and/or store it.
Document is empty
https://de.wikipedia.org/wiki/Aach_(Hegau)
Wappen \
0 NaN
1 NaN
2 Basisdaten
3 Koordinaten:
4 Bundesland:
5 Regierungsbezirk:
6 Landkreis:
7 Höhe:
8 Fläche:
9 Einwohner:
10 Bevölkerungsdichte:
11 Postleitzahl:
12 Vorwahl:
13 Kfz-Kennzeichen:
14 Gemeindeschlüssel:
15 LOCODE:
16 Adresse der Stadtverwaltung:
17 Website:
18 Bürgermeister:
19 Lage der Stadt Aach im Landkreis Konstanz
20 Karte
Deutschlandkarte
0 NaN
1 NaN
2 Basisdaten
3 47° 51′ N, 8° 51′ OKoordinaten: 47° 51′ N, 8° ...
4 Baden-Württemberg
5 Freiburg
6 Konstanz
7 545 m ü. NHN
8 10,68 km2
9 2384 (31. Dez. 2022)[1]
10 223 Einwohner je km2
11 78267
12 07774
13 KN, STO
14 08 3 35 001
15 DE AAC
16 Hauptstraße 16 78267 Aach
17 www.aach.de
18 Manfred Ossola
19 Lage der Stadt Aach im Landkreis Konstanz
20 Karte
Note: there are several hundred records there.
See the infobox: I want to fetch the data of the infobox.
Update: the aim is to get the full results, stored in a df and containing all the data in the infobox (see image above), including the contact information etc.
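A minimal sketch of how the per-city tables could be collected into one df, assuming each infobox arrives as a two-column table with "label:" in the first column and the value in the second, as in the printed output above. The helper name infobox_to_record and the hand-built sample frames are illustrative only; on real pages the frames would come from pd.read_html.

```python
import pandas as pd


def infobox_to_record(city: str, infobox: pd.DataFrame) -> dict:
    """Turn a two-column infobox table into one flat record.

    Assumes column 0 holds labels such as 'Bundesland:' and column 1
    holds the matching values.
    """
    record = {"City": city}
    for label, value in zip(infobox.iloc[:, 0], infobox.iloc[:, 1]):
        # Skip empty image rows and caption rows that are not "label:" pairs.
        if isinstance(label, str) and label.endswith(":"):
            record[label.rstrip(":")] = value
    return record


# Hand-built stand-ins for what pd.read_html returns for two city pages.
aach = pd.DataFrame({
    "Wappen": [None, "Basisdaten", "Bundesland:", "Landkreis:", "Postleitzahl:"],
    "Deutschlandkarte": [None, "Basisdaten", "Baden-Württemberg", "Konstanz", "78267"],
})
aachen = pd.DataFrame({
    "Wappen": ["Bundesland:", "Postleitzahlen:"],
    "Deutschlandkarte": ["Nordrhein-Westfalen", "52062–52080"],
})

records = [infobox_to_record("Aach", aach), infobox_to_record("Aachen", aachen)]
df = pd.DataFrame(records)  # labels a city does not have become NaN
print(df)
```

Building a list of dicts first and calling pd.DataFrame once at the end means cities with different label sets (Postleitzahl vs. Postleitzahlen) still end up in one table, with NaN where a label is missing.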
Update 2:
The overview page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland
It takes us to approx. 1000 sub-pages, like the following:
Aach (Hegau): https://de.wikipedia.org/wiki/Aach_(Hegau)
Aachen: https://de.wikipedia.org/wiki/Aachen
Aalen: https://de.wikipedia.org/wiki/Aalen
Here is an example of the so-called "infobox", for Babenhausen (Hessen): https://de.wikipedia.org/wiki/Babenhausen_(Hessen)
+----------------------+------------------------------+
| Koordinaten:         | 49° 58′ N, 8° 57′ O          |
| Bundesland:          | Hessen                       |
| Regierungsbezirk:    | Darmstadt                    |
| Landkreis:           | Darmstadt-Dieburg            |
| Höhe:                | 124 m ü. NHN                 |
| Fläche:              | 66,85 km2                    |
| Einwohner:           | 17.579 (31. Dez. 2023)[1]    |
| Bevölkerungsdichte:  | 263 Einwohner je km2         |
| Postleitzahl:        | 64832                        |
| Vorwahl:             | 06073                        |
| Kfz-Kennzeichen:     | DA, DI                       |
| Gemeindeschlüssel:   | 06 4 32 002                  |
| Stadtgliederung:     | 6 Stadtteile                 |
| Adresse der          | Rathaus                      |
| Stadtverwaltung:     | Marktplatz 2                 |
|                      | 64832 Babenhausen            |
| Website:             | www.babenhausen.de           |
| Bürgermeister:       | Dominik Stadler (parteilos)  |
+----------------------+------------------------------+
https://de.wikipedia.org/wiki/Bacharach https://de.wikipedia.org/wiki/Backnang
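The sub-page addresses above are built by gluing 'https://de.wikipedia.org' onto each href. Here is a small sketch of the same step with the standard library's urljoin, which also copes with hrefs that are already absolute. BASE_URL and absolute_url are illustrative names and not part of the scripts in this post. The sketch also strips the "#..." fragment from the overview URL, since a fragment is never sent to the server anyway.

```python
from urllib.parse import urldefrag, urljoin

BASE_URL = "https://de.wikipedia.org"


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve an href taken from the overview page against the site root."""
    return urljoin(base, href)


list_url = (
    "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland"
    "#Liste_der_St%C3%A4dte_in_Deutschland"
)

print(urldefrag(list_url).url)                            # overview URL without the fragment
print(absolute_url("/wiki/Aach_(Hegau)"))                 # relative href becomes absolute
print(absolute_url("https://de.wikipedia.org/wiki/Aalen"))  # absolute href is left unchanged
```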
Update 3: if I run this code to fetch 300 records, it works well; if I run it to fetch 2400, it fails.
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_info(city_url: str) -> dict:
    info_data = {}
    response = requests.get(city_url)
    soup = BeautifulSoup(response.text, 'lxml')
    for x in soup.find('tbody').find_all(
            lambda tag: tag.name == 'tr' and tag.get('class') == ['hintergrundfarbe-basis']):
        if not x.get('style'):
            if 'Koordinaten' in x.get_text():
                info_data['Koordinaten'] = x.findNext('span', class_='coordinates').get_text()
            else:
                info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
    info_data['Web site'] = soup.find('a', {'title':'Website'}).findNext('a').get('href')
    return info_data


cities = []
response = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland')
soup = BeautifulSoup(response.text, 'lxml')
for city in soup.find_all('dd'):  # [:2500]
    city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
    result = {'City': city.get_text(), 'URL': 'https://de.wikipedia.org' + city.findNext('a').get('href')}
    result |= get_info(city_url)
    cities.append(result)
df = pd.DataFrame(cities)
print(df.to_string())
------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-4391c852fd75> in <cell line: 24>()
25 city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
26 result = {'City': city.get_text(), 'URL': 'https://de.wikipedia.org' + city.findNext('a').get('href')}
---> 27 result |= get_info(city_url)
28 cities.append(result)
29 df = pd.DataFrame(cities)
<ipython-input-28-4391c852fd75> in get_info(city_url)
15 else:
16 info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
---> 17 info_data['Web site'] = soup.find('a', {'title':'Website'}).findNext('a').get('href')
18 return info_data
19
AttributeError: 'NoneType' object has no attribute 'findNext'
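The traceback says that soup.find('a', {'title':'Website'}) returned None for at least one page, so a single city without that link stops the whole run. One possible way to keep a long run alive is to wrap each page in try/except and remember which URLs failed. This is only a sketch: collect_records and fake_fetch are hypothetical names, and fake_fetch stands in for get_info so the example runs without network access.

```python
from typing import Callable, Iterable


def collect_records(urls: Iterable[str], fetch: Callable[[str], dict]):
    """Call fetch(url) for every URL and keep going when one page fails."""
    records, failures = [], []
    for url in urls:
        try:
            records.append({"URL": url, **fetch(url)})
        except Exception as exc:  # e.g. AttributeError from a missing element
            failures.append((url, repr(exc)))
    return records, failures


def fake_fetch(url: str) -> dict:
    """Stand-in for get_info: fails for one page the way the traceback does."""
    if url.endswith("/B"):
        website_link = None  # what soup.find(...) returns when nothing matches
        return {"Web site": website_link.get("href")}
    return {"Bundesland": "Hessen"}


records, failures = collect_records(
    ["https://example.org/A", "https://example.org/B"], fake_fetch
)
print(len(records), "ok;", len(failures), "failed")
for url, error in failures:
    print(url, error)
```

The failures list makes it possible to inspect or retry the problem pages afterwards instead of losing the records that were already fetched.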
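Every city is in a dd tag, so you can just use the find_all() function to get the name and URL. Then go through the URLs one by one and get the table. The Website lookup is guarded with an if, because soup.find() returns None on pages that have no such link. The output below comes from a run limited to the first 5 cities with a [:5] slice on the loop; without the slice, the loop processes the full list.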
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_info(city_url: str) -> dict:
    """Collect the infobox label/value pairs of one city page."""
    info_data = {}
    response = requests.get(city_url)
    soup = BeautifulSoup(response.text, 'lxml')
    # Only <tr> rows whose class is exactly 'hintergrundfarbe-basis'
    for x in soup.find('tbody').find_all(
            lambda tag: tag.name == 'tr' and tag.get('class') == ['hintergrundfarbe-basis']):
        if not x.get('style'):
            if 'Koordinaten' in x.get_text():
                info_data['Koordinaten'] = x.findNext('span', class_='coordinates').get_text()
            else:
                # Text before the first colon is the label, text after the last colon is the value
                info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
    # soup.find() returns None when a page has no 'Website' link, so check first
    website = soup.find('a', {'title': 'Website'})
    if website:
        info_data['Web site'] = website.findNext('a').get('href')
    return info_data


cities = []
response = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland')
soup = BeautifulSoup(response.text, 'lxml')
for city in soup.find_all('dd'):  # add [:5] here for a quick test run
    city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
    result = {'City': city.get_text(), 'URL': city_url}
    result |= get_info(city_url)
    cities.append(result)
df = pd.DataFrame(cities)
print(df.to_string())
OUTPUT:
City URL Koordinaten Bundesland Regierungsbezirk Landkreis Höhe Fläche Einwohner Bevölkerungsdichte Postleitzahl Vorwahl Kfz-Kennzeichen Gemeindeschlüssel Adresse derStadtverwaltung Bürgermeister Postleitzahlen Vorwahlen Stadtgliederung Oberbürgermeisterin Oberbürgermeister Erste Bürgermeisterin Erster Bürgermeister
0 Aach (BW) https://de.wikipedia.org/wiki/Aach_(Hegau) 47° 51′ N, 8° 51′ O Baden-Württemberg Freiburg Konstanz 545 m ü.NHN 10,68 km2 2384(31. Dez. 2022)[1] 223 Einwohner je km2 78267 07774 KN,STO 08 3 35 001 Hauptstraße 1678267 Aach Manfred Ossola NaN NaN NaN NaN NaN NaN NaN
1 Aachen (NW) https://de.wikipedia.org/wiki/Aachen 50° 47′ N, 6° 5′ O Nordrhein-Westfalen Köln Städteregion Aachen 175 m ü.NHN 160,85 km2 252.769(31. Dez. 2023)[1] 1571 Einwohner je km2 NaN NaN AC, MON 05 3 34 002 Markt52062 Aachen NaN 52062–52080 0241, 02405, 02407, 02408 7Stadtbezirke Sibylle Keupen(parteilos) NaN NaN NaN
2 Aalen (BW) https://de.wikipedia.org/wiki/Aalen 48° 50′ N, 10° 6′ O Baden-Württemberg Stuttgart Ostalbkreis 430 m ü.NHN 146,58 km2 68.816(31. Dez. 2022)[1] 469 Einwohner je km2 NaN NaN AA, GD 08 1 36 088 NaN NaN 73430–73434, 73453 07361, 07366, 07367 Kernstadtund 8Stadtbezirke NaN Frederick Brütting(SPD) NaN NaN
3 Abenberg (BY) https://de.wikipedia.org/wiki/Abenberg 49° 15′ N, 10° 58′ O Bayern Mittelfranken Roth 414 m ü.NHN 48,41 km2 5614(31. Dez. 2023)[1] 116 Einwohner je km2 91183 09178 RH, HIP 09 5 76 111 Stillaplatz 191183 Abenberg NaN NaN NaN 14Gemeindeteile NaN NaN Susanne König (parteilos) NaN
4 Abensberg (BY) https://de.wikipedia.org/wiki/Abensberg 48° 49′ N, 11° 51′ O Bayern Niederbayern Kelheim 370 m ü.NHN 60,26 km2 14.685(31. Dez. 2023)[1] 244 Einwohner je km2 93326 09443 KEH,MAI,PAR, RID,ROL 09 2 73 111 Stadtplatz 193326 Abensberg NaN NaN NaN 22Gemeindeteile NaN NaN NaN Bernhard Resch[2]
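One detail of the label/value split in get_info: split(':')[0] and split(':')[-1] only give the expected pair when the row text contains exactly one colon. A value that itself contains a colon would be cut short. Below is a small sketch of an alternative using str.partition, which splits at the first colon only. The helper name split_label_value and the sample row text are made up for illustration.

```python
def split_label_value(text: str) -> tuple[str, str]:
    """Split 'Label: value' at the first colon only."""
    label, _, value = text.partition(":")
    return label.strip(), value.strip()


row = "Website:https://www.example.org"  # hypothetical row text

print(row.split(":")[-1])       # '//www.example.org' (the scheme is lost)
print(split_label_value(row))   # ('Website', 'https://www.example.org')
print(split_label_value("Bundesland:Hessen"))  # ('Bundesland', 'Hessen')
```

For a row without any colon, partition returns the whole text as the label and an empty string as the value, whereas the split-based version would store the text as both key and value.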