I'm scraping a webpage but having trouble mapping the information into a dataframe. There are no tables in the HTML. Here is an example of the HTML:
html = [
<h2>Event Title</h2>
<div class="row">
  <h4>Category 1</h4>
  <div>A</div>
  <h4>Category 2</h4>
  <div>B</div>
  <h4>Category 3</h4>
  <div>C</div>
  <h4>Category 4</h4>
  <div>D</div>
</div>
]
Here is my code using requests and BeautifulSoup in Python:
data = []
event = soup.find('h2')
for i in soup.find_all('div', {'class': 'row'}):
    categories = [x.text for x in i.findAll('h4')]
    info = [x.text for x in i.findAll('div')]
    datum = {'event': event.get_text().replace('\n', '').replace('\r', ''),
             'categories': categories,
             'info': info}
    data.append(datum)
df = pd.DataFrame(data)
The dataframe ends up looking like this, with one event title and two lists:
index - event - categories - info
1 - Event Title - ['Category 1','Category 2','Category 3','Category 4'] - ["Category 1 \n A\n Category 2\n B\n Category 3\n C\n Category 4\n D\n"]
I would like each h4 to be paired with the div that follows it, so that h4 Category 1 is related to div A:
index - event - categories - info
1 - Event Title - Category 1 - A
2 - Event Title - Category 2 - B
3 - Event Title - Category 3 - C
4 - Event Title - Category 4 - D
Since h4 and div are siblings and not parent-child, is it possible to separate them in my web scraping code? I have multiple pages with different event titles, and there is too much data to do it by hand.
I have also tried, among others:
data = []
event = soup.find('h2').get_text()
for i in soup.find_all('div', {'class': 'row'}):
    categories = [x.text for x in soup.findAll('h4')]
    cats = soup.find_all('h4')
    cat = cats[3]
    info = cat.findNextSiblings('div')
    datum = {'event': event, 'categories': categories, 'info': info}
    data.append(datum)
df1 = pd.DataFrame(data)
The result of this one gives me a df of:
index - event - categories - info
1 - Event Title - ['Category 1','Category 2','Category 3','Category 4'] - [<div>A</div>, <div>B</div>, <div>C</div>, <div>D</div>]
Here is the weblink to inspect the elements: https://www.ibjjfdb.com/ChampionshipResults/926/PublicResults
Any ideas would be helpful. Thank you!
Type, category and info are all at the same level in your linked example, so you'll have to iterate through them in document order and update the current type and category as soon as a new one is encountered (note that I had to introduce a new column, Type, for the result type).
Regarding the pandas dataframe: it's much better in terms of performance and also easier to read in the code if you first collect all data in a list and only then at the end make a dataframe from this list.
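import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
data = []
r = requests.get("https://www.ibjjfdb.com/ChampionshipResults/926/PublicResults")
# Naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(r.content, 'html.parser')
event = soup.find('h2').get_text(strip=True)
# Most recently seen type (h3) and category (h4); empty until the first heading
typ = cat = ''
for i in soup.find_all('div', {'class': 'col-xs-12'}):
    # Only direct children, so headings and result blocks are visited in document order
    for s in i.find_all(['h3', 'h4', 'div'], recursive=False):
        if s.name == 'h3':
            typ = re.sub(r'\s+', ' ', s.get_text(strip=True))
        elif s.name == 'h4':
            cat = re.sub(r'\s+', ' ', s.get_text(strip=True))
        elif s.name == 'div':
            divs = s.find_all('div')
            if len(divs) > 0:
                # One row per nested div
                for di in divs:
                    info = re.sub(r'\s+', ' ', di.get_text(strip=True))
                    data.append([event, typ, cat, info])
            else:
                # No nested divs: the div itself holds the info
                info = re.sub(r'\s+', ' ', s.get_text(strip=True))
                data.append([event, typ, cat, info])
df = pd.DataFrame(data, columns=['Event', 'Type', 'Category', 'Info'])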
This yields a dataframe with 452 rows and 4 columns. Sample output of df.iloc[0]:
Event World Jiu-Jitsu IBJJF Championship 2018
Type Results of Academies
Category Adult Male
Info 10 - Ribeiro Jiu-Jitsu - 15
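For the simplified markup in the question, where each h4 is immediately followed by its own div, a shorter alternative might be to pair every h4 with its next sibling div via BeautifulSoup's find_next_sibling. This is only a sketch: the HTML string below is an assumption modelled on the question's example, not the live page, and the names rows and row are illustrative.

```python
from bs4 import BeautifulSoup

# Assumed markup, modelled on the simplified example in the question
html = """
<h2>Event Title</h2>
<div class="row">
  <h4>Category 1</h4>
  <div>A</div>
  <h4>Category 2</h4>
  <div>B</div>
  <h4>Category 3</h4>
  <div>C</div>
  <h4>Category 4</h4>
  <div>D</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
event = soup.find("h2").get_text(strip=True)

rows = []
for row in soup.find_all("div", class_="row"):
    for h4 in row.find_all("h4"):
        # The next <div> at the same level as this <h4>; whitespace text nodes are skipped
        div = h4.find_next_sibling("div")
        rows.append({
            "event": event,
            "category": h4.get_text(strip=True),
            "info": div.get_text(strip=True) if div is not None else None,
        })
```

Passing rows to pd.DataFrame(rows) should then give one row per category. Note that if a category had no div of its own, find_next_sibling would pick up the div belonging to a later category, so on less regular pages the iterate-and-track approach above is the safer choice.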