I'm trying to extract the text of a book from a Wikisource page using BeautifulSoup, but the result is always empty. The page I'm working on is Le Père Goriot by Balzac.
Here's the code I'm using:
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find the main text section
        text_section = soup.find("div", {"class": "mw-parser-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        # Extract text from paragraphs and other elements
        text_elements = text_section.find_all(["p", "div"])
        text = "\n".join(element.get_text().strip() for element in text_elements if element.get_text().strip())
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
Problem: The extract_text function always returns an empty string, even though the page clearly contains text. I suspect the issue is related to the structure of the Wikisource page, but I'm not sure how to fix it.
To find the text section, you are using the class mw-parser-output. But this class is present on two different div elements, and the first one does not contain the text. The find function returns only the first matching element, which is why you can't get the text.
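To make the find versus find_all difference concrete, here is a minimal sketch on made-up markup. The HTML string is a simplified assumption that only imitates the situation described above (two divs with the same class, text only in the second); it is not the real Wikisource page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two divs share the same class,
# and only the second one holds the book text.
html = """
<div class="mw-parser-output"><span>header only</span></div>
<div class="mw-parser-output">
  <div class="prp-pages-output"><p>Book text.</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match; find_all() returns every match.
first = soup.find("div", {"class": "mw-parser-output"})
all_matches = soup.find_all("div", {"class": "mw-parser-output"})

# The first match has no <p> or <div> descendants, so nothing is extracted from it.
first_has_paragraphs = bool(first.find_all(["p", "div"]))
second_text = all_matches[1].get_text(strip=True)

print(len(all_matches))
print(first_has_paragraphs)
print(second_text)
```

Indexing into find_all would also work on the real page, but it depends on the order of the divs, so selecting on a more specific class is likely the more robust option.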
The div with class prp-pages-output contains all the text you want, and it sits inside the second div with the class you used. You can select on this class instead to get the text.
You don't need to go through the p and div tags to get the text. You can get it directly from the parent element, and that works fine.
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find the div that wraps the transcluded book text
        text_section = soup.find("div", {"class": "prp-pages-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        # Get all the text directly from the parent element
        text = text_section.get_text().strip()
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
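As a side note on why reading from the parent is simpler, the sketch below compares the two approaches on a small made-up fragment (the markup is an assumption for illustration only). When a p is nested inside a div, collecting both tag types visits the same text twice, while calling get_text on the container returns each string once. The separator and strip arguments are optional and only tidy the output.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> nested in a <div>, one <p> directly in the container.
html = """
<div class="prp-pages-output">
  <div><p>First paragraph.</p></div>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "prp-pages-output"})

# Collecting both <p> and <div> visits the nested <p> twice:
# once through its parent <div> and once on its own.
per_element = [el.get_text(strip=True) for el in container.find_all(["p", "div"])]

# Asking the container directly yields each string once, one per line.
direct = container.get_text(separator="\n", strip=True)

print(per_element)
print(direct)
```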
But the first div and the first two p elements are not text from the book; they hold data about the book and the titles/links of the previous and next books. So if you want just the book content and nothing else, try the following. Here I have used a CSS selector that selects all the elements after the div tag that contains the meta info.
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the text
        text_elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
        text = "\n".join(element.get_text().strip() for element in text_elements)
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
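To see what the selector in the last version matches, here is a small sketch on made-up markup. The fragment and the itemid value are assumptions, not the real page. The ~ combinator matches every later sibling of the div that carries an itemid attribute, so that div itself and anything before it are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a metadata div (with an itemid attribute)
# followed by the actual content elements.
html = """
<div class="prp-pages-output">
  <p>Navigation links</p>
  <div itemid="meta">Title and author</div>
  <p>Chapter text, part one.</p>
  <p>Chapter text, part two.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "A ~ *" selects every sibling that comes after A under the same parent.
selected = soup.select("div.prp-pages-output > div[itemid] ~ *")
texts = [el.get_text(strip=True) for el in selected]

print(texts)
```

If the page has no div with an itemid attribute, the selector matches nothing and the function above returns an empty string, so it may be worth checking for that case.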