python web-scraping beautifulsoup html-parsing

Extracting text from Wikisource using BeautifulSoup returns empty result

I'm trying to extract the text of a book from a Wikisource page using BeautifulSoup, but the result is always empty. The page I'm working on is Le Père Goriot by Balzac.

Here's the code I'm using:

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the main text section
        text_section = soup.find("div", {"class": "mw-parser-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        
        # Extract text from paragraphs and other elements
        text_elements = text_section.find_all(["p", "div"])
        text = "\n".join(element.get_text().strip() for element in text_elements if element.get_text().strip())
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")

Problem: The extract_text function always returns an empty string, even though the page clearly contains text. I suspect the issue is related to the structure of the Wikisource page, but I'm not sure how to fix it.

Solution

You are locating the text section by the class mw-parser-output, but two different div elements on the page carry that class, and the first one does not contain the book text. Because find returns only the first matching element, you get the div without the text.
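To see this behaviour in isolation, here is a minimal sketch on a made-up HTML snippet. The markup is an assumption that only mimics the situation described above (two divs sharing one class, the first one empty); it is not the real Wikisource page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs share the class, only the second holds text.
html = """
<div class="mw-parser-output"></div>
<div class="mw-parser-output">
  <div class="prp-pages-output"><p>Chapter text</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, which here is the empty div.
first = soup.find("div", class_="mw-parser-output")
first_text = first.get_text(strip=True)

# find_all() returns every match, so the second div is visible too.
all_matches = soup.find_all("div", class_="mw-parser-output")
texts = [div.get_text(strip=True) for div in all_matches]

print(repr(first_text))
print(len(all_matches))
print(texts)
```

Checking len(soup.find_all(...)) like this is a quick way to confirm whether a class is unique before relying on find.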

The text you want is inside the div with class prp-pages-output, which is nested in the second mw-parser-output div. Search for that class instead.

You also don't need to collect the p and div tags one by one. Calling get_text() on the parent element returns the text of everything inside it.

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the main text section
        text_section = soup.find("div", {"class": "prp-pages-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        
        # Get all the text directly from the container element
        text = text_section.get_text().strip()
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
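The point about reading the text from the parent element can be checked on a small inline snippet. This is only a sketch: the markup is invented, and the separator and strip arguments are my own choice, not something the code above uses. It also shows a side effect of collecting both p and div tags: a paragraph nested inside a div is returned twice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one paragraph nested in a div.
html = """
<div class="prp-pages-output">
  <div><p>First paragraph.</p></div>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="prp-pages-output")

# Per-element approach: the inner div and the p inside it both match,
# so the same text shows up twice.
pieces = [el.get_text(strip=True) for el in container.find_all(["p", "div"])]

# Parent approach: each piece of text appears once, joined by newlines.
whole = container.get_text(separator="\n", strip=True)

print(pieces)
print(whole)
```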

In the text taken from prp-pages-output, however, the first div and the first two p elements are not part of the book: they hold metadata about the book and the title/link of the previous/next book. If you want only the book content, try the following. It uses a CSS selector: div[itemid] matches the div that holds the meta info, and ~ * selects every sibling element that comes after it.

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
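        # Select every sibling element that follows the metadata div
        text_elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
        # Join the non-empty pieces, one per line
        stripped = (element.get_text().strip() for element in text_elements)
        text = "\n".join(piece for piece in stripped if piece)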
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
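To make the selector easier to follow, here is a minimal sketch of it on an invented snippet. The itemid value and the paragraph text are assumptions for illustration only; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a metadata div followed by the actual content.
html = """
<div class="prp-pages-output">
  <div itemid="meta">Title and navigation</div>
  <p>First line of the book.</p>
  <p>Second line of the book.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "A ~ *" is the general sibling combinator: every element that shares
# A's parent and comes after A. The metadata div itself is not selected.
elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
lines = [el.get_text(strip=True) for el in elements]

print(lines)
```

If the selector matches nothing on a real page, select returns an empty list, so the joined result is an empty string rather than an error.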