python web-scraping hierarchical gpt-4

Scraping an index of text into ordered folders


I've been trying to scrape a web page index and am looking for advice or avenues to explore on how to go about it.

The index sits side by side with another index that has the same names. The left index leads to videos or forums, which I don't need. I want to capture the right index, which leads to chapters and text commentary.

The page is https://www.theseason.org/nt.htm and the index of Bible books I want scraped is on the right, excluding the entries just below it.

I've been using ChatGPT with GPT-4, with poor results so far.

What I've tried: I dabbled with Python and was only able to get a list of the links in the index. It was missing three books from that index for unknown reasons, and it didn't capture any chapters or text content.

I've also tried the WinHTTrack application to download the HTML files for me. That gave the best results so far, and I may return to it.

But I'm drawing a blank on how to create the hierarchical folder structure that contains the text for each chapter. Otherwise I may do this manually.


Solution

  • Here is an example of how you can get the 4th column (names + links) into a pandas DataFrame.

    Afterwards you can iterate over these links and get the information you need from them.

    import re
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.theseason.org/nt.htm"
    
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
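    # Pick the first table that contains no nested table.
    table = soup.select_one("table:not(:has(table))")
    
    data = []
    for row in table.select("tr"):
        tds = row.select("td")
        # Skip rows with fewer than 4 cells or with no link in the 4th column.
        if len(tds) > 3 and tds[3].a:
            # Collapse runs of whitespace inside the book name.
            name = re.sub(r"\s{2,}", " ", tds[3].get_text(strip=True, separator=" "))
            link = tds[3].a["href"]
    
            data.append((name, link))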
    
    df = pd.DataFrame(data, columns=["name", "link"])
    print(df.head())
    

    Prints:

              name                                               link
    0      Genesis          https://theseason.org/genesis/genesis.htm
    1       Exodus            https://theseason.org/exodus/exodus.htm
    2    Leviticus  https://www.theseason.org/leviticus/leviticus.htm
    3      Numbers      https://www.theseason.org/numbers/numbers.htm
    4  Deuteronomy            https://www.theseason.org/deut/deut.htm
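
For the folder structure asked about in the question, one possible layout is a numbered folder per book with one numbered text file per chapter, so that directory listings sort in reading order. The following is only a minimal sketch of that idea: the helper names (`safe_name`, `chapter_path`, `save_chapter`) and the `NN_Book/chapter_NNN.txt` naming scheme are assumptions made for illustration, and the chapter text is expected to have been extracted already.

```python
import re
from pathlib import Path


def safe_name(name: str) -> str:
    """Make a title usable as a folder name (hypothetical helper)."""
    # Drop everything except word characters, hyphens, and spaces.
    cleaned = re.sub(r"[^\w\- ]", "", name).strip()
    # Replace each run of whitespace with a single underscore.
    return re.sub(r"\s+", "_", cleaned)


def chapter_path(root: Path, book_index: int, book: str, chapter: int) -> Path:
    """Build <root>/<NN>_<Book>/chapter_<NNN>.txt (assumed naming scheme)."""
    folder = root / f"{book_index:02d}_{safe_name(book)}"
    return folder / f"chapter_{chapter:03d}.txt"


def save_chapter(root: Path, book_index: int, book: str, chapter: int, text: str) -> Path:
    """Write one chapter's text to disk, creating the book folder if needed."""
    target = chapter_path(root, book_index, book, chapter)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

With the DataFrame from above, `enumerate(df["name"], start=1)` would supply the book index and name. The chapter text itself still has to be fetched from each book's `link`; how the chapter links are laid out on those pages is not verified here, so that step is left out of the sketch.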