json python-3.x web-scraping beautifulsoup

Only Pulling Last Item from Tag in BeautifulSoup


I have a script that loops through multiple web pages, but I am stuck on one small issue. I am trying to add each book's author to the list, but the script takes the last author on the page and applies it to every author field. How do I get the script to assign each author to the matching title? Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

base_url = "https://archive.org/details/librivoxaudio?&sort=titleSorter"

data = []
n = 5
for i in range(1, n+1):
   response = urlopen(base_url + "&page=" + str(i))
   page_html = response.read()
   response.close()

   #html parsing
   page_soup = soup(page_html, "html.parser")

   #grabs info for each book
   containers = page_soup.findAll("div",{"class":"item-ttl"})
   authors = page_soup.findAll("span",{"class":"byv"})

   for container in containers:
     item = {}
     item['type'] = "Public Domain Audiobook"
     item['title'] = container.text.lstrip().strip()
     for author in authors:
         item['author'] = author.text
     item['link'] = "https://archive.org/" + container.a["href"]
     item['source'] = "LibriVox"
     item['base_url'] = "https://librivox.org/"
     data.append(item) # add the item to the list

     with open("./json/librivoxTest.json", "w") as writeJSON:
       json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:

{
"type": "Public Domain Audiobook",
"title": "A Book of Old English Ballads",
"author": "Charles Whibley",
"link": "https://archive.org//details/book_old_english_ballads_1007_librivox",
"source": "LibriVox",
"base_url": "https://librivox.org/"
}, {
"type": "Public Domain Audiobook",
"title": "A Book of Scoundrels",
"author": "Charles Whibley",
"link": "https://archive.org//details/scoundrels_1712_librivox",
"source": "LibriVox",
"base_url": "https://librivox.org/"
}

The last author is correct for 'A Book of Scoundrels', but 'A Book of Old English Ballads' should have George Wharton Edwards as the author.


Solution

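  • The inner "for author in authors" loop reassigns item['author'] on every pass, so each item ends up with the last author on the page. I think the script below should fix that: it looks up the title and the author inside the same container, so each author stays with its own title. I also tried to organize it a little.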

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    import json
    
    urls = ["https://archive.org/details/librivoxaudio?&sort=titleSorter&page={}".format(page) for page in range(1,3)]
    
    data = []  # one list shared by all pages, so results accumulate
    for link in urls:
        soup = BeautifulSoup(urlopen(link).read(), "html.parser")
        # each matching div wraps a single book, so title and author are read together
        for container in soup.select("div[data-id$='_librivox']"):
            author = container.select_one(".byv")     # may be missing for some items
            anchor = container.select_one("a[title]")
            item = {}
            item['type'] = "Public Domain Audiobook"
            item['title'] = container.select_one(".ttl").get_text(strip=True)
            item['author'] = author.get_text(strip=True) if author else ""
            item['link'] = urljoin(link, anchor['href']) if anchor else ""
            item['source'] = "LibriVox"
            item['base_url'] = "https://librivox.org/"
            data.append(item)
    
    # print once, after every page has been processed
    print(json.dumps(data, indent=4))
    

    The output looks like this:

    [
        {
            "type": "Public Domain Audiobook",
            "title": "\"BOOH!\"",
            "author": "Eugene Field",
            "link": "https://archive.org/details/booh_1403.poem_librivox",
            "source": "LibriVox",
            "base_url": "https://librivox.org/"
        },
        {
            "type": "Public Domain Audiobook",
            "title": "\"You Bid Me Try\"",
            "author": "Henry Austin Dobson",
            "link": "https://archive.org/details/youbid_metry_1104_librivox",
            "source": "LibriVox",
            "base_url": "https://librivox.org/"
        },
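
To see why the original loop repeats the last author, here is a minimal sketch that needs no network access. It uses the two titles and authors from the question as plain lists standing in for the BeautifulSoup results. The function names pair_with_inner_loop and pair_by_position are made up for illustration and are not part of any library.

```python
# Stand-in data: in the real script these values come from BeautifulSoup lookups.
titles = ["A Book of Old English Ballads", "A Book of Scoundrels"]
authors = ["George Wharton Edwards", "Charles Whibley"]


def pair_with_inner_loop(titles, authors):
    """Reproduce the bug: the inner loop reassigns 'author' until only the last one is left."""
    items = []
    for title in titles:
        item = {"title": title}
        for author in authors:
            item["author"] = author  # overwritten on every pass of the inner loop
        items.append(item)
    return items


def pair_by_position(titles, authors):
    """Pair the i-th title with the i-th author instead of looping over all authors."""
    return [{"title": t, "author": a} for t, a in zip(titles, authors)]


buggy = pair_with_inner_loop(titles, authors)
fixed = pair_by_position(titles, authors)

print([item["author"] for item in buggy])  # ['Charles Whibley', 'Charles Whibley']
print([item["author"] for item in fixed])  # ['George Wharton Edwards', 'Charles Whibley']
```

Pairing by position with zip only stays correct if every title has exactly one author element and both lists are in the same order. Looking up the author inside each book's container, as the script above does, avoids that assumption because a missing author cannot shift the others out of place.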