Tags: python, parsing, web-scraping, beautifulsoup

How do I scrape the specific data I want?


I am taking my first crack at data scraping and am having trouble getting the specific data I want. Ultimately, I want to identify all players who were born and/or played high school baseball in Colorado and save their names and places of birth in a dictionary. I'm able to identify the tag(s) the data is contained within, but I haven't been able to retrieve the data itself.

I have studied scraping extensively but haven't been able to make much progress. It seems to me that I need to use soup.find_all('tag', attrs={}) to parse the data I need, but I have had difficulty determining how to identify the data I want with 'attrs'. If there is a relevant post for this topic already, I'm happy to review that as well. I was unable to find a post that was helpful, likely due to a lack of technical knowledge on my part.

The B-Ref homepage is https://www.baseball-reference.com/.

Thank you

#Python program to scrape website

import requests
import html5lib
from bs4 import BeautifulSoup
import csv

URL = 'https://www.baseball-reference.com/players/p/paytoja01.shtml'
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

#print(soup.prettify())

a_tag = soup.find_all('a')

print(a_tag)
#Colorado_Born_and_HS = {}
#Colorado_Born = {}
#Colorado_HS = {}

I have tried a variety of approaches including soup.find, .find_all, .find_all_next, .next_siblings, etc. I did not include all of these in the sample of my code because it was messy and I imagine this question has a relatively simple answer.


Solution

  • Here is how I found the date and place of birth for your example:

    #Python program to scrape website
    
    import requests
    import html5lib
    from bs4 import BeautifulSoup
    
    URL = 'https://www.baseball-reference.com/players/p/paytoja01.shtml'
    r = requests.get(URL)
    
    soup = BeautifulSoup(r.content, 'html5lib')
    
    # Using soup
    birthday = soup.find('span', id="necro-birth")
    print(birthday.text.strip())
    
    # Using plaintext
    txt = r.text  # decoded page text; str(r.content) would be the repr of a bytes object
    born_dirty = txt.split("was born in")[1].split("</a>")[0]
    born = born_dirty.split("<")[0] + born_dirty.split(">")[1]
    born = born.strip()
    
    print(born)
    

    Output:

    November 22, 1972
    Zanesville, OH
    

    Looking at the source of the page the URL leads to, I noticed that the date of birth is in a span with the id necro-birth. This is ideal, since an id lets us find the element directly.
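
To make the id lookup easier to experiment with, here is a minimal offline sketch. The markup in SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the real page (only the necro-birth id and the printed values come from the example above), and birth_info is an illustrative helper name, not part of BeautifulSoup. It also assumes the birthplace link is the first a tag after the span, which may not hold on the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the live page may be laid out differently.
SAMPLE_HTML = """
<div id="meta">
  <p><strong>Born:</strong>
    <span id="necro-birth">November 22, 1972</span>
    <span>in <a href="/bio/OH_born.shtml">Zanesville, OH</a></span>
  </p>
</div>
"""


def birth_info(html):
    """Return (birth_date, birth_place), or None if the span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="necro-birth")
    if span is None:
        # find() returns None when nothing matches, so check before using .text
        return None
    date = span.get_text(strip=True)
    # Assumption: the birthplace is the first <a> that follows the span.
    link = span.find_next("a")
    place = link.get_text(strip=True) if link is not None else None
    return date, place


print(birth_info(SAMPLE_HTML))
```

Checking for None matters in practice: if a player page lacks the span, calling .text on the result of find() would raise an AttributeError.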

    That said, I don't usually use BS4 for scraping. I simply take the text of the page and split it on points of interest until I get what I want; that is the "Using plaintext" part of the code. I searched the page (Ctrl+F) for "Zanesville" and decided to go after its second occurrence, the one in the FAQ. If the FAQ is not auto-generated, this might not work for every player.
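
The split-on-markers idea can be wrapped in a small helper, sketched here with the standard library only. The function name text_after and the sample string are hypothetical; the sketch assumes the value of interest sits between a known marker phrase and the next closing a tag, as in the plaintext example.

```python
import re


def text_after(html, marker, end="</a>"):
    """Return the tag-free text between the first `marker` and the next `end`."""
    _, found, rest = html.partition(marker)
    if not found:
        # Fail loudly instead of silently returning the wrong piece of the page
        raise ValueError(f"marker {marker!r} not found")
    fragment, _, _ = rest.partition(end)
    # Drop any HTML tags left inside the fragment, e.g. the opening <a ...>
    return re.sub(r"<[^>]*>", "", fragment).strip()


# Hypothetical sentence shaped like the FAQ text targeted above
sample = 'The player was born in <a href="/bio/OH_born.shtml">Zanesville, OH</a>.'
print(text_after(sample, "was born in"))  # prints: Zanesville, OH
```

Using partition instead of indexing into split avoids an IndexError when the marker is absent, and the explicit ValueError says what went wrong.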

    You can throw in some asserts for sanity checks, like replacing:

    born_dirty = txt.split("was born in")[1].split("</a>")[0]
    

    with

    t1 = txt.split("was born in")
    assert len(t1) == 3
    born_dirty = t1[1].split("</a>")[0]
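
Once a birthplace string has been extracted for each player, the dictionary described in the question could be built along these lines. This is only a sketch: the player names are placeholders, born_in_state is a hypothetical helper, and it assumes birthplaces follow the "City, ST" pattern seen in the output above.

```python
def born_in_state(birthplaces, state="CO"):
    """Keep only the players whose birthplace ends with ', <state>'."""
    suffix = f", {state}"
    return {
        name: place
        for name, place in birthplaces.items()
        if place.endswith(suffix)
    }


# Placeholder data; in practice this would be filled by scraping each player page
players = {
    "Player A": "Zanesville, OH",
    "Player B": "Denver, CO",
    "Player C": "Boulder, CO",
}

colorado_born = born_in_state(players)
print(colorado_born)  # prints: {'Player B': 'Denver, CO', 'Player C': 'Boulder, CO'}
```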