python parsing web-scraping beautifulsoup

How do I scrape the specific data I want?

I am taking my first crack at data scraping and am having trouble getting the specific data I want. Ultimately, I want to identify all players that were born and/or played high school baseball in Colorado and save their name and place of birth in a dictionary. I'm able to identify the tag(s) the data is contained within, but I haven't been able to successfully retrieve the data.

I have studied scraping extensively but haven't been able to make much progress. It seems to me that I need to use soup.find_all('tag', attrs={}) to parse the data I need, but I have had difficulty determining how to identify the data I want with 'attrs'. If there is a relevant post for this topic already, I'm happy to review that as well. I was unable to find a post that was helpful, likely due to a lack of technical knowledge on my part.

For reference, the B-Ref homepage is https://www.baseball-reference.com/.

Thank you

#Python program to scrape website

import requests
import html5lib
from bs4 import BeautifulSoup
import csv

URL = 'https://www.baseball-reference.com/players/p/paytoja01.shtml'
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

#print(soup.prettify())

a_tag = soup.find_all('a')

print(a_tag)
#Colorado_Born_and_HS = {}
#Colorado_Born = {}
#Colorado_HS = {}

I have tried a variety of approaches including soup.find, .find_all, .find_all_next, .next_siblings, etc. I did not include all of these in the sample of my code because it was messy and I imagine this question has a relatively simple answer.

Solution

Here is how I found the birth date and birthplace for your example:

#Python program to scrape website

import requests
import html5lib
from bs4 import BeautifulSoup

URL = 'https://www.baseball-reference.com/players/p/paytoja01.shtml'
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

# Using soup
birthday = soup.find('span', id="necro-birth")
print(birthday.text.strip())

# Using plaintext
txt = r.text  # decoded page text; str(r.content) would give the repr of a bytes object
born_dirty = txt.split("was born in")[1].split("</a>")[0]
born = born_dirty.split("<")[0] + born_dirty.split(">")[1]
born = born.strip()

print(born)

output:

November 22, 1972
Zanesville, OH

Reading the page the URL leads to, I noticed that the date of birth is in a span with an id. That is ideal, because soup.find can locate it directly by that id.
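A minimal sketch of the same id lookup, run against a small inline HTML string instead of the live page so it can be tried offline. The markup in sample_html is a simplified stand-in made up for illustration, and find_birth_date is a hypothetical helper name. It uses Python's built-in "html.parser" rather than html5lib only to avoid the extra dependency. The None check is there because soup.find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the player page (assumption, not the real markup)
sample_html = """
<div id="meta">
  <p><strong>Born:</strong>
    <span id="necro-birth"> November 22, 1972 </span>
  </p>
</div>
"""

def find_birth_date(html):
    """Return the text of the span with id="necro-birth", or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="necro-birth")
    if span is None:
        # soup.find returns None when no element matches
        return None
    return span.get_text(strip=True)

print(find_birth_date(sample_html))
print(find_birth_date("<p>no birth span here</p>"))
```

With the live page you would pass r.content (or r.text) to the helper instead of sample_html.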

That said, I don't usually use BS4 for scraping; I simply take the text of the page and split it on points of interest until I get what I want. That is the "Using plaintext" part of the code. Note that I searched (Ctrl+F) for "Zanesville" on the page the URL leads to and decided to use the second occurrence, so I went after that. If the FAQ is not auto-generated, this approach might not work for every player page.
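The split-on-markers idea can be sketched as a small standard-library function. This is only an illustration: extract_birthplace is a hypothetical helper, and the sample string (including its href) is invented to mimic the shape of the "was born in" sentence, not copied from the real page.

```python
import re

def extract_birthplace(page_text, marker="was born in"):
    """Return the text after the first `marker`, up to the next </a>, with tags removed."""
    before, found, after = page_text.partition(marker)
    if not found:
        # The marker is not on the page, so there is nothing to extract
        return None
    fragment = after.split("</a>", 1)[0]
    # Drop any remaining tags such as the opening <a href="...">
    return re.sub(r"<[^>]*>", "", fragment).strip()

# Hypothetical sample text; the href is a placeholder
sample = 'Jay Payton was born in <a href="/hypothetical/path">Zanesville, OH</a>.'
print(extract_birthplace(sample))
```

Returning None when the marker is missing avoids the IndexError that indexing the result of split would raise on a page without that sentence.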

You can throw in some asserts for sanity checks, like replacing:

born_dirty = txt.split("was born in")[1].split("</a>")[0]

with

t1 = txt.split("was born in")
assert len(t1) == 3  # expect exactly two occurrences of "was born in"
born_dirty = t1[1].split("</a>")[0]
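Once a birthplace string has been extracted for each player, the dictionary from the question could be built along these lines. This is a rough sketch under assumptions: colorado_born is a hypothetical helper, it assumes birthplaces end in ", CO" in the same "City, ST" form as "Zanesville, OH", and "Example Player" is a made-up entry, not a real player.

```python
def colorado_born(birthplaces):
    """Keep only the players whose birthplace string ends with ', CO'."""
    return {
        name: place
        for name, place in birthplaces.items()
        if place.strip().endswith(", CO")
    }

# "Example Player" is invented for illustration only
scraped = {
    "Jay Payton": "Zanesville, OH",
    "Example Player": "Denver, CO",
}
print(colorado_born(scraped))
```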