I am taking my first crack at data scraping and am having trouble getting the specific data I want. Ultimately, I want to identify all players that were born and/or played high school baseball in Colorado and save their name and place of birth in a dictionary. I'm able to identify the tag(s) the data is contained within, but I haven't been able to successfully retrieve the data.
I have studied scraping extensively but haven't been able to make much progress. It seems to me that I need to use soup.find_all('tag', attrs={}) to parse the data I need, but I have had difficulty determining how to identify the data I want with 'attrs'. If there is a relevant post for this topic already, I'm happy to review that as well. I was unable to find a post that was helpful, likely due to a lack of technical knowledge on my part.
For reference, the B-Ref homepage is https://www.baseball-reference.com/.
Thank you
#Python program to scrape website
import requests
import html5lib
from bs4 import BeautifulSoup
import csv
URL = 'https://www.baseball-reference.com/players/p/paytoja01.shtml'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
#print(soup.prettify())
a_tag = soup.find_all('a')
print(a_tag)
#Colorado_Born_and_HS = {}
#Colorado_Born = {}
#Colorado_HS = {}
I have tried a variety of approaches including soup.find, .find_all, .find_all_next, .next_siblings, etc. I did not include all of these in the sample of my code because it was messy and I imagine this question has a relatively simple answer.
Here is how I found the birth date and birthplace for your example:
#Python program to scrape website
import requests
import html5lib
from bs4 import BeautifulSoup
URL = 'https://www.baseball-reference.com/players/p/paytoja01.shtml'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
# Using soup
birthday = soup.find('span', id="necro-birth")
print(birthday.text.strip())
# Using plaintext
txt = r.text  # decoded HTML as str; str(r.content) would give the repr of a bytes object
born_dirty = txt.split("was born in")[1].split("</a>")[0]
born = born_dirty.split("<")[0] + born_dirty.split(">")[1]
born = born.strip()
print(born)
output:
November 22, 1972
Zanesville, OH
Looking at the page the URL leads to, I noticed that the date of birth sits in a span with an id (necro-birth). That is ideal, because an id lets us find the element directly.
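To illustrate looking an element up by id with attrs, here is a minimal sketch that runs on an inline HTML snippet instead of the live page. The SAMPLE_HTML markup, the birth_info helper, and the idea that the birthplace is the first link after the date span are all assumptions made for illustration; the real page may be laid out differently. It uses the built-in 'html.parser' so it does not depend on html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not copied from the site).
SAMPLE_HTML = """
<div id="meta">
  <p><strong>Born:</strong>
    <span id="necro-birth" data-birth="1972-11-22">
      November 22, 1972
    </span>
    <span>in <a href="/bio/OH_born.shtml">Zanesville, OH</a></span>
  </p>
</div>
"""

def birth_info(html):
    """Return (birth_date, birthplace), or (None, None) if the span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={} matches on tag attributes; here we match on the id.
    span = soup.find("span", attrs={"id": "necro-birth"})
    if span is None:
        return None, None
    # Collapse the whitespace and newlines inside the span.
    date = " ".join(span.get_text().split())
    # Assumption: the birthplace is the first <a> after the date span.
    link = span.find_next("a")
    place = link.get_text(strip=True) if link else None
    return date, place

date, place = birth_info(SAMPLE_HTML)
print(date, "|", place)
```

Checking for None before touching .text avoids an AttributeError on pages where the span is absent.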
That said, I don't usually use BS4 for scraping. I simply take the text of the page and split it on points of interest until I get what I want; that is the "Using plaintext" part of the code. I searched the page (Ctrl+F) for "Zanesville" and decided to go after the second occurrence, the one in the FAQ sentence containing "was born in". If the FAQ is not auto-generated, this might not work for every player.
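To make the chain of splits easier to follow, here is the same logic broken into steps on a single sample string. The sentence in txt is a made-up stand-in for the FAQ text (an assumption); only the split logic mirrors the code above.

```python
# Hypothetical FAQ sentence; the real page's wording and markup may differ.
txt = 'Where was he born? He was born in <a href="/bio/OH_born.shtml">Zanesville, OH</a>.'

# 1. Keep everything after the first "was born in".
after_marker = txt.split("was born in")[1]

# 2. Cut at the closing tag of the link that follows.
born_dirty = after_marker.split("</a>")[0]

# 3. Take the text before the opening tag and the text inside the link.
before_tag = born_dirty.split("<")[0]
inside_tag = born_dirty.split(">")[1]

born = (before_tag + inside_tag).strip()
print(born)
```

Each step relies on the marker and the link being present. If either is missing, the indexing raises an IndexError.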
You can throw in some asserts for sanity checks, like replacing:
born_dirty = txt.split("was born in")[1].split("</a>")[0]
with
t1 = txt.split("was born in")
assert len(t1) == 3  # expect exactly two occurrences of "was born in"
born_dirty = t1[1].split("</a>")[0]
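Building on the sanity-check idea, the sketch below wraps the split logic in a function that raises a ValueError instead of using assert (asserts are skipped when Python runs with -O), and then collects Colorado birthplaces into a dictionary as the question asks. The function name extract_birthplace, the pages dictionary, the player names, and the ", CO" suffix test are all hypothetical and for illustration only.

```python
def extract_birthplace(txt, marker="was born in"):
    """Pull the birthplace out of page text, or raise ValueError if the layout is unexpected."""
    parts = txt.split(marker)
    if len(parts) < 2:
        raise ValueError(f"marker {marker!r} not found")
    born_dirty = parts[1].split("</a>")[0]
    if ">" not in born_dirty:
        raise ValueError("expected a link right after the marker")
    return (born_dirty.split("<")[0] + born_dirty.split(">")[1]).strip()

# Made-up page text keyed by made-up player names (assumptions).
pages = {
    "Player A": 'Player A was born in <a href="/bio/CO_born.shtml">Denver, CO</a>.',
    "Player B": 'Player B was born in <a href="/bio/OH_born.shtml">Zanesville, OH</a>.',
}

colorado_born = {}
for name, txt in pages.items():
    place = extract_birthplace(txt)
    # Assumption: birthplaces are formatted as "City, ST".
    if place.endswith(", CO"):
        colorado_born[name] = place

print(colorado_born)
```

In a real run, pages would be filled with r.text from each player URL, and a ValueError would flag pages whose layout differs from the expected one.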