All,
I am trying to parse a table from this link: http://web1.ncaa.org/stats/StatsSrv/careersearch. Please note: under "School/Sport Search", select All for School, 2005-2006 for Year, Football for Sport, and Division I. The column I am trying to parse is the school names; clicking a school name shows more information. From that linked page/table I would like to parse "Stadium Capacity" for every school. Is something like this possible? If yes, how? I am new to Python and BeautifulSoup, so an explanation would be great!
Note: There are 239 results.
To summarize: I would like to parse the school names along with their stadium capacities and convert the result into a pandas DataFrame.
import requests
from bs4 import BeautifulSoup
URL = "http://web1.ncaa.org/stats/StatsSrv/careerteam"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
Is something like this possible?
Yes.
If yes, how?
There is a lot going on in the code below, but the main point is to figure out the POST requests the browser makes and then emulate them with Requests. You can see those requests in the "Network" tab of the browser's inspect tool.
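As a rough illustration of what emulating a POST request involves, the sketch below uses only the standard library to show the form-encoded body for the search form. The field names and values are the ones used in the search request in this answer's code; Requests form-encodes a dict passed as data= in the same way. The variable names are just for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Form fields taken from the search request used in this answer.
search_form = {
    'doWhat': 'teamSearch',
    'searchOrg': 'X',
    'academicYear': 2006,
    'searchSport': 'MFB',
    'searchDiv': 1,
}

# This is the kind of body you would see in the Network tab for a form POST.
body = urlencode(search_form)
print(body)

# Going the other way: a body copied from the Network tab can be turned
# back into a flat dict (all values become strings) to pass as data=.
# keep_blank_values=True keeps fields that were sent empty.
fields = {k: v[0] for k, v in parse_qs(body, keep_blank_values=True).items()}
print(fields)
```

Comparing a body copied from the browser against the one produced from your own dict is a quick way to check that no field is missing before sending the request.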
First we make the 'search' POST request. The response contains a left and a right table. Clicking a link in the left table lists the schools in that area. Looking carefully, clicking an area link is also a POST request, so we have to send it with Requests as well.
E.g., clicking on 'Air Force - Eastern Ill.' gives us a table containing links to the schools in that area. We then have to follow each school link to find the capacity.
Clicking a school link is also a POST request, so we emulate that too. It returns the school page, and from there we scrape the school name and capacity.
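The school links appear to carry their parameters as arguments inside parentheses rather than as a plain URL, so those arguments have to be pulled out and turned into form fields. Here is a minimal sketch of that step in isolation. The href string and the function name parse_link_args are made up for illustration; only the argument order (ID, academic year, sport code, division, org ID) is taken from the code in this answer.

```python
import re

def parse_link_args(href):
    """Return the school-page form fields encoded in a link, or None."""
    match = re.search(r"\((.*)\)", href)
    if match is None:
        return None
    # Split the argument list and drop surrounding spaces and quotes.
    parts = [p.strip().strip("'\"") for p in match.group(1).split(',')]
    if len(parts) < 5:
        return None
    try:
        return {
            'playerId': int(parts[0]),
            'coachId': int(parts[0]),
            'academicYear': int(parts[1]),
            'sportCode': parts[2],
            'division': int(parts[3]),
            'orgId': int(parts[4]),
        }
    except ValueError:
        # One of the numeric arguments was not a number.
        return None

# Hypothetical href; the real function name and IDs on the site may differ.
example = "javascript:showTeamPage(1234, 2006, 'MFB', 1, 721);"
print(parse_link_args(example))
```

Returning None for rows without a usable link lets the scraping loop skip them with a simple check instead of relying on nested try/except blocks.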
To learn about Session objects, read "Advanced Usage" in the Requests documentation; to learn about sending requests with Requests, read "Making a request".
import requests
from bs4 import BeautifulSoup
import pandas as pd
end_list=[]
s = requests.Session()
URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
r = s.post(URL,data=data)
soup=BeautifulSoup(r.text,'html.parser')
area_list=soup.find_all('table')[8].find_all('tr')
area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
for idx in range(0,area_count):
    data={
        'sortOn': 0,
        'doWhat': 'showIdx',
        'playerId':'' ,'coachId': '',
        'orgId':'' ,
        'academicYear':'' ,
        'division':'' ,
        'sportCode':'' ,
        'idx': idx
    }
    r = s.post(URL,data=data)
    soup=BeautifulSoup(r.text,'html.parser')
    last_table=soup.find_all('table')[-1]#last table
    for tr in last_table.find_all('tr'):
        link_td=tr.find('td',class_="text")
        try:
            link_a=link_td.find('a')['href']
            data_params=link_a.split('(')[1][:-2].split(',')
            try:
                #print(data_params)
                sports_code=data_params[2].replace("'","").strip()
                division=int(data_params[3])
                player_coach_id=int(data_params[0])
                academic_year=int(data_params[1])
                org_id=int(data_params[4])
                #print(sports_code,division,player_coach_id,academic_year,org_id)
                data={
                    'sortOn': 0,
                    'doWhat': 'display',
                    'playerId': player_coach_id,
                    'coachId': player_coach_id,
                    'orgId': org_id,
                    'academicYear': academic_year,
                    'division':division,
                    'sportCode':sports_code,
                    'idx':''
                }
                url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
                r = s.post(url,data=data)
                soup2=BeautifulSoup(r.text,'html.parser')
                institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
                capacity=soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
                #print([institution_name, capacity])
                end_list.append([institution_name, capacity])
            except IndexError:
                pass
        except AttributeError:
            pass
#print(end_list)
headers=['School','Capacity']
df=pd.DataFrame(end_list, columns=headers)
print(df)
Output
School Capacity
0 Air Force 46,692
1 Akron 30,000
2 Alabama 101,821
3 Alabama A&M 21,000
4 Alabama St. 26,500
5 Albany (NY) 8,500
6 Alcorn 22,500
7 Appalachian St. 30,000
8 Arizona 55,675
9 Arizona St. 64,248
10 Ark.-Pine Bluff 14,500
11 Arkansas 72,000
12 Arkansas St. 30,708
13 Army West Point 38,000
14 Auburn 87,451
15 Austin Peay 10,000
16 BYU 63,470
17 Ball St. 22,500
18 Baylor 45,140
19 Bethune-Cookman 9,601
20 Boise St. 36,387
21 Boston College 44,500
22 Bowling Green 24,000
23 Brown 20,000
24 Bucknell 13,100
25 Buffalo 29,013
26 Butler 5,647
27 Cal Poly 11,075
28 California 62,467
29 Central Conn. St. 5,500
.. ... ...
209 UCLA 91,136
210 UConn 40,000
211 UNI 16,324
212 UNLV 36,800
213 UT Martin 7,500
214 UTEP 52,000
215 Utah 45,807
216 Utah St. 25,100
217 VMI 10,000
218 Valparaiso 5,000
219 Vanderbilt 40,350
220 Villanova 12,000
221 Virginia 61,500
222 Virginia Tech 65,632
223 Wagner 3,300
224 Wake Forest 31,500
225 Washington 70,138
226 Washington St. 32,740
227 Weber St. 17,500
228 West Virginia 60,000
229 Western Caro. 13,742
230 Western Ill. 16,368
231 Western Ky. 22,113
232 Western Mich. 30,200
233 William & Mary 12,400
234 Wisconsin 80,321
235 Wofford 13,000
236 Wyoming 29,181
237 Yale 64,269
238 Youngstown St. 20,630
[239 rows x 2 columns]
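The Capacity values above are strings such as '46,692'. If you need numbers (for sorting or arithmetic), something along these lines could be applied to each row before building the DataFrame. This is only a sketch: parse_capacity is a made-up helper name, and it assumes capacities are plain digits with comma separators, returning None for anything else.

```python
def parse_capacity(text):
    """Convert a capacity string like '46,692' to an int, or None if not numeric."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

# A few rows copied from the output above.
rows = [['Air Force', '46,692'], ['Akron', '30,000'], ['Wagner', '3,300']]
numeric_rows = [[school, parse_capacity(cap)] for school, cap in rows]
print(numeric_rows)
```

With an existing DataFrame, the same helper could be applied to the column with df['Capacity'].map(parse_capacity).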
Note: This will take a long time, because we are scraping more than 239 pages, so be patient. It might take 15 minutes or longer.