After searching through hundreds of answers, I'm here again, asking a new question that might help someone in the future.
I'm scraping this website: https://inview.doe.in.gov/state/1088000000/school-list.
The school list is in a flex box, and I believe I could fetch the data with Selenium. But I want to get this job done using only BeautifulSoup.
By inspecting and tracking the network connections, I found 2 API calls, and I'm not sure which one gives me the school list. I have their IPv4 address as well.
api = 'https://inview.doe.in.gov/api/entities?lang=en&merges=[{"route": "entities", "name": "district", "local_field": "district_id", "foreign_field": "id", "fields": "id,name"}]&filter=state_id==1088000000'
ipv4 = '104.18.21.238:443'
api2 = 'https://inview.doe.in.gov/api/entities?filter=type==district,type==network,type==school,type==state&fields=name,type,id,district_id'
ipv4 = '104.18.21.238:443'
Trying to access the content directly gives None, as it is dynamically loaded (at least that's what I believe).
import json
import requests
from bs4 import BeautifulSoup

def url_parser(url):
    html_doc = requests.get(url, headers={"Accept":"*/*"}).text
    soup = BeautifulSoup(html_doc,'html.parser')
    return html_doc, soup

def data_fetch(url):
    html_doc, soup = url_parser(url)
    api_link = 'https://inview.doe.in.gov/api/entities?lang=en&merges=[{"route": "entities", "name": "district", "local_field": "district_id", "foreign_field": "id", "fields": "id,name"}]&filter=state_id==1088000000'
    html_doc2, soup2 = url_parser(api_link)
    #school_id = soup2.find_all('div', {'class':'result-table table--results mt-3'})
    print(soup2)

def main():
    url = "https://inview.doe.in.gov/state/1088000000/school-list"
    data_fetch(url)

main()
Opening the API link directly in the browser gives me the same error message that the code prints:
{"message":"The resource identified by the request is only capable of generating response entities which have content characteristics not acceptable according to the accept headers sent in the request. Supported entities are: application/json, application/vnd.tembo.api+json, application/vnd.tembo.api+json;version=1","status":406}
Is there any way I can fix that?
For example, send an Accept header with one of the media types that the error message lists as supported:
import requests
import pandas as pd

url = "https://inview.doe.in.gov/api/entities?lang=en&merges=[{%22route%22:%20%22entities%22,%20%22name%22:%20%22district%22,%20%22local_field%22:%20%22district_id%22,%20%22foreign_field%22:%20%22id%22,%20%22fields%22:%20%22id,name%22}]&filter=state_id==1088000000"
# Request a media type the API supports; "Accept: */*" is rejected with 406.
headers = {
    'accept': 'application/vnd.tembo.api+json',
}

schools = []
response = requests.request("GET", url, headers=headers)
for school in response.json()['entities']:
    schools.append({
        'ID': school['id'],
        'Name': school['name'],
        'Type': school['type'],
        # First and last grade names, or 'NA' when the entity has no 'grades' key.
        'Grades': ' - '.join([school['grades'][0]['name'], school['grades'][-1]['name']]) if 'grades' in school else 'NA',
        # Some entities have no phone number.
        'Phone': school.get('phone_number', 'NA'),
    })

df = pd.DataFrame(schools)
print(df.to_string(index=False))
OUTPUT:
ID          Name                                    Type    Grades                   Phone
1053105210  Edgewood Intermediate School (5210)     school  Grade 4 - Grade 6        (317) 803-5024
1053105317  Wanamaker Early Learning Center (5317)  school  Pre-K - Pre-K            (317) 860-4500
1045353742  Wolcott Mills Elementary School (3742)  school  Pre-K - Pre-K            (260) 499-2450
1045353746  Lima-Brighton Elementary (3746)         school  Pre-K - Pre-K            (260) 499-2440
1033352672  Little Cadets Preschool (2672)          school  Pre-K - Pre-K            (000) 000-0000
1014051133  Washington Primary (1133)               school  Pre-K - Grade 1          (812) 254-8360
1018751365  Royerton Elementary School (1365)       school  Kindergarten - Grade 5   (765) 282-2044
1018751367  Delta Middle School (1367)              school  Grade 6 - Grade 8        (765) 747-0869
1018751369  Delta High School (1369)                school  Grade 9 - Grade 12       (765) 288-5597
1018751409  Eaton Elementary School (1409)          school  Kindergarten - Grade 5   (765) 396-3301
1018751520  Albany Elementary School (1520)         school  Kindergarten - Grade 5   (765) 789-6102
1019101387  Yorktown Middle School (1387)           school  Grade 6 - Grade 8        (765) 759-2660
1019101389  Yorktown High School (1389)             school  Grade 9 - Grade 12       (765) 759-2550
1019101393  Yorktown Elementary School (1393)       school  Grade 3 - Grade 5        (765) 759-2770
1019101395  Pleasant View Elementary School (1395)  school  Kindergarten - Grade 2   (765) 759-2800
1018951375  Wapahani High School (1375)             school  Grade 9 - Grade 12       (765) 289-7323
1018951377  Selma Middle School (1377)              school  Grade 6 - Grade 8        (765) 288-7242
1018951381  Selma Elementary School (1381)          school  Kindergarten - Grade 5   (765) 282-2455
1019701500  Muncie Virtual Academy (1500)           school  Kindergarten - Grade 12  NA
1019701513  East Washington Academy (1513)          school  Pre-K - Grade 5          (765) 747-5434
...
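If hand-encoding the query string feels fragile, one possible variation is to let requests build it from a params dict. The sketch below is only that, a sketch: the helper names build_request, to_row and fetch_rows are made up for illustration, the sample entity is hand-written to match the fields the code above reads (it is not a real API response), and it assumes the server accepts the fully percent-encoded form of the merges and filter parameters. Preparing the request without sending it lets you inspect the URL and headers offline.

```python
import json
from urllib.parse import parse_qs, urlsplit

import requests

BASE_URL = "https://inview.doe.in.gov/api/entities"


def build_request(state_id="1088000000"):
    """Prepare (but do not send) the GET request for the entity list."""
    merges = [{
        "route": "entities",
        "name": "district",
        "local_field": "district_id",
        "foreign_field": "id",
        "fields": "id,name",
    }]
    params = {
        "lang": "en",
        "merges": json.dumps(merges),       # requests percent-encodes this
        "filter": f"state_id=={state_id}",  # assumption: "%3D%3D" is accepted
    }
    # One of the media types named in the 406 error message.
    headers = {"Accept": "application/vnd.tembo.api+json"}
    return requests.Request("GET", BASE_URL, params=params, headers=headers).prepare()


def to_row(entity):
    """Flatten one entity dict into the columns used in the table above."""
    grades = entity.get("grades") or []  # missing or empty list -> "NA"
    if grades:
        grade_range = f"{grades[0]['name']} - {grades[-1]['name']}"
    else:
        grade_range = "NA"
    return {
        "ID": entity["id"],
        "Name": entity["name"],
        "Type": entity["type"],
        "Grades": grade_range,
        "Phone": entity.get("phone_number", "NA"),
    }


def fetch_rows(session=None):
    """Send the prepared request and return one row per entity (needs network)."""
    session = session or requests.Session()
    response = session.send(build_request(), timeout=30)
    response.raise_for_status()  # surfaces a 406 instead of failing on .json()
    return [to_row(entity) for entity in response.json()["entities"]]


# Offline check: inspect what would be sent, without contacting the server.
prepared = build_request()
query = parse_qs(urlsplit(prepared.url).query)

# Hand-written sample shaped like the fields read above; not real API output.
sample = {
    "id": "1019701500",
    "name": "Muncie Virtual Academy (1500)",
    "type": "school",
    "grades": [{"name": "Kindergarten"}, {"name": "Grade 12"}],
}
sample_row = to_row(sample)
```

With network access, pd.DataFrame(fetch_rows()) should give the same kind of frame as the loop above. Unlike the original expression, to_row also falls back to 'NA' when 'grades' is present but empty.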