I am using BeautifulSoup to scrape data from
https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019
There are two pages of information in total, and links such as 1 and 2 at the top and bottom of the page navigate between them. These links use __doPostBack:
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
The problem is that when I navigate from one page to another, the URL does not change; only the last argument of the link changes, i.e. for page 1 it is Page$1 and for page 2 it is Page$2. How do I use BeautifulSoup to iterate over the pages and extract the information? The form data is as follows:
ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page$2
There is also a field called __VIEWSTATE in the form data, but its contents are very large.
I have looked at several solutions and posts that suggest inspecting the parameters of the POST call and reusing them, but I cannot make sense of the parameters sent in the POST request.
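The link and the form data describe the same request: the two arguments of __doPostBack are what the browser submits as __EVENTTARGET and __EVENTARGUMENT. The following is a minimal sketch of pulling those two values out of a pager link; `parse_postback` is a hypothetical helper name, not part of BeautifulSoup or requests, and the pattern assumes the href has the exact shape shown in the question.

```python
import re
from html import unescape

# Matches __doPostBack('<target>','<argument>') as it appears in the pager links.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) for a __doPostBack href, else None."""
    # unescape() handles hrefs whose quotes are still HTML-encoded (&#39;).
    match = POSTBACK_RE.search(unescape(href))
    return match.groups() if match else None


href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
target, argument = parse_postback(href)
form_fields = {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}
print(form_fields)
# {'__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView2', '__EVENTARGUMENT': 'Page$2'}
```

The remaining fields, including __VIEWSTATE, do not need to be understood individually; they can be copied from the page's form inputs and sent back unchanged.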
You can use this example to load the next page of this site using requests:
import requests
from bs4 import BeautifulSoup

url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


def load_page(soup, page_num):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }

    # Copy every named input of the current page; this carries over
    # __VIEWSTATE and the other hidden fields unchanged.
    payload = {}
    for inp in soup.select("input[name]"):
        payload[inp["name"]] = inp.get("value")

    # Set the postback fields afterwards so that same-named hidden inputs
    # on the page, if there are any, cannot overwrite them.
    payload.update({
        "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTARGUMENT": "Page${}".format(page_num),
        "__LASTFOCUS": "",
        "__ASYNCPOST": "true",
    })
    payload["ctl00$ContentPlaceHolder1$ddl_District"] = "019"
    payload["ctl00$ContentPlaceHolder1$rdo_Govt_Flag"] = "G"

    # This checkbox does not appear in the form data shown in the question,
    # so leave it out (pop avoids a KeyError if it is already absent).
    payload.pop("ctl00$ContentPlaceHolder1$chk_Available", None)

    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    soup = BeautifulSoup(
        requests.post(api_url, data=payload, headers=headers).content,
        "html.parser",
    )
    return soup


# print hospitals from first page:
for h5 in soup.select("h5"):
    print(h5.text)

# load second page
soup = load_page(soup, 2)

# print hospitals from second page
for h5 in soup.select("h5"):
    print(h5.text)
Prints:
AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
Calcutta National Medical College and Hospital (Government Hospital)
CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
College of Medicine Sagore Dutta Hospital (Government Hospital)
ESI Hospital Maniktala (Government Hospital)
ESI Hospital Sealdah (Government Hospital)
I.D. And B.G. Hospital (Government Hospital)
M R Bangur Hospital (Government Hospital)
Medical College and Hospital, Kolkata, (Government Hospital)
Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
R. G. Kar Medical College and Hospital (Government Hospital)
Sambhunath Pandit Hospital (Government Hospital)
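
If the number of pages is not known in advance, the page numbers can be read from the pager links instead of being hard-coded. The following is a minimal sketch under stated assumptions: `pager_pages` and `iter_pages` are hypothetical helper names, the page shown first is assumed to be page 1 and to have no link of its own, and every request reuses the form fields of the first page, which may or may not be accepted by the server. `load_page` is passed in as a parameter so the sketch stays self-contained.

```python
import re

# The pager links carry the page number in their last argument, e.g. 'Page$2'.
PAGE_ARG_RE = re.compile(r"Page\$(\d+)")


def pager_pages(hrefs):
    """Return the sorted, de-duplicated page numbers found in the given hrefs."""
    pages = {1}  # assumption: the page currently shown is page 1 and is not a link
    for href in hrefs:
        match = PAGE_ARG_RE.search(href)
        if match:
            pages.add(int(match.group(1)))
    return sorted(pages)


def iter_pages(first_page, hrefs, load_page):
    """Yield the first page, then each further page listed in its pager links."""
    yield first_page
    for page_num in pager_pages(hrefs)[1:]:
        # Assumption: the first page's form state is valid for every request.
        yield load_page(first_page, page_num)


# Usage with the load_page function and soup object from the example above:
#
#   hrefs = [a["href"] for a in soup.select("a[href]")]
#   for page in iter_pages(soup, hrefs, load_page):
#       for h5 in page.select("h5"):
#           print(h5.text)
```

If the server rejects the reused form state, a variant that passes the most recently loaded full page to `load_page` may be needed instead.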