I am trying to write a program that performs a chemical search on https://echa.europa.eu/ and retrieves the result. The "Search for Chemicals" field is in the middle of the main webpage. I want to get the resulting URL for each chemical by searching its CAS number (e.g. 67-56-1), but the URL I get back does not include the CAS number I provided.
I tried inserting a different CAS number (71-23-8) into the "p_p_id" field, but it didn't give the expected search result:
https://echa.europa.eu/search-for-chemicals?p_p_id=71-23-8
I also examined the headers of the GET requests issued by Chrome, and they did not include the CAS number either.
Is the website using variables to store the input query? Is there a way, or a tool, to get a resulting URL that includes the searched CAS number?
Once I figure this out, I'll be using Python to fetch the data and save it as an Excel file.
Thanks.
You need to get the JSESSIONID cookie by requesting the main URL once, then send a POST to https://echa.europa.eu/search-for-chemicals. That POST also needs some required URL parameters:
query="71-23-8"
millis=$(($(date +%s%N)/1000000))
curl -s -I -c cookie.txt 'https://echa.europa.eu/search-for-chemicals'
curl -s -L -b cookie.txt 'https://echa.europa.eu/search-for-chemicals' \
--data-urlencode "p_p_id=disssimplesearch_WAR_disssearchportlet" \
--data-urlencode "p_p_lifecycle=1" \
--data-urlencode "p_p_state=normal" \
--data-urlencode "p_p_col_id=column-1" \
--data-urlencode "p_p_col_count=2" \
--data-urlencode "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action=doSearchAction" \
--data-urlencode "_disssimplesearch_WAR_disssearchportlet_backURL=https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2" \
--data-urlencode "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId=" \
--data "_disssimplesearchhomepage_WAR_disssearchportlet_formDate=$millis" \
--data "_disssimplesearch_WAR_disssearchportlet_searchOccurred=true" \
--data "_disssimplesearch_WAR_disssearchportlet_sskeywordKey=$query" \
--data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer=on" \
--data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox=on"
Using Python and scraping the result page with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
import time

url = 'https://echa.europa.eu/search-for-chemicals'
query = '71-23-8'

s = requests.Session()
s.get(url)  # first request just to obtain the JSESSIONID cookie

r = s.post(url,
    params = {
        "p_p_id": "disssimplesearch_WAR_disssearchportlet",
        "p_p_lifecycle": "1",
        "p_p_state": "normal",
        "p_p_col_id": "column-1",
        "p_p_col_count": "2",
        "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action": "doSearchAction",
        "_disssimplesearch_WAR_disssearchportlet_backURL": "https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2",
        "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId": ""
    },
    data = {
        "_disssimplesearchhomepage_WAR_disssearchportlet_formDate": int(round(time.time() * 1000)),
        "_disssimplesearch_WAR_disssearchportlet_searchOccurred": "true",
        "_disssimplesearch_WAR_disssearchportlet_sskeywordKey": query,
        "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer": "on",
        "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox": "on"
    }
)

# parse the results table: one tuple per substance row
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find("table")
data = [
    (
        cells[0].find("a").text.strip(),   # substance name
        cells[0].find("a")["href"],        # link to the substance page
        cells[0].find("div", {"class": "substanceRelevance"}).text.strip(),
        cells[1].text.strip(),
        cells[2].text.strip(),
        cells[3].find("a")["href"] if cells[3].find("a") else "",
        cells[4].find("a")["href"] if cells[4].find("a") else "",
    )
    for cells in (row.find_all("td") for row in table.find_all("tr"))
    if len(cells) > 0 and cells[0].find("a") is not None
]
print(data)
Note that I've set the timestamp parameter (the formDate param) in case it's actually checked on the server.
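Since you mention wanting to do this for several CAS numbers and save the output to Excel, here is a minimal sketch of how you could wrap the request above in a function, loop over a list of CAS numbers, and dump the names and URLs with pandas. It assumes the POST/parsing logic above keeps working; the names search_cas and results.xlsx are my own, and DataFrame.to_excel needs the openpyxl package installed.

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://echa.europa.eu/search-for-chemicals"

PARAMS = {
    "p_p_id": "disssimplesearch_WAR_disssearchportlet",
    "p_p_lifecycle": "1",
    "p_p_state": "normal",
    "p_p_col_id": "column-1",
    "p_p_col_count": "2",
    "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action": "doSearchAction",
    "_disssimplesearch_WAR_disssearchportlet_backURL": "https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2",
    "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId": "",
}

def search_cas(session, cas):
    """POST one CAS number to the search portlet and return (name, href) tuples."""
    r = session.post(
        URL,
        params=PARAMS,
        data={
            "_disssimplesearchhomepage_WAR_disssearchportlet_formDate": int(time.time() * 1000),
            "_disssimplesearch_WAR_disssearchportlet_searchOccurred": "true",
            "_disssimplesearch_WAR_disssearchportlet_sskeywordKey": cas,
            "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer": "on",
            "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox": "on",
        },
    )
    soup = BeautifulSoup(r.text, "html.parser")
    table = soup.find("table")
    if table is None:  # no results, or the page layout changed
        return []
    return [
        (cells[0].find("a").text.strip(), cells[0].find("a")["href"])
        for cells in (row.find_all("td") for row in table.find_all("tr"))
        if cells and cells[0].find("a") is not None
    ]

if __name__ == "__main__":
    cas_numbers = ["67-56-1", "71-23-8"]  # whatever list you need
    session = requests.Session()
    session.get(URL)  # pick up the JSESSIONID cookie once per session

    rows = []
    for cas in cas_numbers:
        for name, href in search_cas(session, cas):
            rows.append({"cas_searched": cas, "name": name, "url": href})

    pd.DataFrame(rows).to_excel("results.xlsx", index=False)

If you have a long list of CAS numbers, consider adding a short time.sleep between requests so you don't hammer the site.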