I am new to Python and web scraping, and I am stuck on the pattern Workday uses to generate job post links. Normally, a job post's link follows a pattern like the one below, where every element can be extracted from the page:
https://<employer's domain>/en-US/<employer's subtext>/job/<location>/<job title>_<job ID>
For example, UPenn's Workday main page is:
https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn
and take this job post:
Program Coordinator for Community Care, JR00035938 | VPUL | Posted Yesterday
Following that pattern, this job post's link should be composed as follows:
And the correct one from the website is as follows:
As you can see, the element shown on the page (i.e., the HTML source scraped by Python) differs from what appears in the real link, even though the pattern is correct. In the source shown by Chrome's inspector, the job ID is JR00035938, without the extra "-1".
<span class="gwt-InlineLabel WEAG WD5F" title="JR00035938 | VPUL | Posted Yesterday" id="gwt-uid-106" data-automation-id="compositeSubHeaderOne">JR00035938 | VPUL | Posted Yesterday</span>
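For reference, the subheader text in that span can be split into its fields with plain string handling. This is only a minimal sketch: the function name `parse_subheader` is made up for illustration, and it assumes the text always consists of exactly three fields separated by " | ", which is true of the examples in this post but may not hold for every listing.

```python
def parse_subheader(text):
    """Split 'JOB_ID | LOCATION | POSTED' into a dict (assumes three fields)."""
    parts = [part.strip() for part in text.split(" | ")]
    if len(parts) != 3:
        raise ValueError(f"unexpected subheader format: {text!r}")
    job_id, location, posted = parts
    return {"job_id": job_id, "location": location, "posted": posted}


info = parse_subheader("JR00035938 | VPUL | Posted Yesterday")
print(info["job_id"])
```

Note that this only recovers the job ID as displayed on the page; it does not produce the extra suffix that appears in the real link.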
And this is not the only oddity; there are many such differences. Here are a couple of examples:
1)
Research Specialist A/B (Pennsylvania Muscle Institute) JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday
its code:
<div class="gwt-Label WCCP WLAP" data-automation-id="promptOption" id="promptOption-gwt-uid-99" data-automation-label="Research Specialist A/B (Pennsylvania Muscle Institute)" title="Research Specialist A/B (Pennsylvania Muscle Institute)" aria-label="Research Specialist A/B (Pennsylvania Muscle Institute)" role="link" tabindex="0">Research Specialist A/B (Pennsylvania Muscle Institute)</div>
<span class="gwt-InlineLabel WEAG WD5F" title="JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday" id="gwt-uid-100" data-automation-id="compositeSubHeaderOne">JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday</span>
Its link not only has an extra suffix after the job ID, but also drops or rewrites the part of the job title after the slash.
2)
Research Investigator/Research Investigator Sr. (Dept. of Radiology) JR00033660 | HUP | Posted Yesterday
its code:
<div class="gwt-Label WCCP WLAP" data-automation-id="promptOption" id="promptOption-gwt-uid-107" data-automation-label="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" title="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" aria-label="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" role="link" tabindex="0">
Research
<span class=" WHK2 WIK2 ">Investigator/Research</span>
Investigator Sr. (Dept. of Radiology)
</div>
<span class="gwt-InlineLabel WEAG WD5F" title="JR00033660 | HUP | Posted Yesterday" id="gwt-uid-108" data-automation-id="compositeSubHeaderOne">JR00033660 | HUP | Posted Yesterday</span>
And its link:
Finally, here is my question: what pattern does Workday use to generate a job post's link? Is there any way to get the link, given that Workday apparently tries to prevent others from extracting the data? There is no a/href/src for the job post link.
Thank you so much in advance!
The links are added dynamically from a GET request, so you can simply grab them from that response rather than trying to replicate the pattern.
import requests
# Request JSON rather than HTML; the response is parsed with r.json() below.
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json,application/xml'}
r = requests.get('https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn', headers=headers)
# Each list item holds its relative job URL under title -> commandLink.
links = ['https://wd1.myworkdaysite.com' + i['title']['commandLink'] for i in r.json()['body']['children'][0]['children'][0]['listItems']]
print(links)
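If the shape of the response ever differs from what the one-liner expects, the list comprehension will raise a KeyError or IndexError. Below is a rough sketch of a more defensive version of the same lookup. The helper name `extract_job_links` and the sample payload (including its `commandLink` value) are made up for illustration; only the nesting `body -> children[0] -> children[0] -> listItems -> title -> commandLink` is taken from the snippet above, and the real response may contain more than this.

```python
from urllib.parse import urljoin

BASE_URL = "https://wd1.myworkdaysite.com"


def extract_job_links(payload, base_url=BASE_URL):
    """Return absolute job URLs from the parsed JSON, or [] if the shape differs."""
    try:
        items = payload["body"]["children"][0]["children"][0]["listItems"]
    except (KeyError, IndexError, TypeError):
        return []
    links = []
    for item in items:
        title = item.get("title") or {}
        command_link = title.get("commandLink")
        if command_link:
            # commandLink is a path relative to the site root
            links.append(urljoin(base_url, command_link))
    return links


# Hypothetical payload with the same nesting as the real response.
sample = {
    "body": {"children": [{"children": [{"listItems": [
        {"title": {"commandLink": "/en-US/recruiting/example/job/Example-Location/Example-Title_JR00000000-1"}},
        {"title": {}},  # an item without a link is skipped
    ]}]}]}
}
print(extract_job_links(sample))
```

With the real request you would call `extract_job_links(r.json())` instead of passing the sample.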