I am extremely new to Python and web scraping. I want to build a simple script for checking the NVIDIA driver page for new versions from time to time using the code below.
from bs4 import BeautifulSoup
import requests
url = "https://www.nvidia.com/en-us/drivers/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# Find the version number (572.83)
version_number = soup.find('td', class_='version').text
print(f"Defined version number: {version_number}")
# Check if the correct version number is present
expected_version = "572.83"
if version_number != expected_version:
    print(f"Warning: Incorrect version ({version_number}) found instead of {expected_version}")
I expected to get the following output when running this script:
Defined version number: 572.83
When I actually run the script, I get the following output:
Defined version number: ~ddVersion_td~
Warning: Incorrect version (~ddVersion_td~) found instead of 572.83
Can anyone shed some light on what I am missing? I have searched through a few different forums and the documentation, but I haven't been able to dedicate much time to it.
As user @marktolonen points out in the comments, the issue is that the page loads with a placeholder in the element you select. Once the user selects a specific product, the page updates the element with the version you're after. This happens with JavaScript, and pages loaded with requests in Python are just the plain HTML file; no scripts are executed.
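You can see this for yourself by checking the raw HTML that requests receives; the placeholder is already in the document before any script runs. A minimal sketch, using the same URL as your script:

import requests

response = requests.get("https://www.nvidia.com/en-us/drivers/")
# The placeholder text is in the raw HTML; this should print True
print('~ddVersion_td~' in response.text)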
You could do what Mark suggests: get selenium and a web driver for a browser you have installed, and then puppeteer the browser to load the page, select the product, and run the selector you wrote. Since this uses an actual browser to load the page, all scripts will run as intended.
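For completeness, a selenium version might look roughly like the sketch below. The clicks needed to select a product are omitted, since the exact element identifiers depend on the page, and the final selector simply reuses the td.version class from your script:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # requires Chrome and a matching driver installed
driver.get("https://www.nvidia.com/en-us/drivers/")
# ... click through the product selection here; the identifiers for
# those elements are page-specific and not shown in this sketch ...
wait = WebDriverWait(driver, 10)
cell = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "td.version")))
print(cell.text)
driver.quit()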
However, if you open the page in a browser and open the Developer Tools, you can monitor network traffic. If you do that and then select a product (for this example, 'RTX' drivers), you'll notice that the page makes a few service requests, including:
https://gfwsl.geforce.com/services_toolkit/services/com/nvidia/services/AjaxDriverService.php?func=DriverManualLookup&psid=122&pfid=935&osID=57&languageCode=1033&beta=null&isWHQL=1&dltype=-1&dch=1&upCRD=null&qnf=0&ctk=null&sort1=1&numberOfResults=1
If you load that URL in the browser, you'll see that it looks like JSON with the information you require. This Python script loads just that data and then accesses a field that looks like it holds what you want:
import json
import urllib.request

url = 'https://gfwsl.geforce.com/services_toolkit/services/com/nvidia/services/AjaxDriverService.php?func=DriverManualLookup&psid=122&pfid=935&osID=57&languageCode=1033&beta=null&isWHQL=1&dltype=-1&dch=1&upCRD=null&qnf=0&ctk=null&sort1=1&numberOfResults=1'
# Fetch the service response and parse it as JSON
response = urllib.request.urlopen(url)
data = json.loads(response.read().decode('utf-8'))
# The version string sits in the first result's download info
print(data['IDS'][0]['downloadInfo']['Version'])
Output:
572.83
You may need to have a closer look at the data; perhaps there's another field that's more specific to what you need, and you may need the data for another product, but the basic principle should be clear.
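One quick way to take that closer look is to pretty-print the part of the response the script above already extracts, so you can see every field that's available:

import json

# Assumes 'data' from the script above; dumps all fields of the first result
print(json.dumps(data['IDS'][0]['downloadInfo'], indent=2))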
In this case, loading the service request is probably simpler in code than using selenium, and it is far faster and more efficient. However, NVIDIA could change the syntax of the services in their back end at any time, and your script could break unexpectedly even when the page remains outwardly the same.
However, the same is true for the front end: a designer could move buttons around or change identifiers, and your selenium script could break just as easily. So I think going the way of accessing the services in the back end makes the most sense.
Note that many sites will require you to load the main page first to obtain some cookie or header that is then required for the service requests. You can do that from Python as well, but of course it complicates matters to the point where the ease of writing a Selenium script may become preferable. Here, that's not the case.
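If you ever do run into such a site, the usual pattern is a requests.Session, which keeps whatever cookies the main page sets and sends them along with the service request. A minimal sketch of that general pattern; the URLs here are placeholders, not NVIDIA's:

import requests

with requests.Session() as session:
    # Load the main page first so the session picks up any cookies it sets
    session.get("https://www.example.com/")
    # The same session sends those cookies along with the service request
    data = session.get("https://www.example.com/api/service").json()
    print(data)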