I'm trying to scrape this website to retrieve the title and body contents ("Description" and "Features"), as well as the PDF link. However, when I try to extract the text with the XPath /html/body/center[2]/table/tbody/tr[3]/td/font/text(), I get an empty list, even though the page clearly has a block of text after /font.
This is my code:
import requests
from lxml import html
from urllib.request import urlretrieve

url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

try:
    # Send an HTTP GET request to the URL
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML content of the page using lxml
    page_content = html.fromstring(response.text)

    # Extract the title using XPath
    title_element = page_content.xpath("/html/body/center[2]/table/tbody/tr[2]/td/strong/font")
    title = title_element[0].text_content() if title_element else "Title not found"

    # Extract the body using XPath
    body_elements = page_content.xpath("/html/body/center[2]/table/tbody/tr[3]/td/font/text()")
    body = "\n".join(body_elements) if body_elements else "Body not found"

    # Extract the download link
    download_link_element = page_content.xpath('//a[starts-with(@href, "/pdf-file/1060035/GME/PC817/1")]')
    if download_link_element:
        download_link = download_link_element[0].attrib['href']
        download_url = f"https://datasheetspdf.com{download_link}"
    else:
        download_url = "Download link not found"

    # Download the file
    file_name = "PC817_datasheet.pdf"
    urlretrieve(download_url, file_name)

    print(f"Title: {title}")
    print(f"Body:\n{body}")
    print(f"Downloaded {file_name} successfully.")
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
I appreciate any help.
You might be able to get that table with pandas alone:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
df = pd.read_html('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')[2]
print(df)
Result in terminal:
             0  1
0         Part  PC817
1  Description  4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
2      Feature  Production specification 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER FEATURES z Current transfer ratio (CTR:50%-600% at IF=5mA,VCE=5V) z High isolation voltage between inputc and output (Viso=5000V rms) z Creepage distance>7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048) PC817 Series Description The PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option. DIP4L APPLICATIONS z Programmable controllers z System appliances.
3  Manufacture  GME
4    Datasheet  Download PC817 Datasheet
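If you want individual fields rather than the whole frame, one option is to turn the two columns into a dictionary. The sketch below is only an illustration: it builds a small stand-in DataFrame by hand (values copied from the output above, long rows left out) so it runs without hitting the site, and the name fields is my own.

```python
import pandas as pd

# Hand-built stand-in for the frame printed above, so the sketch runs offline.
# With the real page you would start from the result of pd.read_html(...)[2].
df = pd.DataFrame([
    ["Part", "PC817"],
    ["Description", "4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER"],
    ["Manufacture", "GME"],
])

# Column 0 holds the labels and column 1 the values,
# so index by column 0 and convert column 1 to a dict.
fields = df.set_index(0)[1].to_dict()

print(fields["Part"])
print(fields["Description"])
```

From there, fields["Description"] and fields.get("Feature") would give you the title and body text, assuming the live table keeps the same labels.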
See the pandas documentation for details on read_html.
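As for why the original XPath returns an empty list: a common cause, and I am assuming rather than confirming that it applies to this page, is that XPaths copied from the browser include tbody. Browsers add that element when they build the DOM, but lxml parses the raw HTML and does not insert it. The sketch below uses a made-up HTML fragment, not the real page, to show the difference.

```python
from lxml import html

# Hypothetical fragment standing in for raw HTML that has no <tbody> tag.
RAW_HTML = """
<html><body>
<table>
  <tr><td><font>PC817</font></td></tr>
  <tr><td><font>4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER</font></td></tr>
</table>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# Path as a browser would report it: includes tbody, which lxml never added.
with_tbody = tree.xpath("//table/tbody/tr/td/font/text()")

# Same path with the tbody step removed.
without_tbody = tree.xpath("//table/tr/td/font/text()")

print(with_tbody)
print(without_tbody)
```

If this is the issue, dropping tbody from the paths in the question (or using a relative path such as //table//tr) should return the text.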
EDIT: to download the actual PDF file:
import requests
from bs4 import BeautifulSoup as bs

# The datasheet page links to an intermediary page, not to the PDF itself
r = requests.get('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')
intermediary_url = 'https://datasheetspdf.com' + bs(r.text, 'html.parser').select_one('a[href^="/pdf-file/"]').get('href')

# The intermediary page embeds the actual PDF in an iframe
r = requests.get(intermediary_url)
true_pdf_url = bs(r.text, 'html.parser').select_one('iframe[class="pdfif"]').get('src')

# Stream the PDF to disk in chunks instead of loading it all into memory
with requests.get(true_pdf_url, stream=True) as r:
    r.raise_for_status()
    with open('pdf_file.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
print('done')
The file will be saved as pdf_file.pdf in the directory you run the script from.
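Two small safeguards may be worth adding around the download. This is a sketch under my own assumptions: the helper names resolve and looks_like_pdf are hypothetical, and I have not checked whether the site ever returns relative or protocol-relative links. urljoin handles those cases either way, and a PDF file normally begins with the bytes %PDF-, so checking for them catches the case where an HTML error page gets saved instead.

```python
from urllib.parse import urljoin

PAGE_URL = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"


def resolve(href: str) -> str:
    """Turn a relative, protocol-relative, or absolute href into an absolute URL."""
    return urljoin(PAGE_URL, href)


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the usual PDF header."""
    return data.startswith(b"%PDF-")


absolute = resolve("/pdf-file/1060035/GME/PC817/1")
print(absolute)

# Sample byte strings, made up for illustration.
print(looks_like_pdf(b"%PDF-1.4\n..."))
print(looks_like_pdf(b"<html>blocked</html>"))
```

In the download code you could pass the href and src values through resolve instead of concatenating strings, and call looks_like_pdf on the first chunk before writing anything to disk.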
See the Requests documentation for more details.