Scraping a website using requests and LXML in Python


I'm trying to scrape this website to retrieve the title and body content ("Description" and "Features"), as well as the PDF link. When I extract the text with the XPath /html/body/center[2]/table/tbody/tr[3]/td/font/text(), I get an empty list, even though inspecting the page shows a block of text after /font.

This is my code:

import requests
from lxml import html
from urllib.request import urlretrieve


url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

try:
    # Send an HTTP GET request to the URL
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML content of the page using lxml
    page_content = html.fromstring(response.text)

    # Extract the title using XPath
    title_element = page_content.xpath("/html/body/center[2]/table/tbody/tr[2]/td/strong/font")
    title = title_element[0].text_content() if title_element else "Title not found"

    # Extract the body using XPath
    body_elements = page_content.xpath("/html/body/center[2]/table/tbody/tr[3]/td/font/text()")
    body = "\n".join(body_elements) if body_elements else "Body not found"

    # Extract the download link
    download_link_element = page_content.xpath('//a[starts-with(@href, "/pdf-file/1060035/GME/PC817/1")]')
    if download_link_element:
        download_link = download_link_element[0].attrib['href']
        download_url = f"https://datasheetspdf.com{download_link}"
    else:
        download_url = "Download link not found"

    # Download the file
    file_name = "PC817_datasheet.pdf"
    urlretrieve(download_url, file_name)
    print(f"Title: {title}")
    print(f"Body:\n{body}")
    print(f"Downloaded {file_name} successfully.")

except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

I appreciate any help.


Solution
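
A common reason for an empty list from an XPath copied out of the browser is the tbody step: browsers insert a tbody element into the DOM even when the served HTML has none, and the parser works on the served HTML. Whether that is the cause on this particular page is an assumption and has not been checked against the site. The sketch below illustrates the effect on made-up markup, using the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw server response.
# There is no <tbody> here; a browser would add one when building its DOM.
RAW_HTML = """
<html><body>
  <table>
    <tr><td><font>PC817</font></td></tr>
    <tr><td><font>4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER</font></td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path in the style copied from browser dev tools: includes tbody.
with_tbody = root.findall("body/table/tbody/tr/td/font")

# Same path without tbody, matching the markup as actually served.
without_tbody = root.findall("body/table/tr/td/font")

print(len(with_tbody))
print([el.text for el in without_tbody])
```

If this is what is happening, dropping /tbody from the XPath expressions in the question would be the first thing to try with lxml.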

  • You might be able to get that table with pandas alone:

    import pandas as pd
    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.max_columns', None)
    
    df = pd.read_html('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')[2]
    print(df)
    

    Result in terminal:

       0            1
    0  Part         PC817
    1  Description  4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
    2  Feature      Production specification 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER FEATURES z Current transfer ratio (CTR:50%-600% at IF=5mA,VCE=5V) z High isolation voltage between inputc and output (Viso=5000V rms) z Creepage distance>7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048) PC817 Series Description The PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option. DIP4L APPLICATIONS z Programmable controllers z System appliances.
    3  Manufacture  GME
    4  Datasheet    Download PC817 Datasheet
    

    Pandas documentation can be found here.

    EDIT: to download the actual PDF file:

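    import requests
    from bs4 import BeautifulSoup as bs
    
    base_url = 'https://datasheetspdf.com'
    
    # Step 1: the datasheet page links to an intermediary "/pdf-file/..." page
    r = requests.get(base_url + '/pdf/1060035/GME/PC817/1')
    r.raise_for_status()
    intermediary_url = base_url + bs(r.text, 'html.parser').select_one('a[href^="/pdf-file/"]').get('href')
    
    # Step 2: the intermediary page embeds the actual PDF in an iframe
    r = requests.get(intermediary_url)
    r.raise_for_status()
    true_pdf_url = bs(r.text, 'html.parser').select_one('iframe[class="pdfif"]').get('src')
    
    # Step 3: stream the PDF to disk in chunks instead of loading it all into memory
    with requests.get(true_pdf_url, stream=True) as r:
        r.raise_for_status()
        with open('pdf_file.pdf', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    print('done')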
    

    The file is saved as pdf_file.pdf in the current working directory. For the Requests documentation, go here.
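
The download code builds the intermediary URL by prepending the site root to the href, which assumes the href is always site-relative. As a hedged alternative, urllib.parse.urljoin from the standard library resolves site-relative, protocol-relative, and absolute values alike. The helper name absolute_url and the cdn.example.com addresses below are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"


def absolute_url(page_url: str, href: str) -> str:
    """Resolve an href or src attribute against the page it was found on."""
    return urljoin(page_url, href)


# Site-relative href, the form matched by a[href^="/pdf-file/"]
relative = absolute_url(PAGE_URL, "/pdf-file/1060035/GME/PC817/1")

# Hypothetical protocol-relative and absolute values, e.g. from an iframe src
protocol_relative = absolute_url(PAGE_URL, "//cdn.example.com/files/PC817.pdf")
absolute = absolute_url(PAGE_URL, "https://cdn.example.com/files/PC817.pdf")

print(relative)
print(protocol_relative)
print(absolute)
```

The same call can be applied to the iframe src before requesting it, in case that value is not a full URL.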