python selenium-webdriver

Extract specific div content from soup in Python?


I have the URL below to scrape:

l = "https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm?action=3&prosecution_summary_id=3420&searchParams=M5%2C%3A%2FXT%2A%5CCYZ%40O%3B%20W%5F%2AYN5%5E%3EK99%2A%29W%3CU%3FV%23DH%5BZ4%247TRPH%3BJQH%229%3FD%3C%26Z%40CY%26%0AM7EFH%21%25%21%3A%23%3DV%40%3A%2A%5F%3AB8%2A%5DR%3BB%25%5E9%5B2D%22I2KE65NEY7M%21%2DU%40%2B8%22J%29Y%23%24LNJ%40DX%24%0A%2F5YJ%3EP%27O%5FK04%5FG%5C%3E%290M4%2E%0A"

I wrote the code below to get the contents of this page:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions

# Run Chrome without opening a browser window
options = ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(l)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Matches every div whose class list includes cpRow
cp_row_divs = soup.find_all('div', {'class': 'cpRow'})

But this returns the data from both the cpRow and the cpRow odd tags together. I want the two kinds of tags separately, with the data in the variables below:

1)FISCAL YEAR = 2023
2)summary = October 7, 2022
Zachary Czubak was sentenced to serve 5 years of probation, complete 180 hours of community work service, and pay a $66,000 criminal fine.CITATION: 18 U.S.C. 371
3)full_text = Zachary Czubak, Patrick Fleming and their co-defendant tampered with federally mandated monitoring devices on private and commercial diesel vehicles and removed required air pollution control equipment on at least 37 vehicles between July 2019 and September 2020.In July 2019, the co-owners of Arm Rippin Toys, including Czubak and Fleming, entered into an agreement to engage in “tuning and deleting” customers’ diesel vehicles. This process involves the removal of emissions control systems which are designed to reduce pollutants being emitted from the vehicles. Under normal operating conditions, an on-board diagnostic (OBD) system will detect any removal and/or malfunction of a vehicle’s emissions control equipment. By modifying OBDs on vehicles, Arm Rippin Toy’s co-owners and employees falsified, tampered with and rendered inaccurate the vehicles’ monitoring devices so that the modified vehicle could continue to function despite the removal or deletion of emissions control equipment. In total, Arm Rippin Toys collected approximately $100,000 for performing unlawful deletes and tunes on diesel vehicles.
February 10, 2023
Patrick Fleming was sentenced to serve 5 years of probation and pay a $66,000 fine.
CITATION: 18 U.S.C. 371
STATUTE:Clean Air Act (CAA)
Title 18 U.S. Criminal Code (TITLE 18)

Any help would be appreciated.


Solution

  • With the soup.select() method you can use CSS selectors to target specific elements.

    For example, soup.find_all('div', {'class': 'cpRow'}) can be rewritten as soup.select('div.cpRow')

    Then, to separate the elements that have the odd class from those that do not, you can use:

    base = soup.select('div.cpRow:not(.odd)')
    odd = soup.select('div.cpRow.odd')
    
    

    For more information on CSS selectors, see https://www.w3schools.com/cssref/css_selectors.php

    Does this answer your question?

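    By the way, instead of Selenium you can use requests, a much lighter library. This only works if the content is present in the HTML the server returns rather than rendered later by JavaScript: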

    import requests
    from bs4 import BeautifulSoup

    # Fetch the raw HTML without starting a browser
    response = requests.get(l)
    page_source = response.content
    soup = BeautifulSoup(page_source, 'html.parser')
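To show how the two selectors above split the rows and how the text can then be pulled into variables, here is a minimal sketch. The markup in SAMPLE_HTML is invented for illustration, since the real page's structure is not shown in the question, and split_rows is a hypothetical helper name. The selectors would likely need adjusting against the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for the real page, whose structure is assumed.
SAMPLE_HTML = """
<div class="cpRow"><span>FISCAL YEAR</span> 2023</div>
<div class="cpRow odd"><p>October 7, 2022</p><p>Sentenced to probation.</p></div>
<div class="cpRow"><p>STATUTE:</p><p>Clean Air Act (CAA)</p></div>
"""


def split_rows(html):
    """Return (base_texts, odd_texts) for div.cpRow elements in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Rows that carry cpRow but not odd
    base = soup.select("div.cpRow:not(.odd)")
    # Rows that carry both cpRow and odd
    odd = soup.select("div.cpRow.odd")

    def to_text(tag):
        # Join nested text fragments with a single space and trim whitespace
        return tag.get_text(" ", strip=True)

    return [to_text(t) for t in base], [to_text(t) for t in odd]


base_text, odd_text = split_rows(SAMPLE_HTML)
print(base_text)
print(odd_text)
```

Each list holds one string per matching div, so individual values such as the fiscal year can then be picked out by index or by checking the label at the start of each string.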