I am using BeautifulSoup to extract image URLs from an HTML structure in Python. The HTML structure contains several <img> tags with the src attribute. I've implemented the _get_images function, which uses BeautifulSoup's find_all("img") method to retrieve the image URLs. However, I'm facing an issue where some image URLs come back as None even though the src attribute is present in the HTML.
Here's my _get_images function:
def _get_images(self, soup):
    article_images = []
    images = soup.find_all("img")
    for img in images:
        src = img.get('src')
        article_images.append(src)
    return article_images
The output I get shows that some URLs are None, while others are correctly retrieved. I have checked the HTML structure, and the <img> tags do contain the src attribute.
What could be causing this problem, and how can I resolve it to fetch all the image URLs and titles correctly? My goal is to have a list of URLs, where each entry is the src of an image, and to ensure that no None values are present in the list.
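The img elements are possibly dynamic, that is, added or populated by JavaScript after the initial page load, in which case the HTML that BeautifulSoup parses may not yet contain their src values.
To extract the value of the src attribute from the <img> elements, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy: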
Code block:
def _get_images(self):
    # `driver` is assumed to be an already-initialized Selenium WebDriver
    wait = WebDriverWait(driver, 20)  # wait up to 20 seconds
    elements = wait.until(EC.visibility_of_all_elements_located((By.TAG_NAME, "img")))
    return [elem.get_attribute("src") for elem in elements]
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
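If you stay with BeautifulSoup, the goal of a list with no None values can be sketched as below. This is a minimal sketch, not a definitive fix: collect_image_urls and its fallback_attrs parameter are hypothetical names, and the data-src fallback is only an assumption that the page keeps the real URL in a differently named attribute, which may not hold for your HTML.

```python
def collect_image_urls(images, fallback_attrs=("data-src",)):
    """Return the non-empty image URLs from an iterable of <img> tags.

    `images` is anything whose items support .get(name), such as the
    result of soup.find_all("img"). `fallback_attrs` is a hypothetical
    tuple of attribute names to try when src is missing or empty.
    """
    urls = []
    for img in images:
        # Try src first, then each fallback attribute in order.
        for attr in ("src", *fallback_attrs):
            value = img.get(attr)
            if value:
                urls.append(value)
                break  # stop at the first attribute that has a value
    return urls
```

Inside _get_images you could then return collect_image_urls(soup.find_all("img")). Tags for which no attribute yields a value are skipped rather than appended as None.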