python selenium-webdriver beautifulsoup html-parser

Issue with BeautifulSoup: Some Image URLs Returning as None Despite `src` Attribute Presence


I am using BeautifulSoup to extract image URLs from an HTML structure in Python. The HTML structure contains several <img> tags with the src attribute. I've implemented the _get_images function, which uses BeautifulSoup's find_all("img") method to retrieve the image URLs. However, I'm facing an issue where some image URLs are returning as None even though the src attribute is present in the HTML.

Here's my _get_images function:

def _get_images(self, soup):
    article_images = []
    images = soup.find_all("img")

    for img in images:
        src = img.get('src')
        article_images.append(src)

    return article_images

The output I get shows that some URLs are None, while others are correctly retrieved. I have checked the HTML structure, and the <img> tags do contain the src attribute. What could be causing this problem, and how can I resolve it to fetch all the image URLs correctly?

My goal is to have a list of URLs, where each entry is the src of an image, and to ensure that no None values are present in the list.


Solution

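  • Possibly the img elements are dynamic, i.e. their src is set by JavaScript after the initial page load, so it is not yet present in the HTML that BeautifulSoup parses.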


    Solution

    To extract the values of the src attribute from the <img> elements with Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), for example with the following locator strategy:

    Code block:

    def _get_images(self, driver):
        # Wait up to 20 seconds until every <img> element is visible, then read each src.
        images = WebDriverWait(driver, 20).until(
            EC.visibility_of_all_elements_located((By.TAG_NAME, "img"))
        )
        return [img.get_attribute("src") for img in images]
    

    Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
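
If you would rather keep the BeautifulSoup side of the code, the sketch below shows one way to guarantee that the returned list never contains None. It is only an illustration under assumptions: the names extract_image_urls, CANDIDATE_ATTRS and base_url are hypothetical, and the data-src / data-lazy-src fallbacks are a guess at a common lazy-loading convention that your page may or may not use. The function accepts anything with a .get() method, which includes the tags returned by soup.find_all("img").

```python
from urllib.parse import urljoin

# Attribute names to try, in order. "src" is the standard one; the data-*
# names are assumed lazy-loading conventions and may not apply to your page.
CANDIDATE_ATTRS = ("src", "data-src", "data-lazy-src")


def extract_image_urls(images, base_url=""):
    """Return the non-empty image URLs found on tag-like objects.

    `images` is any iterable of objects exposing .get(name), such as the
    result of soup.find_all("img"). Tags without a usable URL are skipped,
    so the returned list never contains None.
    """
    urls = []
    for img in images:
        # Take the first candidate attribute that has a non-empty value.
        src = next((img.get(name) for name in CANDIDATE_ATTRS if img.get(name)), None)
        if src:
            # Resolve relative URLs only when a base URL was supplied.
            urls.append(urljoin(base_url, src) if base_url else src)
    return urls
```

A call such as extract_image_urls(soup.find_all("img")) would then replace the loop in the original _get_images. If some images still go missing, the src values are probably not in the parsed HTML at all, and the Selenium wait shown earlier is the more likely fix.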