Tags: python, web-scraping, beautifulsoup, python-requests, scrapinghub

Trying to scrape image URLs with Beautiful Soup and Python, but unable to get them


I am scraping this page: https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds

I want to get the image URLs from it. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import json


AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']

html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(html_1,'lxml')
address = soup_1.find('div',attrs={"class" : identity[0]})

for x in address.find_all('div', class_ = 'filmstrip-imgContainer'):
    print(x.find('div').get('img'))

But I am getting the following output:

None
None
None
None
None
None
None

The following is a screenshot of the HTML from which I am trying to get the image URLs:

html code from where image urls are being fetched

This is the section of page from where I'd like to get the urls

image urls

What changes do I need to make to the code so that I get all the image URLs?


Solution

  • The image URLs are loaded dynamically from data in a script tag, so the img elements are not in the HTML that Beautiful Soup receives. You can easily extract them from the .text of the response with a regex. The regex below specifically matches the 7 images you say you want to retrieve and show in the picture.

    import requests, re
    
    r = requests.get('https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds').text
    p = re.compile(r'imgurl":"(.*?)"')
    links = p.findall(r)
    print(links)
    

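    Regex explanation:

    imgurl":" matches that literal text, including the quotes and the colon. (.*?) is a capture group that lazily matches as few characters as possible, and the final " ends the match at the next double quote. findall returns only the captured group, which is the URL itself.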
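
To see the regex in isolation, here is a minimal sketch that runs the same pattern against a made-up string instead of the live page. The sample_html content, the example.com URLs, and the extract_image_urls helper are all assumptions for illustration; the real page's script data will look different.

```python
import re

# Hypothetical excerpt of inline script content; the real page's data will differ.
sample_html = (
    '<script>window.__DATA__ = {"cards":['
    '{"name":"Card A","imgurl":"https://example.com/images/card-a.png"},'
    '{"name":"Card B","imgurl":"https://example.com/images/card-b.png"}'
    ']};</script>'
)

# Match the literal text imgurl":" and lazily capture everything
# up to the next double quote.
IMG_URL_PATTERN = re.compile(r'imgurl":"(.*?)"')


def extract_image_urls(page_text):
    """Return every captured imgurl value, de-duplicated in first-seen order."""
    urls = IMG_URL_PATTERN.findall(page_text)
    return list(dict.fromkeys(urls))


print(extract_image_urls(sample_html))
```

The de-duplication step is an addition that is not in the answer's code; it only matters if the same URL appears more than once in the page source.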


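    If you decide to use Selenium instead, which is more expensive, you could match the rendered img elements with a CSS selector (assuming driver is a WebDriver that has already loaded the page):

    from selenium.webdriver.common.by import By
    links = [i.get_attribute('src') for i in driver.find_elements(By.CSS_SELECTOR, '.filmstrip-imgContainer img')]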