
Problem extracting elements inside an HTML <script> tag using BeautifulSoup in Python 3


I want to scrape product information from the div below, but when I prettify the HTML I cannot find that div in the output.

<div class="c2p6A5" data-qa-locator="product-item" data-tracking="product-card"

The data I am trying to fetch is in the following script tag. How can I extract data from it?

<script type="application/ld+json"></script>

My code is as follows:

import requests
from bs4 import BeautifulSoup

url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)

print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())

Solution

  • Just use .find() or .find_all() to locate the script tags.

    Doing that shows the content is actually JSON, so you can parse that element with json.loads() and have all the data available in a structured form.

    import json
    import re

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
    page = requests.get(url)
    print(page.status_code)

    soup = BeautifulSoup(page.text, 'html.parser')

    # The product data sits in <script type="application/ld+json"> tags.
    alpha = soup.find_all('script', {'type': 'application/ld+json'})

    # On this page the second ld+json block holds the product list.
    jsonObj = json.loads(alpha[1].text)

    for item in jsonObj['itemListElement']:
        name = item['name']
        price = item['offers']['price']
        currency = item['offers']['priceCurrency']
        # Keep the last URL segment (e.g. '.../InStock') and split it into words.
        availability = item['offers']['availability'].split('/')[-1]
        availability = [s for s in re.split("([A-Z][^A-Z]*)", availability) if s]
        availability = ' '.join(availability)

        # Use a separate name so the search URL above is not overwritten.
        item_url = item['url']

        print('Availability: %s  Price: %0.2f %s   Name: %s' %(availability,float(price), currency,name))
    

    Output:

    Availability: In Stock  Price: 82199.00 Rs.    Name: DELL INSPIRON 15 5570 - 15.6"HD - CI5 - 8THGEN - 4GB - 1TB HDD -  AMD RADEON 530 2GB GDDR5.
    Availability: In Stock  Price: 94599.00 Rs.    Name: DELL INSPIRON 15 3576 - 15.6"HD - CI7 - 8THGEN - 4GB - 1TB HRD - AMD Radeon 520 with 2GB GDDR5.
    Availability: In Stock  Price: 106399.00 Rs.    Name: DELL INSPIRON 15 5570 - 15.6"HD - CI7 - 8THGEN - 8GB - 2TB HRD -  AMD RADEON 530 2GB GDDR5.
    Availability: In Stock  Price: 17000.00 Rs.    Name: Dell Latitude E6420 14-inch Notebook 2.50 GHz Intel Core i5 4GB 320GB Laptop
    Availability: In Stock  Price: 20999.00 Rs.    Name: Dell Core i5 6410 8GB Ram Wi-Fi Windows 10 Installed ( Refurb )
    Availability: In Stock  Price: 18500.00 Rs.    Name: Core i-5 Laptop Dell 4GB Ram 15.6 " Display Windows 10 DVD+Rw ( Refurb )
    Availability: In Stock  Price: 8500.00 Rs.    Name: Laptop Dell D620 Core 2 Duo 80_2Gb (Used)
    ...
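
The availability clean-up in the loop above can be tried on its own, without fetching the page. This is only a sketch: the helper name humanize_availability is hypothetical, it uses re.findall instead of the re.split call in the answer, and it assumes the value ends in a CamelCase token that starts with a capital letter (anything before the first capital is dropped).

```python
import re

def humanize_availability(schema_url):
    """Turn a value such as 'https://schema.org/InStock' into 'In Stock'."""
    # Keep only the last path segment, e.g. 'InStock'.
    token = schema_url.rstrip('/').split('/')[-1]
    # Each match is one capital letter plus the non-capitals that follow it.
    words = re.findall(r'[A-Z][^A-Z]*', token)
    # Fall back to the raw token if it contains no capital letters.
    return ' '.join(words) if words else token

print(humanize_availability('https://schema.org/InStock'))     # In Stock
print(humanize_availability('https://schema.org/OutOfStock'))  # Out Of Stock
```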
    

    EDIT: To see the difference between the two JSON structures:

    jsonObj_0 = json.loads(alpha[0].text)
    jsonObj_1 = json.loads(alpha[1].text)
    
    print(json.dumps(jsonObj_0, indent=4, sort_keys=True))
    print(json.dumps(jsonObj_1, indent=4, sort_keys=True))
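
Since the page carries more than one ld+json block, relying on index 1 may break if the site reorders them. One possible alternative, sketched below, is to pick the block by the key the loop actually needs. The function name pick_item_list and the sample strings are made up for illustration; they are not taken from the real page.

```python
import json

def pick_item_list(raw_blocks):
    """Return the first decoded JSON object with an 'itemListElement' key, else None."""
    for raw in raw_blocks:
        if not raw:
            # Empty <script> tags yield None or an empty string.
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Not valid JSON, so skip this block.
            continue
        if isinstance(data, dict) and 'itemListElement' in data:
            return data
    return None

# Made-up stand-ins for the text of each ld+json <script> tag.
sample_blocks = [
    '{"@type": "Organization", "name": "Example shop"}',
    '{"@type": "ItemList", "itemListElement": [{"name": "Example laptop"}]}',
]

item_list = pick_item_list(sample_blocks)
print(item_list['itemListElement'][0]['name'])  # Example laptop
```

With the variables from the answer, this could be called as pick_item_list(tag.string for tag in alpha), checking for None before looping over the result.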