python beautifulsoup data-extraction

Why does extracting product data give me duplicate products?


When I use bs4, the scraped products come out duplicated. Despite many attempts, I have not been able to work out where the problem is or how to solve it.

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
from itertools import zip_longest


page = 'https://niceonesa.com/ar/'

headers = {
    'User-Argent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Viewer/97.9.3678.79'
}

productlink =[]
titles = []
brands=[]
prices=[]
offers = []
rates =[]
rate_starts=[]

for x in range(1,2):
    r= requests.get(f'https://niceonesa.com/ar/appcomponent--best-sales/?page={x}')
    soup = BeautifulSoup(r.content,'lxml')

    productlist = soup.find_all('div' , class_='product-container bg-white rounded-lg')



    for item in productlist:
        for link in item.find_all('a',href= True):
            productlink.append(page + link ['href'])

    # testlink = 'https://niceonesa.com/ar/gifts/travel-size/ola-hair-mini-straightener-brush-n-215-n16250/'
    for link in productlink:
        r = requests.get(link , headers=headers)
        soup = BeautifulSoup(r.content,'lxml')

        try:
            title  =soup.find('h1',class_='title').text.strip()
            titles.append(title)
            print(title)


        except:
            title='non'




        try:
            price = soup.find('h3',class_='preReductionPrice mb-2').text.strip()
            prices.append(price)

        except:
            price = 'non'
        try:
            offer = soup.find('h3',class_='sellingPrice text-nowrap')
            offers.append(offer)
        except:
            offer='non'
        try:
            rate = soup.find('span',class_='num-rating align-review').text.strip()
            rates.append(rate)

        except:
            rate = 'non'
        try:
            rate_start =soup.find('div', class_='num-rating start').text.strip()
            rate_starts.append(rate_start)
        except:
            rate_start = 'non'

    brand = productlist.find_all('h3', class_='brand-product mb-1')
    for i in range(len(brand)):
        brands.append(brand[i].text)
        filelist = ([titles, brands, prices, rates, rate_starts, productlink])
        exported = zip_longest(*filelist)
        with open('oo.csv', 'w' , encoding ='utf-8-sig' , newline='') as filecsv:
            wr = csv.writer(filecsv)
            wr.writerow(['title','brands','price','offer','rate','rate_starts','productlink'])
            wr.writerows(exported)

Thank you

I tried every solution I could find by searching online, but none of them worked, and I still do not understand the problem.


Solution

  •     for i in range(len(brand)):
            brands.append(brand[i].text)
            filelist = ([titles, brands, prices, rates, rate_starts, productlink])
            exported = zip_longest(*filelist)
            with open('oo.csv', 'w' , encoding ='utf-8-sig' , newline='') as filecsv:
                wr = csv.writer(filecsv)
                wr.writerow(['title','brands','price','offer','rate','rate_starts','productlink'])
                wr.writerows(exported)
    

    Why are you saving inside the loop? The only thing that should be in that loop is brands.append(brand[i].text). As written, you're exporting all the products all over again for every brand. That's most likely the reason for the duplicates, but there are several other issues with your method...
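A minimal sketch of that restructuring, using made-up stand-in lists instead of real scraped data: the loop only collects values, and the CSV is written a single time after the loop ends. The file name oo_demo.csv and the sample values are assumptions for illustration only.

```python
import csv
import os
import tempfile
from itertools import zip_longest

# Made-up stand-in data; the real lists come from the scraping loops.
titles = ['Product 1', 'Product 2', 'Product 3']
scraped_brands = ['Brand A', 'Brand B', 'Brand C']

brands = []
for name in scraped_brands:
    brands.append(name)  # the loop only collects values

# Export once, after the loop has finished.
out_path = os.path.join(tempfile.gettempdir(), 'oo_demo.csv')  # hypothetical file name
with open(out_path, 'w', encoding='utf-8-sig', newline='') as filecsv:
    wr = csv.writer(filecsv)
    wr.writerow(['title', 'brands'])
    # zip_longest pads a shorter column with 'non' instead of dropping rows
    wr.writerows(zip_longest(titles, brands, fillvalue='non'))
```

With the write moved out of the loop, the file is produced once per run, however many brands were found.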


            except: title='non'
    
            except: price = 'non'
    
            except: offer='non'
    
            except: rate = 'non'
    
            except: rate_start = 'non'
    

    You should be appending in except too, otherwise you'll have uneven and misaligned columns. Say product1 had no offer but product2 did: since you didn't append anything for product1's offer, product2's offer becomes the first item in the offers list. Then, when you line the columns up into rows, it'll look like product1 had that offer, since it sits in the first row, and so on for every later product.
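One way to guarantee exactly one append per product is to route every lookup through a small helper. The sketch below is only an illustration: text_or_default is a hypothetical helper name, and FakeTag is a stand-in for a bs4 Tag so the example runs without fetching any page. The offer value is made up.

```python
class FakeTag:
    """Stand-in for a bs4 Tag; only the .text attribute is imitated."""
    def __init__(self, text):
        self.text = text


def text_or_default(tag, default='non'):
    # soup.find(...) returns None when nothing matches, so test for that
    # instead of relying on a bare except.
    return tag.text.strip() if tag is not None else default


# Product 1 has no offer element, product 2 does (made-up value).
found_offers = [None, FakeTag(' 25 SAR ')]

offers = []
for tag in found_offers:
    offers.append(text_or_default(tag))  # always one entry per product
```

Here offers ends up with one entry per product, so the placeholder 'non' stays in product 1's row and product 2 keeps its own offer.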


        brand = productlist.find_all('h3', class_='brand-product mb-1')
        for i in range(len(brand)):
            brands.append(brand[i].text)
    

    You can just do brands = [brand.text for brand in soup.find_all('h3', class_='brand-product mb-1')] instead of appending in a loop (call find_all on the listing page's soup; productlist is a ResultSet, which has no find_all method). But it would be better to have a surer way of keeping the links and brands perfectly aligned. Why don't you just get the brand in the for link in productlink... loop?

        # for link in productlink:
            # r = requests.get(link , headers=headers)
            # soup = BeautifulSoup(r.content,'lxml')
    
            # brand = soup.find(lambda t: t.name=='a' and t.parent.name=='h2')
            brand = soup.select_one('h2>a')
            brands.append(brand.text.strip() if brand else 'non')
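Taking that idea one step further, every field for a product can be gathered in the same loop iteration and stored as one record, so the columns cannot drift apart. This is a rough sketch under stated assumptions, not the asker's actual scraper: scrape_product is a hypothetical stand-in that returns canned values where the real version would request the link and read the fields from the parsed page, and the example.com URLs are placeholders.

```python
import csv
import io


def scrape_product(link):
    # Hypothetical stand-in: a real version would fetch `link`, parse it,
    # and fall back to 'non' for any field that is missing on the page.
    fake_pages = {
        'https://example.com/p1': {'title': 'Product 1', 'brand': 'Brand A'},
        'https://example.com/p2': {'title': 'Product 2'},  # no brand found
    }
    page = fake_pages[link]
    return {
        'title': page.get('title', 'non'),
        'brand': page.get('brand', 'non'),
        'productlink': link,
    }


def rows_to_csv(rows, fieldnames):
    # Write all records in one pass; each dict becomes exactly one row.
    buf = io.StringIO()
    wr = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator='\n')
    wr.writeheader()
    wr.writerows(rows)
    return buf.getvalue()


links = ['https://example.com/p1', 'https://example.com/p2']
rows = [scrape_product(link) for link in links]
csv_text = rows_to_csv(rows, ['title', 'brand', 'productlink'])
```

Because each row is a dict keyed by column name, the header and the data cannot disagree. In the original code the header lists an offer column, but offers is missing from filelist, so every value after price would land one column too far left.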