python web-scraping beautifulsoup html-table html-tableextract

How to scrape product information from a page with Beautiful Soup when the data is in an HTML table


import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://books.toscrape.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Collect the link to every product listed on the first page.
r = requests.get(baseurl, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
productlinks = []
titles = []
prices = []
for article in soup.find_all('article', class_='product_pod'):
    # Skip the first <a> (the thumbnail); keep the remaining product links.
    for link in article.find_all('a', href=True)[1:]:
        productlinks.append(baseurl + link['href'])

# Visit each product page and read its title and price.
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        title = soup.find('h3').text
    except AttributeError:  # soup.find() returned None: no <h3> on the page
        title = ' '
    titles.append(title)
    price = soup.find('p', class_='price_color').text.replace('£', '').replace(',', '').strip()
    prices.append(price)

# Build the frame from the full list of prices, not only the last scraped value.
df = pd.DataFrame({"Title": titles, "Price": prices})
print(df)

The above script works as expected, but I also want to scrape more information about each product, such as the UPC and product type. For example, for this single page, https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html, I want to scrape the UPC, product type, etc. All of this information is in the Product Information section of the page.


Solution

  • You can use the start= parameter in the URL to get the next pages:

    import requests
    from bs4 import BeautifulSoup
    
    for page in range(0, 10):  # <-- increase number of pages here
        r = requests.get(
            "https://pk.indeed.com/jobs?q=&l=Lahore&start={}".format(page * 10)
        )
        soup = BeautifulSoup(r.content, "html.parser")
        title = soup.find_all("h2", class_="jobTitle")
    
        for i in title:
            print(i.text)
    
    

    Prints:

    Data Entry Work Online
    newAdmin Assistant
    newNCG Agent
    Data Entry Operator
    newResearch Associate Electrical
    Administrative Assistant (Executive Assistant)
    Admin Assistant Digitally
    newIT Officer (Remote Work)
    OFFICE ASSISTANT
    Cash Officer - Lahore Region
    newDeputy Manager Finance
    Admin Assistant
    Lab Assistant
    newProduct Portfolio & Customer Service Specialist
    Front Desk Officer
    newRelationship Manager, Recovery
    MANAGEMENT TRAINEE PROGRAM
    Email Support Executive (International)
    Data Entry Operator
    Admin officer
    
    ...and so on.
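
For the product-page part of the question (UPC, product type and the other fields), a minimal sketch could look like the one below. It assumes the Product Information section is the first <table> on the page and that each row holds one <th> label and one <td> value; that layout is an assumption about the site's markup, and the function names parse_product_info and scrape_product are hypothetical helpers, not part of Beautiful Soup.

```python
import requests
from bs4 import BeautifulSoup


def parse_product_info(html):
    """Return the rows of the first <table> in `html` as a {label: value} dict.

    Assumption: every relevant row is <tr><th>label</th><td>value</td></tr>.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    info = {}
    if table is None:  # the page has no table at all
        return info
    for row in table.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        # Skip rows that do not follow the label/value layout.
        if label is not None and value is not None:
            info[label.get_text(strip=True)] = value.get_text(strip=True)
    return info


def scrape_product(url):
    """Download one product page and parse its information table."""
    r = requests.get(url)
    r.raise_for_status()  # fail loudly on HTTP errors
    return parse_product_info(r.content)
```

Inside the loop over productlinks from the question, each result of scrape_product(link) could be appended to a list, and pd.DataFrame can build a frame from that list of dicts with one column per label. If the table turns out to have a different structure, only parse_product_info would need to change.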