python, web-scraping, beautifulsoup, python-requests, html5lib

Scraping a table from a website using Python, and trying to get the hyperlink of a cell along with its text


I am learning Python and trying to scrape a table from https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html. As you can see, this table has 4 columns: "CIN", "Company Name", "Roc" and "Status". Since "Company Name" is a hyperlink, I need 5 columns: "CIN", "Company Name", "Company Link", "Roc" and "Status". I wrote the code below, but I get only 4 columns, and instead of "Company Link" I get a different result. I am sharing a screenshot of my output CSV file.

Please help me scrape this table into 5 columns: "CIN", "Company Name", "Company Link", "Roc" and "Status". Here is my code; please also find the image of my output CSV file below.

import csv
import requests
from bs4 import BeautifulSoup
import re
import html5lib

def find_between(s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

loop = 1
while(True):
    try:
        URL = "https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-" + str(loop) + "-company.html"
        loop=loop+1
        r = requests.get(URL)
        soup = BeautifulSoup(r.content, 'html5lib')
        tbody = soup.find('tbody')
        rows = tbody.find_all('tr')
        row_list = list()
        for tr in rows:
            row=[]
            td = tr.find_all('td')
            for a in td:
                href=a.find('a',href=True)
                if href==None:
                    row.append(a.text.strip())
                    print(a.text.strip())
                else:
                    linktext = href.__getitem__
                    row.append(linktext)
            row_list.append(row)
        with open('zaubadata.csv', 'a') as csvFile:
            writer = csv.writer(csvFile)
            for r in row_list:
                writer.writerow(r)
    except Exception as obj:
        print(obj)
        csvFile.close()
        break




[![result of above code in 4 columns][1]][1]


  [1]: https://i.sstatic.net/oUVLK.png
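The values in the screenshot come from `linktext = href.__getitem__`: this takes the bound `__getitem__` method of the tag instead of calling it, so the method object itself (not the link) ends up in the CSV. A minimal stdlib sketch of the difference, using a plain dict as a stand-in for a bs4 `Tag` (which supports the same dict-style `tag['href']` indexing); the link value here is made up:

```python
# Stand-in for a bs4 Tag: both support dict-style indexing by attribute name.
tag = {'href': '/company/EXAMPLE-LTD'}   # hypothetical link for illustration

broken = tag.__getitem__   # a bound method object, not the link
fixed = tag['href']        # the actual href string

print(callable(broken))    # True  -- it is a method, which is what got written to the CSV
print(fixed)               # /company/EXAMPLE-LTD
```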

Solution

  • This script iterates over all pages and writes columns "CIN", "Company Name", "Company Link", "Roc" and "Status" into data.csv:

    import csv
    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html'
    
    page = 1
    all_data = []
    while True:
        soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
    
        rows = soup.select('#table tr:has(td)')
    
        if not rows:
            break
    
        for tr in rows:
            all_data.append([td.get_text(strip=True) for td in tr.select('td')])
            all_data[-1].insert(2, tr.a['href'])
            print(all_data[-1])
    
        page += 1
    
    with open('data.csv', 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(["CIN", "Company Name", "Company Link", "Roc", "Status"])
        for row in all_data:
            csv_writer.writerow(row)
    

    Outputs data.csv (screenshot from LibreOffice):

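The same insert-the-link-as-a-third-column idea can also be sketched with only the standard library's `html.parser`, in case bs4 is not available. The HTML snippet below is made up for illustration; the real page would be fetched as in the script above:

```python
from html.parser import HTMLParser

# Hypothetical sample of one table row, for illustration only.
SAMPLE = """
<table id="table">
  <tr><th>CIN</th><th>Company Name</th><th>Roc</th><th>Status</th></tr>
  <tr><td>U12345</td><td><a href="/company/acme">ACME LTD</a></td>
      <td>RoC-Delhi</td><td>Active</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects one list per <tr>: cell texts, with the first link's href spliced in at index 2."""
    def __init__(self):
        super().__init__()
        self.rows, self.row = [], None
        self.in_td, self.href, self.text = False, None, []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row, self.href = [], None
        elif tag == 'td':
            self.in_td, self.text = True, []
        elif tag == 'a' and self.in_td and self.href is None:
            self.href = dict(attrs).get('href')   # keep the first link in the row

    def handle_data(self, data):
        if self.in_td:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.row.append(''.join(self.text).strip())
            self.in_td = False
        elif tag == 'tr':
            if self.row:                          # header row has only <th>, so it stays empty
                if self.href:
                    self.row.insert(2, self.href)
                self.rows.append(self.row)
            self.row = None

parser = TableParser()
parser.feed(SAMPLE)
print(parser.rows)  # [['U12345', 'ACME LTD', '/company/acme', 'RoC-Delhi', 'Active']]
```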