python, html, web-scraping, beautifulsoup, html-parsing

How can I best isolate two different unlabeled pieces of HTML using Beautiful Soup and print them to a CSV?


To preface, I'm a python beginner and this is my first time using BeautifulSoup. Any input is greatly appreciated.

I'm attempting to scrape all the company names and email addresses from this site. There are 3 layers of links to crawl through (alphabetized pagination list -> company list by letter -> company detail page), and I'd then write the results to a CSV.

So far, I've been able to isolate the alphabetized list of links with the code below, but I'm stuck when attempting to isolate the different company pages and then extract the name/email from unlabeled HTML.

import re
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.indiainfoline.com/Markets/Company/A.aspx').read()
soup = BeautifulSoup(page)
soup.prettify()

pattern = re.compile(r'^\/Markets\/Company\/\D\.aspx$')

all_links = []
navigation_links = []
root = "http://www.indiainfoline.com/"

# Finding all links
for anchor in soup.findAll('a', href=True):
    all_links.append(anchor['href'])
# Isolate links matching regex
for link in all_links:
    if re.match(pattern, link):
        navigation_links.append(root + re.match(pattern, link).group(0))
navigation_links = list(set(navigation_links))

company_pages = []
for page in navigation_links:
    for anchor in soup.findAll('table', id='AlphaQuotes1_Rep_quote')[0].findAll('a', href=True):
        company_pages.append(root + anchor['href'])

Solution

  • Take it in pieces. First, getting the links to each individual company is easy:

    from bs4 import BeautifulSoup
    import requests
    
    html = requests.get('http://www.indiainfoline.com/Markets/Company/A.aspx').text
    bs = BeautifulSoup(html, 'html.parser')
    
    # find the links to companies
    company_menu = bs.find("div",{'style':'padding-left:5px'})
    # print all companies links
    companies = company_menu.find_all('a')
    for company in companies:
        print(company['href'])
    

    Second, get the company names:

    for company in companies:
        print(company.getText().strip())
    

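If you want to check the link-and-name extraction without hitting the live site, the same idea can be sketched with only the standard library's `html.parser` instead of Beautiful Soup (Python 3 syntax; the HTML fragment below is a made-up stand-in for the company menu):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current_href = dict(attrs).get('href')
            self._current_text = []

    def handle_data(self, data):
        # only collect text while inside an <a> tag
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current_href is not None:
            name = ''.join(self._current_text).strip()
            self.links.append((self._current_href, name))
            self._current_href = None

# Stand-in for a fragment of the real company menu
parser = LinkCollector()
parser.feed('<div><a href="/Markets/Company/Foo/1">  Foo Ltd </a></div>')

for href, name in parser.links:
    print(href, name)
```

The `.strip()` on the anchor text does the same cleanup as `getText().strip()` above.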
    Third, emails are a little more complicated, but you can use a regex here. On an individual company page, do the following:

    import re
    # example company page
    html = requests.get('http://www.indiainfoline.com/Markets/Company/Adani-Power-Ltd/533096').text
    EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")
    emails = re.findall(EMAIL_REGEX, html)
    # and there you have a list of the found emails
    ...
    
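Finally, to get this into a CSV as the question asks, the standard library's `csv` module will do. A minimal sketch that writes one (name, email) row per company, using an in-memory buffer and made-up data in place of the scraped pages:

```python
import csv
import io
import re

EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")

# Stand-in for the (company name, page HTML) pairs gathered while crawling
company_pages = [
    ('Example Co', '<a href="mailto:investor@example.com">Contact</a>'),
]

buf = io.StringIO()  # swap in open('companies.csv', 'w') to write a real file
writer = csv.writer(buf)
writer.writerow(['Company', 'Email'])
for name, page_html in company_pages:
    emails = re.findall(EMAIL_REGEX, page_html)
    # join multiple addresses into one cell if a page lists several
    writer.writerow([name, ', '.join(emails)])

print(buf.getvalue())
```

Note the regex only matches simple `name@domain.tld` addresses; subdomains like `a@mail.example.com` would need a broader domain pattern.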

    Cheers,