python web-scraping beautifulsoup wikipedia infobox

What's the best way to extract specific text from Wikipedia's Infobox using BeautifulSoup and lists?


I'm using BeautifulSoup to extract the revenue figure from Wikipedia infoboxes. My code works when the revenue text is inside an 'a' tag, but not every page marks it up that way; on some pages the text follows 'span' tags, for example. What is the safest way to get the revenue text for a list of companies? Should I look for a different tag in place of 'a', or take another approach? Thanks for your help.

import re
import urllib.request

from bs4 import BeautifulSoup

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")

    rev = re.compile('^Revenue')
    thRev = [e for e in soup.find_all('th', {'scope': 'row'}) if rev.search(e.text)][0]
    tdRev = thRev.find_next('td')
    revenue = tdRev.find_all('a')

    # Print only the first <a> inside the revenue cell
    for f in revenue:
        print(c + " " + f.text)
        break

Solution

  • You can try:

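    import re
    import urllib.request

    from bs4 import BeautifulSoup

    company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

    for c in company:
        r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
        soup = BeautifulSoup(r, "lxml")
        for tr in soup.find_all('tr'):
            trText = tr.text
            # re.MULTILINE lets ^ and $ anchor at line boundaries, so a row
            # containing a line that is exactly "Revenue" matches
            if re.search(r"^\bRevenue\b$", trText, re.MULTILINE):
                # Currency prefix ending in "$", optional whitespace,
                # a number, one separator character, then a magnitude word
                match = re.search(r"\w+\$\s*[\d.]+.\w+", trText)
                if match:  # skip rows whose amount is formatted differently
                    print(c + "\n" + match.group() + "\n")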
    

    Output:

    Lockheed_Martin
    US$ 46.132 billion
    Phillips_66
    US$ 161.21 billion
    ConocoPhillips
    US$55.52 billion
    Sysco
    US$44.41 Billion
    Baker_Hughes
    US$ 22.364 billion
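
Since the question is whether to swap 'a' for another tag, here is a minimal sketch of a tag-agnostic alternative: take the whole 'td' that follows the "Revenue" header and flatten it with get_text(), so it does not matter whether the figure sits in an 'a', a 'span', or bare text. The HTML fragment, the extract_revenue name, and the AMOUNT pattern are made up for illustration; real infobox markup varies from page to page, so treat this as a starting point and not a definitive implementation.

```python
import re

from bs4 import BeautifulSoup

# Made-up infobox fragment for illustration; real Wikipedia markup varies.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th scope="row">Industry</th><td><a href="/wiki/Oil">Oil and gas</a></td></tr>
  <tr><th scope="row">Revenue</th>
      <td><span class="nowrap">US$ 22.364 billion</span> (2014)</td></tr>
</table>
"""

# Assumed pattern: currency prefix ending in "$", optional whitespace,
# a number, optional whitespace, then a magnitude word such as "billion".
AMOUNT = re.compile(r"\w*\$\s*[\d.,]+\s*\w+")


def extract_revenue(html):
    """Return the revenue string from an infobox, or None if there is no Revenue row."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th", {"scope": "row"}):
        if th.get_text(strip=True).startswith("Revenue"):
            td = th.find_next("td")
            if td is None:
                return None
            # get_text() flattens <a>, <span>, and bare text nodes alike.
            text = td.get_text(" ", strip=True)
            match = AMOUNT.search(text)
            # Fall back to the full cell text if the pattern does not match.
            return match.group() if match else text
    return None


print(extract_revenue(SAMPLE_HTML))
```

To use it on live pages, pass the downloaded page source to extract_revenue() in place of SAMPLE_HTML. Returning None when the row is missing lets the caller decide how to handle companies without a revenue entry.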
    

    Note: You may want to use the Wikipedia API instead, e.g.:

    https://en.wikipedia.org/w/api.php?action=query&titles=Baker_Hughes&prop=revisions&rvprop=content&format=json
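
A rough sketch of how the API route might look: build the query URL shown above, then pull the "revenue" field out of the returned wikitext. The helper names, the sample payload, and the field pattern are assumptions for illustration. In particular, the sketch assumes the response nests the page source under query, pages, revisions and a "*" key, so check an actual response before relying on that shape.

```python
import json
import re
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_url(title):
    """Build the query URL for a page's current wikitext."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return API + "?" + urlencode(params)


# Assumed infobox field layout: "| revenue = <value>" on a single line.
REVENUE_FIELD = re.compile(
    r"^[ \t]*\|[ \t]*revenue[ \t]*=[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE
)


def revenue_from_response(payload):
    """Return the raw revenue field from a JSON response string, or None."""
    data = json.loads(payload)
    for page in data.get("query", {}).get("pages", {}).values():
        for rev in page.get("revisions", []):
            # Assumption: the wikitext is stored under the "*" key.
            match = REVENUE_FIELD.search(rev.get("*", ""))
            if match:
                return match.group(1).strip()
    return None


# Made-up payload mimicking the assumed response shape.
SAMPLE = json.dumps({"query": {"pages": {"1": {"title": "Example", "revisions": [
    {"*": "{{Infobox company\n| name = Example\n| revenue = US$ 22.364 billion (2014)\n}}"}
]}}}})

print(build_url("Baker_Hughes"))
print(revenue_from_response(SAMPLE))
```

On real pages the field value is raw wikitext, so it may contain templates or link markup that needs further cleanup before it is usable.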