I'm using BeautifulSoup to extract the revenue figure from Wikipedia infoboxes. My code works when the revenue text is inside an 'a' tag, but not every page wraps it in one. Some pages place the revenue text after 'span' tags, for example. What is the best / safest way to get the revenue text for a list of companies? Would searching for another tag in place of 'a' work best, or something else? Thanks for your help.
from bs4 import BeautifulSoup
import urllib
import re

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']
for c in company:
    r = urllib.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")
    rev = re.compile('^Revenue')
    # Find the infobox header cell whose text starts with "Revenue"
    thRev = [e for e in soup.find_all('th', {'scope': 'row'}) if rev.search(e.text)][0]
    tdRev = thRev.find_next('td')
    revenue = tdRev.find_all('a')
    # Print only the text of the first link in the revenue cell
    for f in revenue:
        print c + " " + f.text
        break
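Rather than depending on a particular tag, you can search the text of each table row with a regular expression. You can try: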
from bs4 import BeautifulSoup
import urllib
import re

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']
for c in company:
    r = urllib.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")
    for tr in soup.findAll('tr'):
        trText = tr.text
        # re.MULTILINE lets ^ and $ match a "Revenue" line inside the row text
        if re.search(r"^\bRevenue\b$", trText, re.MULTILINE):
            # Currency prefix, optional whitespace, number, then a unit word
            match = re.search(r"\w+\$(?:\s+)?[\d\.]+.{1}\w+", trText)
            if match:  # skip rows where no amount is found
                revenue = match.group()
                print c+"\n"+revenue+"\n"
Output:
Lockheed_Martin
US$ 46.132 billion

Phillips_66
US$ 161.21 billion

ConocoPhillips
US$55.52 billion

Sysco
US$44.41 Billion

Baker_Hughes
US$ 22.364 billion
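To check the regular expression without hitting Wikipedia, here is a minimal sketch in Python 3 syntax that applies the same amount pattern to a few row texts. The sample strings and the helper name `extract_revenue` are hypothetical, and the `re.MULTILINE` flag is an assumption made here so that `^` and `$` can match a "Revenue" line in the middle of the row text.

```python
import re

# Same amount pattern as in the answer: a currency prefix ending in "$",
# optional whitespace, a number, one separator character, then a unit word.
AMOUNT = re.compile(r"\w+\$(?:\s+)?[\d\.]+.{1}\w+")
# A line consisting only of "Revenue" somewhere in the row text
# (re.MULTILINE is an assumption, see the note above).
LABEL = re.compile(r"^\bRevenue\b$", re.MULTILINE)


def extract_revenue(row_text):
    """Return the first revenue-like amount in row_text, or None."""
    if not LABEL.search(row_text):
        return None
    match = AMOUNT.search(row_text)
    return match.group() if match else None


# Hypothetical row texts, shaped like tr.text for an infobox row.
samples = [
    "\nRevenue\nUS$ 46.132 billion (2015)\n",
    "\nRevenue\nUS$55.52 billion (2014)\n",
    "\nNet income\nUS$ 3.605 billion (2015)\n",
]
for text in samples:
    print(extract_revenue(text))
```

Returning None for rows without a "Revenue" label or without an amount keeps the caller from failing on `match.group()` when a page is laid out differently.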
Note: You may want to use the Wikipedia API instead, e.g.:
https://en.wikipedia.org/w/api.php?action=query&titles=Baker_Hughes&prop=revisions&rvprop=content&format=json
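A rough sketch of how the API route might look, in Python 3 syntax. The function names, the User-Agent string and the sample wikitext are all hypothetical. It also assumes that the infobox stores the figure in a `| revenue = ...` line and that the JSON response uses the legacy layout `query -> pages -> <page id> -> revisions[0]["*"]`, so check both against a real response before relying on it.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a query URL with the same parameters as the one in the note."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def revenue_from_wikitext(wikitext):
    """Return the raw value of an infobox 'revenue' line, or None."""
    match = re.search(r"^[ \t]*\|[ \t]*revenue[ \t]*=[ \t]*(.+)$",
                      wikitext, re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None


def fetch_wikitext(title):
    """Download a page's wikitext (needs network access; not run below)."""
    # Hypothetical User-Agent value; replace it with your own contact details.
    request = Request(build_query_url(title),
                      headers={"User-Agent": "revenue-example/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    # Assumed legacy response layout, see the note above.
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]


# Hypothetical wikitext fragment, used instead of a live request.
sample = "{{Infobox company\n| name = Example\n| revenue = US$ 22.364 billion\n}}"
print(build_query_url("Baker_Hughes"))
print(revenue_from_wikitext(sample))
```

The value returned by `revenue_from_wikitext` is raw wikitext, so it may still contain templates or references that need stripping before use.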