import urllib
import re
stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']
for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
    htmltext = htmlfile.read(htmlfile)
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]
I guess the problem is in regex1:
regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
I tried reading the documentation but was unable to figure it out.
In this program I am trying to scrape the stock name and stock price, given a list of stock symbols as input.
What I think I am doing wrong is passing two (.+?) groups in one pattern, which seems incorrect.
Output:
Traceback (most recent call last):
  File "C:\Py\stock\stocks.py", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat
^ matches the start of a string, and a ? directly after it is not legal in a regex, because there is nothing for the ? to repeat. If you change your regex to regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>' it should work. Note that you also had one parenthesis too many.
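As a rough check of the point above, the sketch below compiles both versions of the pattern and runs the corrected one against a sample string. The HTML fragment and its id suffix are made up for illustration and are not taken from the real Yahoo page. It also shows that two groups in one pattern are fine: re.findall then returns a list of tuples, one item per group.

```python
import re

# Hypothetical HTML fragment; the id suffix and the heading text are
# assumptions for illustration, not real Yahoo Finance markup.
sample_html = '<h2 id="yui_3_9_1_9_1400000000000_42">Apple Inc. (AAPL)</h2>'

broken = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
fixed = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>'

# The broken pattern fails at compile time, before any matching happens.
try:
    re.compile(broken)
    broken_error = None
except re.error as exc:
    broken_error = str(exc)

# With two capture groups, findall returns a list of (group1, group2) tuples.
matches = re.findall(fixed, sample_html)
id_suffix, name = matches[0]
```

Since only the name is needed, the first group could also be written as a non-capturing group, (?:.+?), so that findall returns plain strings again.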
Furthermore, there is a better way to get Yahoo's stock information. You can query a lot of tables (including stock info) with YQL, and there is also a YQL Console where you can try out your queries.
The result you get from there is JSON or XML, which can be handled pretty well with Python's standard libraries.
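To give an idea of how little code the JSON route needs, here is a minimal sketch using the standard json module. The response text, its nesting, and the field names (query, results, quote, Name, LastTradePriceOnly) are assumptions made up for this example, as is the price value; check the actual output in the YQL Console before relying on any of them.

```python
import json

# Hypothetical response body. The structure, field names and values are
# assumptions for illustration, not a documented schema or real data.
response_text = (
    '{"query": {"results": {"quote": '
    '{"symbol": "AAPL", "Name": "Apple Inc.", "LastTradePriceOnly": "100.00"}}}}'
)

# json.loads turns the JSON text into nested dicts and lists.
data = json.loads(response_text)
quote = data["query"]["results"]["quote"]

summary = "Price of {0} {1} is {2}".format(
    quote["symbol"], quote["Name"], quote["LastTradePriceOnly"]
)
print(summary)
```

No regular expressions are needed here, so a change in the page's HTML ids would not break the lookup.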