[SOLVED] Using Regex to get multiple data on single line by scraping stocks from yahoo

import urllib
import re

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
    htmltext = htmlfile.read(htmlfile)
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]

I suspect the problem is in regex1:

regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'

I tried reading the documentation but was unable to figure it out.

In this program I am trying to scrape the stock name and the stock price, given a list of stock symbols as input.

I think my mistake is using two (.+?) groups in a single pattern, which seems incorrect.

Output:

Traceback (most recent call last):
  File "C:\Py\stock\stocks.py", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat

Solution

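^ matches the start of a string, so a ? directly after it has nothing to repeat and is not a legal regex; that is what raises the error. If you change your regex to regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>' it should work. Note that you also had one closing parenthesis too many.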
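
As a minimal sketch of the fix described above (replacing .^? with .+? and dropping the extra parenthesis), the snippet below runs the pattern against a made-up HTML fragment rather than a live Yahoo page. The id suffix and the company name in sample_html are assumptions for illustration only. It also shows that two groups in one pattern are fine: findall then returns a list of tuples, one item per group.

```python
import re

# Hypothetical HTML fragment; the id suffix and name are made up for illustration.
sample_html = '<h2 id="yui_3_9_1_9_1400000000000_40">Apple Inc. (AAPL)</h2>'

# The original pattern does not compile: "?" has nothing to repeat after "^".
try:
    re.compile('<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>')
    broken_error = None
except re.error as exc:
    broken_error = str(exc)

# Corrected pattern: ".+?" instead of ".^?", with balanced parentheses.
pattern1 = re.compile(r'<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>')

# With two groups, findall returns a list of (group1, group2) tuples.
matches = pattern1.findall(sample_html)
id_suffix, name = matches[0]
print(name)
```

Since only the second group holds the stock name, the first group could also be written as a non-capturing group, (?:.+?), so that findall returns plain strings.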

Furthermore, there is a better way to get Yahoo's stock information. You can query a lot of tables (including stock info) with YQL, and there is also a YQL console where you can try out your queries.

The result you get from there is JSON or XML, both of which can be handled quite well with Python libraries.
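
To illustrate the JSON route, here is a small sketch using the standard json module. The payload and its field names (quotes, symbol, name, price) are hypothetical placeholders, not the actual YQL response layout, so the keys would need adjusting to whatever the service really returns.

```python
import json


def summarize_quotes(response_text):
    """Return one summary line per quote found in a JSON document."""
    data = json.loads(response_text)
    return [
        "Price of %s %s is %s" % (q["symbol"], q["name"], q["price"])
        for q in data["quotes"]
    ]


# Hypothetical payload; the structure and values are placeholders only.
sample_response = '{"quotes": [{"symbol": "AAPL", "name": "Apple Inc.", "price": "100.00"}]}'

for line in summarize_quotes(sample_response):
    print(line)
```

Parsing structured data this way avoids depending on generated element ids in the page markup, which can change without notice.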