python, web, web-scraping, yahoo

Web Scraping: Yahoo provides a dirty URL instead of the normal URL


I'm using mechanize to fetch the top results from a Yahoo search and scrape data from them, but Yahoo only provides dirty URLs (redirect links), which cause errors in further processing. Is there a way to obtain the original link?

Example: for the result stackoverflow.com, I get the following tag:

<a dirtyhref="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" id="link-1" class="yschttl spt" href="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" target="_blank" data-bk="5054.1"> <b>Stack Overflow</b> - Official Site </a>

So here http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-

represents http://stackoverflow.com


Solution

  • Assuming you can easily isolate the content of dirtyhref (you can use BeautifulSoup to parse the link, http://www.crummy.com/software/BeautifulSoup/bs4/doc/), you can use the urlparse module to get only the path (https://docs.python.org/2/library/urlparse.html#urlparse.urlparse). You will then have it in a string like:

    dirty_href = "/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-"
    

    The fields appear to be separated by /, so you can split on it:

    fields = dirty_href.split('/')
    

    Assuming that the field you are interested in is always the sixth (index 5, since the leading / produces an empty first element):

    dirty_url = fields[5].split('=', 1)[1]
    

    Finally, you can use unquote from the urllib module (https://docs.python.org/2/library/urllib.html#urllib.unquote; in Python 3 it is urllib.parse.unquote) to decode the percent-escapes:

    >>> import urllib
    >>> urllib.unquote(dirty_url)
    'http://stackoverflow.com/'
    

    Rather than assuming that the URL is always in the sixth field, you can also loop over the fields and pick the one that starts with RU=.
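
Putting the steps together, here is a minimal sketch that looks for the RU= field instead of relying on its position. It assumes Python 3, where urlparse and unquote both live in urllib.parse (in Python 2 they are in the urlparse and urllib modules). The function name extract_target_url is only illustrative, and the sketch assumes that Yahoo keeps the /-separated KEY=value layout shown in the question.

```python
from urllib.parse import urlparse, unquote


def extract_target_url(redirect_url):
    """Return the decoded RU= target of a Yahoo redirect link, or None if absent."""
    # Keep only the path; the redirect fields are separated by '/'.
    path = urlparse(redirect_url).path
    for field in path.split('/'):
        # The original link is stored percent-encoded in the RU= field.
        if field.startswith('RU='):
            return unquote(field[len('RU='):])
    return None


# The dirtyhref value from the question.
dirty_href = (
    "http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
    "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
    "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
    "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-"
)
print(extract_target_url(dirty_href))  # http://stackoverflow.com/
```

Returning None when no RU= field is found lets the caller fall back to the plain href for links that are not redirects.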