I'm using mechanize to get the top results from Yahoo search and scrape data from them, but Yahoo provides only dirty URLs, which cause errors in further processing. Is there a way to obtain the original link?
Example: for the result stackoverflow.com, I get the following tag:
<a dirtyhref="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" id="link-1" class="yschttl spt" href="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" target="_blank" data-bk="5054.1"> <b>Stack Overflow</b> - Official Site </a>
This represents http://stackoverflow.com.
Assuming that you can easily isolate the content of dirtyhref (you can use BeautifulSoup to parse the link, http://www.crummy.com/software/BeautifulSoup/bs4/doc/), you can use the urlparse package to get only the path (https://docs.python.org/2/library/urlparse.html#urlparse.urlparse). You'll then have it in a string like:
dirty_href = "/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-"
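As a rough sketch of the step above, here is one way to isolate dirtyhref and reduce it to its path. It assumes Python 3, where urlparse lives in urllib.parse, and it swaps in the standard library's html.parser for BeautifulSoup so the example has no third-party dependency. The DirtyHrefCollector class is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class DirtyHrefCollector(HTMLParser):
    """Hypothetical helper: collects the dirtyhref attribute of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.dirty_hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            value = dict(attrs).get("dirtyhref")
            if value:
                self.dirty_hrefs.append(value)


# The tag from the question, reduced to the attributes that matter here.
html = (
    '<a dirtyhref="http://r.search.yahoo.com/'
    "_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
    "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
    "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
    '/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" id="link-1">'
    "<b>Stack Overflow</b> - Official Site </a>"
)

collector = DirtyHrefCollector()
collector.feed(html)

# Keep only the path component; scheme and host are dropped.
path = urlparse(collector.dirty_hrefs[0]).path
print(path)
```

With BeautifulSoup, the collector would be replaced by a lookup of the dirtyhref attribute on each a tag; the urlparse step stays the same.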
The fields appear to be separated by /, so you can split on it:
fields = dirty_href.split('/')
Assuming that the field you are interested in is always the sixth (the first element is an empty string, because the path starts with /):
dirty_url = fields[5].split('=', 1)[1]
Finally, you can use unquote from the urllib package (https://docs.python.org/2/library/urllib.html#urllib.unquote):
>>> import urllib
>>> urllib.unquote(dirty_url)
'http://stackoverflow.com/'
Alternatively, rather than assuming that the URL is always in the sixth field, you can loop over fields and pick the one that starts with RU=.
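The loop described above could look something like the following. This is a minimal sketch assuming Python 3, where unquote lives in urllib.parse (on Python 2 the equivalent is urllib.unquote). The function name extract_target_url is made up for illustration.

```python
from urllib.parse import unquote


def extract_target_url(path):
    """Return the decoded value of the RU= field, or None if there is none."""
    for field in path.split("/"):
        if field.startswith("RU="):
            # Drop the "RU=" prefix, then decode the percent-escapes.
            return unquote(field[len("RU="):])
    return None


dirty_href = (
    "/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
    "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
    "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
    "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-"
)

print(extract_target_url(dirty_href))  # http://stackoverflow.com/
```

Returning None when no RU= field is present lets the caller skip results that do not follow this layout instead of failing with an IndexError.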