Tags: python, json, web-scraping, urllib

Scrape links from many Google searches in Python


I want to scrape the first link that shows up on a Google search for each of 23,000 search terms and append the results to the dataframe I am using. This is the error I am getting:

Traceback (most recent call last):
  File "file.py", line 26, in <module>
    website = showsome(company)
  File "file.py", line 18, in showsome
    hits = data['results']
TypeError: 'NoneType' object has no attribute '__getitem__'

This is the code I have so far:

import json
import urllib
import pandas as pd

def showsome(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")

websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)
websites = pd.DataFrame(websites, columns=["Website"])

result = pd.concat([company_names,websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")

(I changed the name of the input and output files for privacy reasons)

Thank you!


Solution

  • I will just try to answer why this exception is raised:

    I see Google detects you and posts a nicely formatted error response, i.e.

    {u'responseData': None, u'responseDetails': u'Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors', u'responseStatus': 403}
    

    This is then assigned to results by the expression below:

    results = json.loads(search_results)
    

    So data = results['responseData'] is None, and hits = data['results'] raises the error, since data is None and NoneType has no __getitem__ attribute (it cannot be indexed).
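
    A minimal way to reproduce the exception from that error payload (a small sketch; the JSON string below is just the abuse response shown above, pasted as a literal):

    import json

    # The abuse response Google returns; JSON null is parsed as Python None
    search_results = '{"responseData": null, "responseStatus": 403}'
    results = json.loads(search_results)
    data = results['responseData']  # data is None
    hits = data['results']          # TypeError: 'NoneType' object has no attribute '__getitem__'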

    I tried using the random module (just a simple experiment) to simulate a real user by adding some random waits between requests. (I strongly advise against doing this if you do not have permission from Google. BTW, I used time.sleep(random.choice((1, 3, 3, 2, 4, 1, 0))) as below.)

    import json, random, time
    import urllib
    import pandas as pd
    
    def showsome(searchfor):
        # Build the query string and call the Google AJAX Search API
        query = urllib.urlencode({'q': searchfor})
        url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
        search_response = urllib.urlopen(url)
        search_results = search_response.read()
        results = json.loads(search_results)
        data = results['responseData']
        hits = data['results']
        # Return the display URL of the first hit
        d = hits[0]['visibleUrl']
        return d
    
    company_names = pd.read_csv("my_file.csv")
    
    websites = []
    for company in company_names["Company"]:
        website = showsome(company)
        websites.append(website)
        # Pause 0-4 seconds at random between requests to look less like a bot
        time.sleep(random.choice((1, 3, 3, 2, 4, 1, 0)))
        print website
    websites = pd.DataFrame(websites, columns=["Website"])
    
    result = pd.concat([company_names, websites], axis=1, join='inner')
    result.to_csv("export_file.csv", index=False, encoding="utf-8")
    

    It generates a CSV that contains:

    Company,Website
    American Axle,www.aam.com
    American Broadcasting Company,en.wikipedia.org
    American Eagle Outfitters,ae.com
    American Electric Power,www.aep.com
    American Express,www.americanexpress.com
    American Family Insurance,www.amfam.com
    American Financial Group,www.afginc.com
    American Greetings,www.americangreetings.com
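
    If you prefer to fail gracefully instead of crashing when Google blocks a request, a defensive variant of showsome could check responseData before indexing (a sketch, not part of the original answer; showsome_safe is a hypothetical name):

    import json
    import urllib

    def showsome_safe(searchfor):
        query = urllib.urlencode({'q': searchfor})
        url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
        results = json.loads(urllib.urlopen(url).read())
        data = results['responseData']
        # responseData is None when Google returns the 403 abuse response
        if data is None or not data.get('results'):
            return None
        return data['results'][0]['visibleUrl']

    Rows that come back as None can then be filtered out or retried later instead of aborting the whole 23,000-row run.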