
How to build human-like search-engine queries?


I am interacting with a search engine programmatically, and I need to make it think that I am a human making queries rather than a robot. This means generating queries that an ordinary user might plausibly search for, like "ncaa football schedule" or "When was the lunar landing?" I'll be making over a thousand of these queries daily, and searching for random words out of a dictionary won't cut it, since that's not a typical search habit.

So far I have thought of a few ways to generate realistic queries:

The latter approach sounds like it would involve a lot of reverse engineering. And with the former approach, I've been unable to find a list of more than 80-or-so queries - the only sources I've found are AOL trends (50-100) and Google Trends (30).

How might I go about generating a large set of human-like search phrases?
(For any language-dependent answers: I'm programming in Python)


Solution

  • Although this most likely breaks Google's TOS, you can scrape the autocomplete data easily:

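    import json
    
    import requests
    
    def autocomplete(query, depth=1, lang='en'):
        """Yield autocomplete suggestions for query, recursing depth levels deep."""
        if depth == 0:
            return
    
        response = requests.get('https://clients1.google.com/complete/search', params={
            'client': 'hp',
            'hl': lang,
            'q': query
        }).text
    
        # The JSON payload is wrapped in a function call: keep only what is
        # between the first '(' and the last ')'.
        data = response[response.index('(') + 1:response.rindex(')')]
        o = json.loads(data)
    
        # o[1] holds the suggestions; the first element of each entry is the
        # suggestion text, which may contain <b> markup.
        for result in o[1]:
            suggestion = result[0].replace('<b>', '').replace('</b>', '')
            yield suggestion
    
            # Feed each suggestion back in to go one level deeper.
            if depth > 1:
                yield from autocomplete(suggestion, depth - 1, lang)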
    

    autocomplete is a generator, so list(autocomplete('a', depth=2)) gives you the top 110 queries that start with "a" (with some duplicates). Scrape each letter to a depth of 2, and you should have plenty of legitimate queries to choose from.
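Since the results contain duplicates and you only need a thousand or so queries per day, one way to tie this together is to collect the suggestions for every letter, de-duplicate them, and draw a random sample for each run. The following is only a rough sketch: `collect_queries`, `daily_sample`, and `fake_fetch` are hypothetical names that are not part of the answer above, and `fake_fetch` is an offline stand-in used so the example runs without touching the network.

```python
import random
import string


def collect_queries(fetch, seeds=string.ascii_lowercase, depth=2):
    """Gather suggestions for every seed, dropping duplicates but keeping order.

    fetch is any callable taking (seed, depth) and returning an iterable
    of suggestion strings (an assumption made for this sketch).
    """
    seen = set()
    queries = []
    for seed in seeds:
        for suggestion in fetch(seed, depth):
            text = suggestion.strip()
            key = text.lower()  # treat case-only variants as duplicates
            if key and key not in seen:
                seen.add(key)
                queries.append(text)
    return queries


def daily_sample(queries, count, rng=random):
    """Pick up to count distinct queries at random for one day's run."""
    return rng.sample(queries, min(count, len(queries)))


def fake_fetch(seed, depth):
    """Offline stand-in for a real suggestion source (illustration only)."""
    yield seed + " weather"
    yield seed + " Weather"  # duplicate that differs only in case
    yield seed + " news"


pool = collect_queries(fake_fetch, seeds="ab")
print(pool)
print(daily_sample(pool, 3))
```

To use it with the scraper above, you could pass something like `lambda seed, depth: autocomplete(seed, depth=depth)` as `fetch`. Taking the fetcher as a parameter keeps the de-duplication and sampling logic testable without making any requests.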