I am interacting with a search engine programmatically and I need to trick it into thinking that I am a human making queries, as opposed to a robot. This involves generating queries that an ordinary user might plausibly search for, like "ncaa football schedule" or "When was the lunar landing?" I'll be making over a thousand of these queries daily, and searching for random words out of a dictionary won't cut it, since that's not a very typical search habit.
So far I have thought of a few ways to generate realistic queries:
The latter approach sounds like it would involve a lot of reverse engineering. And with the former approach, I've been unable to find a list of more than 80 or so queries; the only sources I've found are AOL trends (50-100) and Google Trends (30).
How might I go about generating a large set of human-like search phrases?
(For any language-dependent answers: I'm programming in Python)
Although this most likely breaks Google's TOS, you can scrape the autocomplete data easily:
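import json

import requests


def autocomplete(query, depth=1, lang='en'):
    """Yield autocomplete suggestions for query, recursing depth levels deep."""
    if depth == 0:
        return
    response = requests.get('https://clients1.google.com/complete/search', params={
        'client': 'hp',
        'hl': lang,
        'q': query
    }).text
    # The JSON payload arrives wrapped in a function call; keep only what is
    # between the first '(' and the final ')'.
    data = response[response.index('(') + 1:-1]
    o = json.loads(data)
    for result in o[1]:
        # Suggestions come back with <b> highlighting tags; remove them.
        suggestion = result[0].replace('<b>', '').replace('</b>', '')
        yield suggestion
        if depth > 1:
            # Feed each suggestion back in to get the next level of queries.
            for s in autocomplete(suggestion, depth - 1, lang):
                yield s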
Iterating over autocomplete('a', depth=2) gives you the top 110 queries that start with a (with some duplicates). Scrape each letter to a depth of 2, and you should have a ton of legitimate queries to choose from.
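To deal with the duplicates when scraping every letter, something like the following might work. It is only a sketch: collect_queries and its suggest parameter are names made up for illustration, and the lowercase/whitespace normalisation is an assumption about what should count as "the same query". The suggestion source is passed in as a callable, so the de-duplication logic can be tried without hitting the network.

```python
import string


def collect_queries(suggest, seeds=string.ascii_lowercase):
    """Gather suggestions for every seed, dropping duplicates but keeping order.

    suggest is a hypothetical hook: any callable that takes a seed string
    and returns an iterable of suggestion strings.
    """
    seen = set()
    queries = []
    for seed in seeds:
        for suggestion in suggest(seed):
            # Assumption: queries differing only in case or spacing are duplicates.
            key = " ".join(suggestion.lower().split())
            if key and key not in seen:
                seen.add(key)
                queries.append(key)
    return queries
```

With the generator above, collect_queries(lambda seed: autocomplete(seed, depth=2)) would make the real requests, and random.choice on the resulting list would then pick a query to send. How many unique queries survive depends on how much the suggestions overlap.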