nlp search-engine google-search bing

Where can I find a corpus of search engine queries?


I'm interested in training a question-answering system on top of user-generated search queries, but so far I haven't found any such data publicly available. Are there research centers or industry labs that have compiled corpora of search-engine queries?


Solution

  • There are a couple of datasets like this:

    Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l

    Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data (part of a Kaggle competition; you can sign up and download the data).

    There are also the AOL and MSN query logs, which were released as part of shared tasks over the past 10 years. I'm not sure whether they are still public, but they are worth looking into.