python, nlp, n-gram

Python NLP: Google ngram API


I'm working on a Python NLP task where I need to prune non-technical, very common noun phrases from a noisy list of noun phrases. Here is an example:

["people", "US presidents", "New York City", "electric cars", "vegan food", "the best"]

I need to prune out "people" and "the best". I want to do this with an ngram dataset: the frequency of "people" and "the best" is much higher than that of any other noun phrase in the list, so they can be labeled as outliers and pruned. The Google Ngram dataset is well suited for this purpose:

import requests

url = "https://books.google.com/ngrams/json"

noun_phrase = "electric cars"  # or a comma-separated string of phrases

query_params = {
    "content": noun_phrase,
    "year_start": 2017,
    "year_end": 2019,
    "corpus": 26,                # 26 = the English (2019) corpus
    "smoothing": 1,
    "case_insensitive": "true",  # lowercase string, so it serializes as ?case_insensitive=true
}
response = requests.get(url=url, params=query_params)
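
As far as I can tell from this undocumented endpoint, the response is a JSON list of objects, each carrying the matched ngram and a "timeseries" of yearly relative frequencies, so averaging the timeseries gives one score per phrase. A sketch of the pruning step I have in mind (the median-based cutoff below is arbitrary):

import statistics
import requests

URL = "https://books.google.com/ngrams/json"

def phrase_frequency(phrase):
    """Mean 2017-2019 relative frequency of phrase; 0.0 if the corpus has no match."""
    params = {
        "content": phrase,
        "year_start": 2017,
        "year_end": 2019,
        "corpus": 26,
        "smoothing": 1,
        "case_insensitive": "true",
    }
    data = requests.get(URL, params=params).json()
    if not data:
        return 0.0
    series = data[0]["timeseries"]  # yearly relative frequencies
    return sum(series) / len(series)

phrases = ["people", "US presidents", "New York City",
           "electric cars", "vegan food", "the best"]
freqs = {p: phrase_frequency(p) for p in phrases}

# Arbitrary outlier rule: drop anything far more frequent than the median phrase.
cutoff = 100 * statistics.median(freqs.values())
pruned = [p for p, f in freqs.items() if f <= cutoff]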

But sadly this (undocumented) API can't handle a lot of traffic: I often get HTTP 429 (Too Many Requests) errors. The obvious client-side mitigation, throttling plus retrying on 429, only postpones the problem rather than lifting the limit (a sketch follows below). Is there a better way to interact with the Google Ngram API? Or does anyone know of other APIs/web services that provide the same functionality, i.e. that let users retrieve term-frequency data for multi-word expressions from a very large corpus? Thanks in advance!
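
For reference, this is the kind of backoff wrapper I mean (arbitrary delays; assumes a Retry-After header, if present, is given in seconds):

import time
import requests

def get_with_backoff(url, params, max_retries=5):
    """GET url, sleeping and retrying on HTTP 429 with exponential backoff."""
    delay = 2.0
    for _ in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor Retry-After if the server sends it; otherwise back off exponentially.
        wait = float(response.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2
    raise RuntimeError("still rate-limited after retries")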


Solution

  • There is also NGRAMS, which lets you search version 3 of this dataset and exposes a REST API, as sketched below.
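
A minimal sketch of a lookup against it; the host api.ngrams.dev, the eng corpus name, the search endpoint, and the absTotalMatchCount field are assumptions from memory, so check the service's documentation for the exact shape:

import requests

def ngrams_count(phrase):
    """Total corpus occurrences of phrase via the NGRAMS REST API (assumed shape)."""
    # Assumed URL scheme: https://api.ngrams.dev/{corpus}/search?query=...
    resp = requests.get("https://api.ngrams.dev/eng/search", params={"query": phrase})
    resp.raise_for_status()
    results = resp.json().get("ngrams", [])
    # absTotalMatchCount (assumed field name): total matches across the corpus.
    return results[0]["absTotalMatchCount"] if results else 0

print(ngrams_count("the best"))  # very common phrases should show far larger counts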