google-cloud-platformgoogle-bigquerygoogle-patent-search

Querying the Google Patent Data on Big Query processes way too much data


I am trying to programmatically retrieve patent data through the BigQuery API and the Google Patents Public Data, however, I found that each query (that retrieves several fields) consumes excessive amounts of data, making it almost impossible in the free tier, and ridiculously expensive (like 1 or 2 $ per query) if paying for it, as each query appears to process like 300 Gb. This does not look right to me at all, so I would like to know what I am doing wrong here...

So if I do a simple search in the console, such as:

SELECT
  publication_number
FROM
  patents-public-data.patents.publications
WHERE
  publication_number = "US-9370603-B2"
LIMIT 1

It consumes like 2.3 Gb, which is a non-negligible part of the free 1 TiB, but still something reasonable, at least to test it and play around a bit, and using it in an app in a limited way.

However, if I create a script to access to patent data through the API, and I add the dry_run_job to estimate how much processing is needed to obtain several fields of that patent, such as dates, claims, titles, etc., it says that it consumes around 300 Gb!!! I could not see in the dashboard what was my usage, since in the free plan it does not appear, but I surely run our of the monthly free processing data just testing a few times! How is it possible? I must be doing something wrong, it cannot be that finding a patent number takes 2-3 Gb, and getting that patent related data consumes one hundred times more. If it is already found!, can't it be like accessing linked data without searching whole columns every time? I am sorry for the dumb questions (I know nothing of SQL) but I just cannot understand how is this possible. I mean, if I enter the webpage of Google Patents, I can search for patents all day long, for free, but if I search for patent data with a publication number through the API, I only get to do it like 3-4 times in a month? while processing gargantuan amounts of data? That does not make much sense, what am I missing here?

This is an example of the query that processes all the data in the world:

query = f"""
            SELECT
              publication_number,
              ARRAY(
                SELECT text
                FROM UNNEST(title_localized)
                WHERE language = "en"
              ) AS title_en,
              ARRAY(
                SELECT text
                FROM UNNEST(abstract_localized)
                WHERE language = "en"
              ) AS abstract_en,
              filing_date,
              application_number,
              assignee,
              ARRAY(
                SELECT text
                FROM UNNEST(claims_localized)
              ) AS claims_text
            FROM
              `patents-public-data.patents.publications`
            WHERE
              publication_number IN UNNEST(@pub_numbers)
            """

Thanks for any insight that you can give me.

By the way, I could not find in the database the patent publication "US-20150258241-A1" even though it appears in the Google webpage, what I can find though is its publication number once it was granted (US9370603B2). Since I cannot play around much, I don't know if it is that particular patent application (published) number that does not appear or is it in general, or maybe for US patents they do not store the patent application (publication) number, and only the US application numbers that do not have kind code, in this case US14/657,274.


Solution

  • You started with an example which has a lot of data and thus is maybe not a good starting task.

    There are several tables in the dataset patents-public-data of different years. Thus I assume that you need to query the right table to find a publication_number. Still each of these tables has a size of several TB.

    There are three ways to minimize query costs:

    Since the original tables are not partitioned, we query this table once filtered for the US subset (LIKE "US%") and add a partition column named number. We save this result to have for the following queries less costs:

    CREATE OR REPLACE TABLE
      us.publication_201710_us
    PARTITION BY
      RANGE_BUCKET(number,GENERATE_ARRAY(-1,10000,3)) AS
    SELECT
      IFNULL(SAFE_CAST("0"||REGEXP_EXTRACT(publication_number,r"-(\d{1,4})") AS int64),0) AS number,
      *
    FROM
      `patents-public-data.patents.publications_201710`
    WHERE
      publication_number LIKE "US%"
    

    For tempory usage, this table can be found in a public accessible location: decent-stacker-214916.us.publication_201710_us

    SELECT
    publication_number,number,
     # *
    FROM
      `decent-stacker-214916.us.publication_201710_us`
    WHERE
      publication_number = "US-9370603-B2"
      AND number=9370 -- first 4 digit of publication_number --
    

    Please examine yourself by query all columns (*) how the number of columns effects the billing. The last where filter reduces the query by only looking a partitioned section of the table for the data. The partition was done by the first 4 digits of the publication_number.

    Searching for the "US-20150258241-A1" will return no results. But you know the title "Injectable alloplastic implants and methods of use thereof" from the link to the patent publication.

    SELECT
      publication_number,
      number,
      text,
      #*
    FROM
      `decent-stacker-214916.us.publication_201710_us`,
      UNNEST(title_localized) AS text
    WHERE
      UPPER(text.text) LIKE UPPER("%Injectable%alloplastic%implants%")
      AND number=2015