google-bigqueryquery-optimizationcost-based-optimizercost-management

How to limit BigQuery query size for testing a query sample through the web user-interface?


I would like to know if it is possible to limit the bigquery query size when running a query through the web user-interface?

My idea is just to test the query but instead of querying all my tables; I would like just to query a part of it with for instance a number of row.

Limit is not optimizing my query cost, so the idea is to find a function similar to "row_number" or "fetch".

Sorry I'm a marketer and not a developer, so thank you in advance for your kind help.


Solution

  • How to limit BigQuery query size for testing ... ?

    1 - Try to minimize number of tables involved in your testing In your query – there are 60+ tables involved for respectively dates between 2016-12-11 and nowadays

    SELECT <fields_list> FROM
    TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
      TIMESTAMP('20161211'),
      TIMESTAMP('20170315'))
    

    Instead you can use same day as start and end of time range, thus drastically reducing number of involved tables (down to just one table) and overall scan size. For example

    SELECT <fields_list> FROM
    TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
      TIMESTAMP('20161211'),
      TIMESTAMP('20161211'))
    

    2 - Minimize number of rows. Ability to do so really depends on how your table is being loaded with data. If table loaded incrementally - you can use so called table decorators.
    Note - this technique works with tables within last 7 days

    For example, below will scan only data that was in table at one hour ago (so called snapshot decorator)

    SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212@-3600000]  
    

    This works well with the most recent day's table especially at the start of the day when size of table is not big yet

    So, to limit further, you can use below version (so called range decorator) - gives you data added between one hour and half an hour ago

    SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212@-3600000--1800000]  
    

    Finally, @0 is a special case that references the oldest possible snapshot of the table: either 7 days in the past, or the table's creation time if the table is less than 7 days old. For example

    SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170210@0]  
    

    3 - Test against Sampled Table. If you expect experimenting with your query again and again - you can first prepare downsized version of your table with just as many rows as you need and applying sampling logic that fit in your business logic. To limit number of rows you can use LIMIT Clause. To get random rows you can use RAND function for example
    After sampled table is prepared - run all your query against it till when you have final version - after this - you can run it against your original table(s)
    And btw, to create sampled table you need to set destination table under options in Web UI.