Tags: sql, amazon-web-services, amazon-dynamodb, amazon-redshift, data-warehouse

DynamoDB Scan with Filter v/s Redshift query operation


I was going through the DynamoDB Scan operation and was wondering how it differs from performing a query operation on Redshift.

A filter expression is applied after a Scan finishes but before the results are returned. Therefore, a Scan consumes the same amount of read capacity, regardless of whether a filter expression is present.

Is this case different for Redshift queries, or for any SQL data warehouse?


Solution

  • The biggest advantage Redshift has in this case is its use of zone maps (block metadata). Every 1 MB block of column data has metadata associated with it that contains the min and max values for that block, as well as the min and max row numbers. For straightforward WHERE clauses on a column (WHERE col_A = 42), Redshift compares this value to the metadata for the column to decide which blocks could possibly hold a matching value. Only THEN does the table scan happen, reading just those blocks that can contain the needed data.

    This can massively speed up table scans and queries on large tables, or it can do nothing for you at all. It all comes down to that metadata and how selective it is, which in turn is all about picking SORTKEYs that maximize the metadata's effectiveness. In the example above, if col_A is the sort key, then only a small percentage of blocks will have min/max metadata that includes 42; but if the table is sorted on some other, non-correlated column, then 42 might fall between min and max in every block's metadata. This is why so much attention is paid to SORTKEYs in Redshift when working with very large tables.
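    As a sketch of what this looks like in practice (the table and column names here are hypothetical, chosen only for illustration):

    ```sql
    -- With a SORTKEY on event_time, each 1 MB block's min/max
    -- metadata covers a narrow, non-overlapping time range.
    CREATE TABLE events (
        event_time  TIMESTAMP,
        user_id     BIGINT,
        payload     VARCHAR(256)
    )
    DISTSTYLE EVEN
    SORTKEY (event_time);

    -- Because the table is sorted on event_time, Redshift can compare
    -- this predicate's range against each block's min/max metadata and
    -- skip most blocks before reading any row data.
    SELECT COUNT(*)
    FROM events
    WHERE event_time BETWEEN '2023-01-01' AND '2023-01-02';
    ```

    The same query against a table sorted on an unrelated column would likely touch nearly every block, since any given day could fall within most blocks' min/max range.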

    There is also the question of which types of WHERE clauses can take advantage of the metadata check. The main requirement is that the WHERE clause be constructed as <raw column> compared to <some pre-computable value(s)>. If you run a function on the column [CAST(col_A AS TEXT) = '42'], this will break the metadata comparison, but [col_A = CAST('42' AS INT)] won't. These metadata checks can be applied to as many columns as you like, each one possibly reducing the number of rows needing to be scanned.
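    To make the contrast concrete (again assuming a hypothetical table `events` with a BIGINT `user_id` column):

    ```sql
    -- Breaks the metadata check: the function is applied to the raw
    -- column, so the stored min/max values for each block can no
    -- longer be compared directly against the predicate.
    SELECT * FROM events WHERE CAST(user_id AS TEXT) = '42';

    -- Preserves the metadata check: the cast is applied to the
    -- constant, which can be pre-computed once and then compared
    -- against every block's min/max metadata.
    SELECT * FROM events WHERE user_id = CAST('42' AS BIGINT);
    ```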

    This can give Redshift a significant edge over DynamoDB for scanning very large sets of data, especially when a large subset of the data needs to be processed (an analytics workload). However, this edge only materializes when the metadata enables it, and setting that up usually takes some work.