amazon-redshift

Redshift doesn't recognize newly loaded data as pre-sorted


I am trying to load very large volumes of data into redshift into a single table that will be too cost prohibitive to Vacuum once loaded. To avoid having to vacuum this table, I am loading data using COPY command, from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.

However after loading the first two files, I find that redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would redshift not recognize the new incoming data as already sorted? Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?

I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).

Official Answer by Redshift Support:

Here is what we say officially: http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html

When your table has a sort key defined, the table is divided into 2 regions:

  • sorted, and
  • unsorted

As long as you load data in sorted key order, even though the data is in the unsorted region, it is still in sort key order, so there is no need for VACUUM to ensure the data is sorted. A VACUUM is still needed to move the data from the unsorted region to the sorted region, however this is less critical as the data in the unsorted region is already sorted.


Solution

  • Storing sorted data in Amazon Redshift tables has an impact in several areas:

    While you might want to choose a SORTKEY that improves sort speed (eg Change order of sortkey to descending), the primary benefit of a SORTKEY is to make queries run faster by minimizing disk access through the use of Zone Maps.

    I admit there doesn't seem to be a lot of documentation available about how Zone Maps work, so I'll try and explain it here.

    Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.

    Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.

    Once Redshift locates the blocks with desired data, it will load those blocks into memory and will then perform the query across the loaded data.

    Queries will run the fastest in Redshift when it only has to load the fewest possible blocks from disk. Therefore, it is best to use a SORTKEY that commonly matches WHERE clauses, such as timestamps where data is often restricted by date ranges. Sometimes it is worth setting the SORTKEY to the same column as the DISTKEY even though they are used for different purposes.

    Zone maps can be viewed via the STV_BLOCKLIST virtual system table. Each row in this table includes:

    I suspect that the Sorted flag is set after the table is vacuumed. However, tables do not necessarily need to be vacuumed. For example, if data is always appended in timestamp order, then the data is already sorted on disk, allowing Zone Maps to work most efficiently.

    You mention that your SORTKEY is "a composite key using 3 fields". This might not be the best SORTKEY to use. It could be worth running some timing tests against tables with different SORTKEYs to determine whether the composite SORTKEY is better than using a single SORTKEY. The composite key would probably perform best if all 3 fields are often used in WHERE clauses.