
What is the max duration for the creation of a DynamoDB GSI?


I understand that whenever I create a Global Secondary Index (GSI) for a DynamoDB table it will take some time to create that GSI (depending on table size).

From what I understand, backfilling items from the base table into the GSI consumes only the WCU of the GSI.

Let's assume I have a DynamoDB table with terabytes of data in it. If I create a GSI with 1 WCU, how long will it take for the GSI to be created (if all the items and attributes have to be projected)? Could it be as long as multiple months? (The docs state it takes around 5 minutes.)


Solution

  • Indeed, when you add a GSI to a table with pre-existing data, a so-called backfilling process begins that reads the table's data and writes it to the GSI. There is no guarantee that this process finishes in 5 minutes. The documentation explains that the base table's RCU are not consumed, but the new index's WCU are, so if you provision too few WCU on the index, the backfilling will be slow. For example, this document, section "Adding a Global Secondary Index to a large table", says that:

    The time required for building a global secondary index depends on several factors, such as the following: ... The provisioned write capacity of the index ... If you are adding a global secondary index to a very large table, it might take a long time for the creation process to complete.

    ... If the provisioned write throughput setting on the index is too low, the index build will take longer to complete. To shorten the time it takes to build a new global secondary index, you can increase its provisioned write capacity temporarily. As a general rule, we recommend setting the provisioned write capacity of the index to 1.5 times the write capacity of the table. This is a good setting for many use cases. However, your actual requirements might be higher or lower.

    The document recommends that you look at the OnlineIndexPercentageProgress CloudWatch metric to understand the amount of progress that the backfilling is making.

    The same document also raises two more reasons why the backfilling process might be slower than you hoped:

    1. Writes to the table happening in parallel with the backfilling. Remember that these writes also need to be applied to the GSI, so they consume some of the provisioned WCU of the GSI, meaning that backfilling has to write more slowly.
    2. Although you don't pay for the backfilling process's reads of the original table (the document above says that "DynamoDB uses internal system capacity to read from the table"), there is still a limit on how quickly those reads can happen. Even if you provision the GSI with 100,000 WCU, DynamoDB may simply not be able to read 100,000 items per second from the base table, so the backfilling speed will be limited by the unused spare capacity of the nodes holding your base table.
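To get a feel for the numbers in the question (a terabyte-scale table backfilled through 1 WCU), a back-of-envelope estimate helps. The sketch below is a rough lower-bound model, not an AWS guarantee: it assumes one WCU writes one item of up to 1 KB per second, and ignores the parallel-write and read-side effects listed above.

```python
import math

def estimate_backfill_seconds(item_count, avg_item_kb, wcu):
    """Rough lower bound on GSI backfill time.

    One WCU covers one write of up to 1 KB per second, so an item of
    avg_item_kb consumes ceil(avg_item_kb) WCU per write.
    """
    wcu_per_item = math.ceil(avg_item_kb)
    return item_count * wcu_per_item / wcu

# 1 TB of 1 KB items is roughly a billion items; at 1 WCU the backfill
# would need about a billion seconds -- decades, so "multiple months"
# is, if anything, optimistic at that setting.
items = 1_000_000_000
seconds = estimate_backfill_seconds(items, avg_item_kb=1, wcu=1)
print(f"{seconds / (60 * 60 * 24 * 365):.0f} years")

# The same backfill with 40,000 WCU drops to about 7 hours.
seconds_fast = estimate_backfill_seconds(items, avg_item_kb=1, wcu=40_000)
print(f"{seconds_fast / 3600:.1f} hours")
```

In other words, the 5-minute figure only holds for small tables or generously provisioned indexes; the duration scales linearly with item count and inversely with the index's WCU.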
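To watch the backfill through the OnlineIndexPercentageProgress metric mentioned above, you can poll CloudWatch. A minimal sketch, assuming boto3: the table and index names are placeholders, and the request is built as a plain dict so it can be inspected without AWS credentials before being passed to get_metric_statistics.

```python
from datetime import datetime, timedelta, timezone

def progress_metric_request(table_name, index_name, minutes=30):
    """Build a CloudWatch GetMetricStatistics request for the
    OnlineIndexPercentageProgress metric of a GSI backfill."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "OnlineIndexPercentageProgress",
        "Dimensions": [
            {"Name": "TableName", "Value": table_name},
            {"Name": "GlobalSecondaryIndexName", "Value": index_name},
        ],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average"],
    }

# Usage (requires AWS credentials; "MyTable"/"MyGsi" are placeholders):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   resp = cw.get_metric_statistics(**progress_metric_request("MyTable", "MyGsi"))
#   for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
#       print(point["Timestamp"], point["Average"])
```

The metric reports the backfill's percentage completion per index, so a flat curve over a long window is a hint that the index's provisioned WCU is the bottleneck.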