python, amazon-dynamodb, database-scan

Python DynamoDB scan throughput error


I have a database table with the attributes segment_id, beat_id, and patient_id.

In DynamoDB version 2, when I run a scan with the following command, I get values for only one specific patient. When I pass other segment and patient values, I get a ThroughputExceededException.

table.scan(segment_id__eq='xCrKYvnfZlm6VCQ',beat_id__gt=1,patient_id__eq='3854520.edf')
  1. Why does it only work for 1 patient and give a ThroughputExceededException for others?

Solution

  • The scan you are performing reads every item in the DynamoDB table and returns only those that meet the specified conditions (segment_id__eq='xCrKYvnfZlm6VCQ', beat_id__gt=1, patient_id__eq='3854520.edf'). Every item read consumes provisioned read capacity, even if it does not meet the conditions. If you want to retrieve a single record, it is most efficient to use the GetItem or BatchGetItem calls, because you only consume read capacity for the specified items. If you want to retrieve a specific range of records, it is more efficient to define a range key, or a global or local secondary index, so that you can Query the items, because you only consume read capacity for the items that meet the key conditions. Could you please provide more information about the table schema?

    See this developer guide that describes the differences between scan and query in detail.

    An example of using a query would be if segment_id was the hash key and beat_id was the range key. You could query all records with a specified segment_id and specified beat_id range. This will only consume the read capacity required to retrieve those specific records, rather than reading the entire table. Additionally, you can apply a query filter to other attributes like patient_id so only the results you want are returned.
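As a rough sketch of the query described above, the snippet below builds the request parameters for the low-level DynamoDB Query operation. It assumes segment_id (a string) is the hash key and beat_id (a number) is the range key, which the question has not confirmed. The table name "beats" and the helper build_query_params are hypothetical. It also uses the parameter names of the boto3 low-level client rather than the older table.scan(...) style from the question.

```python
def build_query_params(table_name, segment_id, min_beat_id, patient_id):
    """Build keyword arguments for the low-level DynamoDB Query operation.

    Assumes segment_id (string) is the hash key and beat_id (number) is the
    range key; patient_id is an ordinary attribute filtered after the read.
    """
    return {
        "TableName": table_name,
        # Only items under this hash key and range are read and billed.
        "KeyConditionExpression": "segment_id = :seg AND beat_id > :beat",
        # The filter trims the result set; it does not reduce consumed capacity.
        "FilterExpression": "patient_id = :pat",
        "ExpressionAttributeValues": {
            ":seg": {"S": segment_id},
            ":beat": {"N": str(min_beat_id)},  # numbers are sent as strings
            ":pat": {"S": patient_id},
        },
    }


# "beats" is a hypothetical table name.
params = build_query_params("beats", "xCrKYvnfZlm6VCQ", 1, "3854520.edf")

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   response = boto3.client("dynamodb").query(**params)
```

Building the parameters separately from sending them keeps the key condition (which decides what is read) visibly apart from the filter (which only decides what is returned).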

    More details on scan/query consumed capacity:

    Query and scan both use eventually consistent reads by default, so one read capacity unit lets you read up to 8KB per second.
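To see why a full scan can exhaust a small provisioned capacity, here is a back-of-the-envelope estimate based on the 8KB per second per unit figure above. The function name and the example table size are made up for illustration, and the result is a lower bound, not a measurement.

```python
def min_scan_seconds(table_size_kb, read_capacity_units, consistent=False):
    """Lower bound on the time needed to scan a table without throttling."""
    # One read capacity unit covers 4KB/s strongly consistent, 8KB/s eventually consistent.
    kb_per_unit_per_second = 4 if consistent else 8
    return table_size_kb / (read_capacity_units * kb_per_unit_per_second)


# Hypothetical numbers: an 80,000KB table with 10 provisioned read capacity units.
print(min_scan_seconds(80_000, 10))  # 1000.0
```

A scan that tries to go faster than this bound will exceed the provisioned read rate and get throttled.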

    If you still experience throttling, here are some ways to mitigate the exception:

    1. Increase the time between requests to keep your read rate under your provisioned read capacity. The SDK retries throttling exceptions by default.
    2. Increase your provisioned read capacity to account for the item size and request rate. See these resources on how provisioned throughput works and calculating item sizes.
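The first mitigation can be sketched as a retry loop with exponential backoff. This is only an illustration: ThrottledError is a hypothetical stand-in for whatever throttling exception your SDK raises, and fetch_page stands for any function that issues one scan or query request.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the SDK's throughput-exceeded exception."""


def call_with_backoff(fetch_page, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch_page(), waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch_page()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter: up to 0.1s, 0.2s, 0.4s, ...
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The sleep function is a parameter so the loop can be exercised without real delays. Since the SDK already retries on its own, a wrapper like this mainly helps when you page through a large result set and want to pace the requests yourself.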

    More details on scan pricing:

    To figure out how much read capacity you need to use Scan or Query to read items in your table:

    1. Figure out how much data you are reading (add up the size of all the items that will be read)
    2. Round up to the nearest multiple of 4KB
    3. Divide by 4KB (strongly consistent reads) or 8KB (eventually consistent reads) to get the number of capacity units that will be consumed.

    To figure out how much read capacity you need to use GetItem or BatchGetItem to read items in your table:

    1. For each individual item, round up that item’s size to the nearest multiple of 4KB
    2. Divide by 4KB (strongly consistent reads) or 8KB (eventually consistent reads) to get the number of capacity units that will be consumed by each individual item.
    3. Add up the capacity units that will be consumed by each item to get the total number of capacity units that will be consumed.

    As an example, suppose I have 10 items in my table, each 1KB in size, and I plan to retrieve them all with eventually consistent operations. If I retrieve them with GetItem, each item is rounded up to 4KB and consumes 4KB / 8KB = 1/2 of a read capacity unit, so the total cost is 1/2 * 10 = 5 read capacity units. If I retrieve them with scan, the combined size of all items is 10KB, which rounds up to 12KB; 12KB / 8KB = 1.5, so the scan consumes about 2 read capacity units.
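The two calculations above can be written out as a small sketch. The function names are made up for illustration, and the code only follows the rounding steps listed in this answer; it does not call DynamoDB.

```python
import math


def scan_read_units(item_sizes_kb, consistent=False):
    """Capacity for Scan/Query: sum the sizes, round up to 4KB, then divide."""
    rounded_kb = math.ceil(sum(item_sizes_kb) / 4) * 4
    return rounded_kb / (4 if consistent else 8)


def get_item_read_units(item_sizes_kb, consistent=False):
    """Capacity for GetItem/BatchGetItem: round each item up to 4KB first."""
    kb_per_unit = 4 if consistent else 8
    return sum(math.ceil(size / 4) * 4 / kb_per_unit for size in item_sizes_kb)


sizes = [1] * 10  # ten 1KB items, as in the example
print(get_item_read_units(sizes))  # 5.0
print(scan_read_units(sizes))      # 1.5
```

Rounding the scan result up to a whole unit with math.ceil gives the 2 read capacity units quoted in the example.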