javascriptamazon-dynamodbdynamoose

Scan until x items are found?


I'm writing a query to find all users whose usernames begin with a given prefix. I only want the query to return up to 10 items. My query right now is

User.scan('username')
  .beginsWith(req.query.prefix)
  .limit(req.query.limit)
  .exec((err, users) => {
    ...
  });

After reading the dynamoose docs for .limit() more carefully, I realized that the limit is on the number of table entries it checks, not the number of entries it returns. So if I have 10 users, 5 of which have usernames that start with 'm', a query like query: { prefix: 'm', limit: 5} could potentially return 0 items.

I understand that I could query for all the users in the database, and then only return some of them, but it wouldn't be scalable. How can I query the database so it stops looking through the table exactly when 10 matches have been found?


Solution

  • Doing what you want is not directly possible in DynamoDB. As you discovered, the Limit options says how many items to check - not how many to return. What one normally does when this needs arise is to pick some reasonable page size, e.g., 1000 items or (the default if you don't specify a number) 1MB of items, and then read such a page size at a time. If the first page already has more than your desired number of 5 items, you stop the scan (and don't need to scan the entire table). If the first page only resulted in one matching item to be returned, you continue to read additional pages - until you have found a total of 5 items.

    If you're curious why DynamoDB doesn't have an option to directly ask for "5 matching items", imagine what can happen if the entire database has fewer than 5 matching items. DynamoDB would need to scan the entire table to look for the first 5 items before returning anything. If this takes an hour, the client would not receive any result for a whole hour! Database clients don't normally work this way, and they time out before the hour has passed. More importantly, if the server restarts during this hour, the work is lost and cannot be resumed. So with DynamoDB's API the server only reads 1MB (or whatever) chunks of data, and after each chunk, the server returns to the client the answer "no data yet" - and the client needs to resume the scan. There is some overhead, but not much (presumably the work - and the cost - of scanning 1MB of data is more than sending the request), and all the above problems are avoided.

    Finally, another comment regarding your use case: whole-table scans, like you're doing, are very expensive - both in time and in cost. If you have a table with 1 million entries and expect the scan to only yield 5 items, you would be paying to read 1 million entries to get just 5. If these queries are common, you'll probably want to rethink your data model so that typical queries will require reading just one item - or at most querying just one partition - and not a full-table scan.