
Querying one record from tens of millions of records in Azure Table Storage


I have a typical scenario where a consumer synchronously calls an Azure Function (EP1), which then queries Azure Table Storage (holding 5 million records) based on the input parameters of the Azure Function API. The table has the following columns:

  1. Order Number (incremental number)
  2. IsConfirmed (can have value Y or N)
  3. Type of Order (can be of 6 types maximum)
  4. Order Date
  5. Order Details
  6. UUID

When the consumer queries, it generally searches by Order Number and expects the Order Date and Order Details in the response, along with the Order Number.

For this, we had chosen:

  1. Partition Key: IsConfirmed + Type of Order
  2. Row Key: UUID

With 5 million records and this partition key, the partition being searched often holds more than 3 million records (most orders have IsConfirmed as Y and one specific Type of Order among the six), and the table query takes more than 5 minutes. As a result, the consumer generally times out, since the wait configured on the consumer side is 60 seconds.

So I am looking for recommendations on how to do this efficiently.

  1. Can we choose Order Number as the partition key (but that will create 5 million partitions), or a combination of Order Number + IsConfirmed + Type of Order?
  2. Ours is a write-heavy Java application, and reads happen much less often.

+++++++++++ UPDATE +++++++++++++++

As suggested by Gaurav in the answer, after making orderid the partition key, the query works as expected.

That brings us to the next problem: we also have other API queries where only the order date and type are used as search criteria.

Since these criteria don't match the partition key, this second type of query basically does a full table scan, and the consumer times out again.

So what should the design be to handle these types of queries? The Azure docs suggest creating a separate table where order type + order date becomes the partition key. However, that means that whenever we write to the table, we will have to write to both tables (one with orderid as the partition key and the other with order date + type as the partition key).


Solution

  • Can we choose Order Number as the partition key (but that will create 5 million partitions), or a combination of Order Number + IsConfirmed + Type of Order?

    You can certainly choose the order number as the partition key; there is nothing wrong with having a large number of partitions. However, please keep in mind that the partition key value is of string type, so keys compare as strings rather than as numbers. What you may want to do is pad your order number with some character (say 0) so that all of your order numbers have the same length.
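
    As a minimal sketch of the padding idea in plain Java (no Azure SDK involved): the class name `OrderKeys` and the width of 10 digits are assumptions made for illustration, not something from the question.

    ```java
    import java.util.Locale;

    public class OrderKeys {
        // Assumed fixed width; 10 digits covers order numbers up to 9,999,999,999.
        static final int WIDTH = 10;

        // Left-pads the order number with zeros so every key has the same length.
        static String partitionKey(long orderNumber) {
            if (orderNumber < 0) {
                throw new IllegalArgumentException("order number must be non-negative");
            }
            String key = String.format(Locale.ROOT, "%0" + WIDTH + "d", orderNumber);
            if (key.length() > WIDTH) {
                throw new IllegalArgumentException("order number exceeds " + WIDTH + " digits");
            }
            return key;
        }

        public static void main(String[] args) {
            // Unpadded strings sort "9" after "10"; padded keys sort numerically.
            System.out.println("9".compareTo("10") > 0);                          // true
            System.out.println(partitionKey(9).compareTo(partitionKey(10)) < 0);  // true
            System.out.println(partitionKey(42));                                 // 0000000042
        }
    }
    ```

    The same function has to be used on both the write path and the read path, otherwise a lookup for order 42 would not find the entity stored under the padded key.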

    In this case, I would actually recommend that you keep the row key empty.

    You may also want to think about storing multiple copies of the same data with different partition key/row key combinations, depending on your querying requirements. For example, if you were to query by order date, you may want to make another copy of the data with the order date as the partition key.
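
    One way the two copies could be laid out, sketched with plain Java maps rather than SDK entity types. Everything beyond `PartitionKey` and `RowKey` is an assumption: the property names, the `date_type` partition key format, and the choice of the padded order number as the row key of the second copy are illustrative only.

    ```java
    import java.time.LocalDate;
    import java.util.LinkedHashMap;
    import java.util.Locale;
    import java.util.Map;

    public class OrderCopies {
        // Zero-pads the order number to an assumed fixed width of 10 digits.
        static String pad(long orderNumber) {
            return String.format(Locale.ROOT, "%010d", orderNumber);
        }

        // Copy for lookups by order number: padded order number as partition key, empty row key.
        static Map<String, Object> byOrderNumber(long orderNumber, String type, LocalDate date, String details) {
            Map<String, Object> entity = payload(orderNumber, type, date, details);
            entity.put("PartitionKey", pad(orderNumber));
            entity.put("RowKey", "");
            return entity;
        }

        // Copy for lookups by date and type: ISO date + type as partition key (assumed layout),
        // padded order number as row key so each order stays unique within its partition.
        static Map<String, Object> byDateAndType(long orderNumber, String type, LocalDate date, String details) {
            Map<String, Object> entity = payload(orderNumber, type, date, details);
            entity.put("PartitionKey", date + "_" + type);
            entity.put("RowKey", pad(orderNumber));
            return entity;
        }

        // Non-key properties shared by both copies; the property names are hypothetical.
        private static Map<String, Object> payload(long orderNumber, String type, LocalDate date, String details) {
            Map<String, Object> entity = new LinkedHashMap<>();
            entity.put("OrderNumber", orderNumber);
            entity.put("OrderType", type);
            entity.put("OrderDate", date.toString());
            entity.put("OrderDetails", details);
            return entity;
        }

        public static void main(String[] args) {
            LocalDate date = LocalDate.of(2024, 1, 15);
            System.out.println(byOrderNumber(42, "A", date, "sample"));
            System.out.println(byDateAndType(42, "A", date, "sample"));
        }
    }
    ```

    Note that the two writes go to different partitions (or tables), so they cannot share a single entity group transaction; the writer would need its own retry or reconciliation step if the second write fails.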

    Generally speaking, it is recommended that you do point queries (queries that include both the partition key and the row key). The next best option is to query by partition key (you would want to keep the amount of data in each partition small so that you're not doing large partition scans). All other options would result in a full table scan, which is not at all recommended.
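
    To make the three query shapes concrete, here is a small sketch that only builds the OData filter strings; it does not call the Azure SDK. The class name `OrderFilters` and the `OrderType` property name are hypothetical.

    ```java
    public class OrderFilters {
        // Wraps a value in single quotes, doubling any embedded quote as OData requires.
        static String quote(String value) {
            return "'" + value.replace("'", "''") + "'";
        }

        // Point query: both keys given, so it addresses a single entity.
        static String pointQuery(String partitionKey, String rowKey) {
            return "PartitionKey eq " + quote(partitionKey) + " and RowKey eq " + quote(rowKey);
        }

        // Partition query: only the partition key given, so it reads within one partition.
        static String partitionQuery(String partitionKey) {
            return "PartitionKey eq " + quote(partitionKey);
        }

        // No key in the filter: the service has to scan the whole table.
        static String propertyQuery(String property, String value) {
            return property + " eq " + quote(value);
        }

        public static void main(String[] args) {
            System.out.println(pointQuery("0000000042", ""));
            System.out.println(partitionQuery("0000000042"));
            System.out.println(propertyQuery("OrderType", "A"));
        }
    }
    ```

    With the order number as partition key and an empty row key, the first filter is the one the lookup by order number would use.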

    You may find this link useful: https://learn.microsoft.com/en-us/azure/storage/tables/table-storage-design-guidelines.