Tags: azure, kql, azure-data-explorer, azure-eventhub

Duplicate record handling with Azure Data Explorer


I have the following scenario:

We have ADX tables that ingest data via Event Hubs. Some of these tables contain over 900 million records.

This is a constant stream of data being ingested into ADX, and we have a frontend portal that queries the ADX data and creates project-specific visualizations for our clients.

Duplicate records, however, have a significant impact on how the data is visualized, because of the calculations in the KQL queries that we run from our frontend.

We have to make sure that the event hubs do not publish duplicate data into our ADX tables.

We also have a backfill component that allows us to backfill data for specific clients and specific scenarios. However, when we backfill, we also need to make sure that we don't create duplicate records in our ADX tables.

I've looked at the following article:

But:

I'm thinking of using update policies on the tables that ingest the data: add a hash column and then keep only the records that are not yet present in the target table of the update policy (see the sketch below).
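A minimal KQL sketch of that idea, assuming a hypothetical staging table RawEvents and a deduplicated target table DedupedEvents (the column names and the hash input are illustrative, not from the original post):

    // Function invoked by the update policy: hash the business key and drop rows
    // whose hash already exists in the target table.
    .create-or-alter function DedupNewEvents() {
        RawEvents
        | extend RecordHash = hash_sha256(strcat(tostring(ClientId), "|", tostring(EventTime), "|", tostring(Value)))
        | join kind=leftanti (DedupedEvents | project RecordHash) on RecordHash
    }

    // Attach the function as an update policy on the deduplicated table.
    .alter table DedupedEvents policy update
    @'[{"IsEnabled": true, "Source": "RawEvents", "Query": "DedupNewEvents()", "IsTransactional": true}]'

Note that joining against a very large target table on every ingested batch is expensive; this is essentially the ingestion-latency problem described in the accepted solution below.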

We could then set a short retention period (a TTL of a few hours) on the tables that ingest the data, to keep storage costs to a minimum, but I was wondering whether there are other options that somebody might have come across.
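If the staging table only needs to hold data long enough for the update policy to run, a short soft-delete retention period can be set on it; a sketch, reusing the hypothetical RawEvents table from above (the 4h value is illustrative):

    // Keep raw ingested rows only for a few hours; deduplicated data lives in the target table.
    .alter-merge table RawEvents policy retention softdelete = 4h recoverability = disabled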

We do have the option to manipulate or add fields before publishing the data records to the event hubs.

I was wondering whether somebody else has come across this duplicate data issue when handling A LOT of data, and how they fixed it?

Thanks!


Solution

  • In the end, deduplication using a materialized view turned out to be the most performant approach: ingestion latency was getting very high with the custom deduplication mechanism (update policies without materialized views), and the only way to compensate was to scale up the SKU, which itself also has a significant impact on cost.

    However, deduplication via a materialized view still puts a certain load on the ingestion process when working with billions of rows. A minimal sketch of the approach is shown below.
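    A sketch of such a materialized view, assuming the same hypothetical RawEvents table and a unique EventId column (names are illustrative); frontend queries would then read from the view instead of the raw table:

        // Materialized view that keeps a single (arbitrary) row per EventId.
        // backfill = true materializes existing records as well, not only new ingestions.
        .create async materialized-view with (backfill = true) DedupedEventsView on table RawEvents
        {
            RawEvents
            | summarize take_any(*) by EventId
        }

    Querying DedupedEventsView returns only deduplicated rows, and the view is materialized incrementally in the background, which is why it tends to be cheaper than re-deduplicating on every ingestion, although, as noted above, it still adds some load at billions of rows.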