azure-blob-storage azure-stream-analytics stream-analytics

How to prevent doubled values in Stream Analytics caused by a blob reference input


We see an issue in Stream Analytics when using a blob reference input. After the job restarts, it outputs doubled values for anything joined to that input. I assume this happens because more than one blob is active while the job restarts. Currently we pull the files from a folder path in ADLS structured as Output/{date}/{time}/Output.json, which resolves to, for example, Output/2021/04/16/01/25/Output.json. These files contain a key that the stream data is matched on with:

    IoTData
    LEFT JOIN kauiotblobref kio
    ON kio.ParentID = IoTData.ConnectionString

I don't see any issue with the join itself, but those files are created every minute, on the minute, by an Azure Function. So it may be that when Stream Analytics starts, it picks up both the latest file and the one created right after it. (That is my guess, but I'm not sure how we would fix it.)
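
To make the guess above concrete, here is a minimal Python sketch of why two active reference snapshots would double the joined output. It does not model how Stream Analytics actually selects blobs; it only shows that a LEFT JOIN emits one row per matching reference row. The sample data, the `left_join` helper, the `source` field, and the `01/26` path (the hypothetical next-minute file) are all assumptions for illustration.

```python
from typing import List


def left_join(events: List[dict], reference_rows: List[dict]) -> List[dict]:
    """Left join events to reference rows on ConnectionString == ParentID."""
    joined = []
    for event in events:
        matches = [
            ref for ref in reference_rows
            if ref["ParentID"] == event["ConnectionString"]
        ]
        if not matches:
            # LEFT JOIN keeps the event even when nothing matches.
            joined.append({**event, "ParentID": None})
        for ref in matches:
            # One output row per matching reference row.
            joined.append({**event, **ref})
    return joined


# Hypothetical sample data: one device event and two reference snapshots
# written one minute apart, both containing the same key.
events = [{"ConnectionString": "device-1", "value": 10}]
snapshot_0125 = [{"ParentID": "device-1",
                  "source": "Output/2021/04/16/01/25/Output.json"}]
snapshot_0126 = [{"ParentID": "device-1",
                  "source": "Output/2021/04/16/01/26/Output.json"}]

single = left_join(events, snapshot_0125)
doubled = left_join(events, snapshot_0125 + snapshot_0126)
print(len(single), len(doubled))
```

If both snapshots were loaded during startup, every event for that key would be written twice, which would match the doubled values described here.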

Here's a visual in Power BI of the issue:

[Image: Peak]

[Image: Trough]

This is easily explained by looking at Cosmos DB for the device being captured: there are two entries with the same value, assetID, and timestamp but different recordID values (which just means Cosmos DB counted them as two separate events). This shouldn't be possible, because a device can't send duplicates with the same timestamp.
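
A quick way to confirm this pattern in exported documents is to group them by everything except recordID. The sketch below is in Python and assumes the documents have already been loaded as dictionaries; the field names `assetID`, `timestamp`, `value`, and `recordID` follow the description above, while the sample values and the `find_duplicates` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicates(documents: List[dict]) -> Dict[Tuple, List[str]]:
    """Return groups of recordIDs that share assetID, timestamp, and value."""
    groups = defaultdict(list)
    for doc in documents:
        key = (doc["assetID"], doc["timestamp"], doc["value"])
        groups[key].append(doc["recordID"])
    # Keep only keys that were written more than once.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample: the first two documents differ only in recordID.
documents = [
    {"recordID": "a1", "assetID": "pump-7", "timestamp": "2021-04-16T01:25:00Z", "value": 42.0},
    {"recordID": "a2", "assetID": "pump-7", "timestamp": "2021-04-16T01:25:00Z", "value": 42.0},
    {"recordID": "a3", "assetID": "pump-7", "timestamp": "2021-04-16T01:26:00Z", "value": 43.5},
]

duplicates = find_duplicates(documents)
for key, record_ids in duplicates.items():
    print(key, record_ids)
```

Only the group written twice is reported, so an empty result after a restart would suggest the doubling did not occur.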


Solution

  • This seems to be a core issue with blob reference input in Stream Analytics, since the job typically takes more than 1 minute to start. The best fix I've found is to stop the corresponding functions before starting the stream back up. I'm working to automate this through CI/CD pipelines, which is good practice anyway for editing the stream.
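
The ordering described above could be sketched in a pipeline script roughly as follows. This is a minimal Python sketch, not an Azure SDK example: `stop_function`, `start_job`, `job_is_running`, and `start_function` are hypothetical callables that you would back with whatever your pipeline uses (CLI calls, REST requests, and so on), and the polling interval and timeout are assumed values.

```python
import time
from typing import Callable


def restart_stream_safely(
    stop_function: Callable[[], None],
    start_job: Callable[[], None],
    job_is_running: Callable[[], bool],
    start_function: Callable[[], None],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop the blob-writing function, start the job, then resume the function."""
    # 1. Stop the function so no new reference blob appears during startup.
    stop_function()
    try:
        # 2. Start the stream job and wait until it reports as running.
        start_job()
        waited = 0.0
        while not job_is_running():
            if waited >= timeout_seconds:
                raise TimeoutError("stream job did not start in time")
            sleep(poll_seconds)
            waited += poll_seconds
    finally:
        # 3. Resume the function whether or not the job came up.
        start_function()


# Usage with stand-in callables that only record the order of operations.
calls = []
states = iter([False, False, True])
restart_stream_safely(
    stop_function=lambda: calls.append("stop_function"),
    start_job=lambda: calls.append("start_job"),
    job_is_running=lambda: next(states),
    start_function=lambda: calls.append("start_function"),
    sleep=lambda seconds: calls.append("sleep"),
)
print(calls)
```

Resuming the function in the `finally` block is a design choice so that a failed job start does not leave the reference files stale; drop it if you would rather investigate before resuming.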