azure-blob-storage azure-storage azure-eventhub azure-stream-analytics

Using Stream Analytics to Post Events to Event Hub from JSON Files in ADLS Gen2


I have a Spark job that writes individual JSON files to a storage account. I'm trying to use Stream Analytics (SA) to read the JSON and post events to Event Hub. It seems like it should be super simple using the no-code editor: I just define my input (ADLS Gen2) and my output (Event Hub). SA can preview the data in the JSON files, and all the connection tests for input and output succeed.

However, when I start the job and create files in the folder path, SA sees them and the input event count in the metrics is what I'd expect, but I see no output events and my watermark delay just keeps going up. The only error shows up about an hour later and says there was some sort of timeout. I'm only pushing about 12 files at a time, so I'd be hard-pressed to say volume is the issue.

All the documentation I see online is about moving data from Event Hub to Storage, with nothing on the reverse. I'm just wondering if my JSON output is messed up somehow.

My SA query is about as simple as it can get. But maybe that's part of the problem:

SELECT * INTO eventhub FROM JsonFiles

It seems super hard to troubleshoot this thing. I can't see the inputs or outputs, and it doesn't seem to generate errors; all I get is a watermark delay that keeps going up and no output events. Why don't I have output events? I think the watermark delay means I have events to output that haven't been output yet. Help?


Solution

  • So I figured this out myself. My Event Hub had a cleanup policy of "Compact" rather than "Delete". Apparently, when pushing messages to an Event Hub with the "Compact" cleanup policy, a partition key must be included, and I was not including one. The only place I found this out was the Log Analytics table named AZMSDiagnosticErrorLogs, which had a single error repeated:

    compacted event hub does not allow null message key.

    There were no error messages anywhere else that I could find.

    To fix it, I specified a column for the "Partition key column" setting in my Stream Analytics output settings.
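Since the error complains about a null message key, it may be worth checking that the column chosen as the partition key is actually populated in every record before relying on it. The following is only a minimal sketch of such a check, not part of the original fix. It assumes the files hold either a single JSON document or line-delimited JSON, that the key column is called `deviceId` (a hypothetical name), and that a missing or null value in that column would lead to the same error, which I have not verified.

```python
import json

# Hypothetical column name; replace with the column chosen as partition key.
KEY_COLUMN = "deviceId"


def parse_records(text):
    """Parse file content as one JSON document or as line-delimited JSON."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the file is line-delimited JSON, one object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return doc if isinstance(doc, list) else [doc]


def missing_key(records, key_column=KEY_COLUMN):
    """Return the records whose key column is absent, null, or empty."""
    return [
        r for r in records
        if not isinstance(r, dict) or r.get(key_column) in (None, "")
    ]


if __name__ == "__main__":
    sample = '{"deviceId": "a", "v": 1}\n{"deviceId": null, "v": 2}\n{"v": 3}\n'
    for bad in missing_key(parse_records(sample)):
        print("record without usable key:", bad)
```

Running this over a few of the Spark output files before starting the job would flag any records that lack a usable key value.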
