kqlazure-data-explorerkusto-explorer

materialized view || Low memory condition || ADX || KQL


I am trying to create materialized view on ADX using below query.

I have duplicate data in my source table and need to remove duplicates and load data into my target table using materialized view.

When I apply distinct function query is failing with LOW Memory issue, Even I tried with low Concurrency to 1 and MaxSourceRecordsForSingleIngest from 30 million to 3k records but it is still falling with Low Memory Issue.

When I remove distinct function, processing all records including duplicates, materialized query is running fine. But this is not our requirement, I need to remove row duplicates and process records.

Below is my source table looks like. And achieve target table using materialized view

let MySourceTable = datatable(col1:int, col2:int, Value:int, EventUTCDateTime:datetime)
[
1,2,10,'2022-01-01 01:00:01',
1,2,10,'2022-01-01 01:00:01',
1,2,30,'2022-01-01 01:05:01',
4,5,40,'2022-01-01 01:30:01'
];
MySourceTable
| project col1, col2, Value, EventUTCDateTime
| distinct *
| extend EventUTCDate = bin(EventUTCDateTime, 30m)
| project col1, col2, Value, EventUTCDate
| summarize cnt = count(), Totalvalue = sum(Value) by col1, col2, EventUTCDate

Below is query trying to create Materialized view on top of Source table

.create async materialized-view with ( 
    backfill=true,         
    MaxSourceRecordsForSingleIngest=3000,
        Concurrency=1, 
    effectiveDateTime=datetime(2022-01-01)
)
MyMaterialisedView  on table MySourceTable
{
MySourceTable
| project col1, col2, Value, EventUTCDateTime
| distinct *
| extend EventUTCDate = bin(EventUTCDateTime, 30m)
| project col1, col2, Value, EventUTCDate
| summarize cnt = count(), Totalvalue = sum(Value) by col1, col2, EventUTCDate
}

Below is error while creating materialized view

"State": Failed,
"Status": Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'shard: 4db9aa8d-b8f4-4ad8-82ba-7a0488c28b3a, source: (hr: '2161770503' 'Engine under memory pressure, context: collector datum materialize'): ', details: 'StgError { kind: Generic("shard:xxx), source: Some(StgError { kind: HResult(2161770503, "Engine under memory pressure, context: collector datum materialize"), source: None }) }'). (0th of 2 in an AggregateException with message: One or more errors occurred.)

Below is required output, Is there any features or configuration or query changes or any tips to over come this problem.

enter image description here


Solution

  • distinct * is a very heavy and memory consuming operation. It's identical to summarize by all-columns which means you're using 2 summarize operators in a materialized view, and that is not supported. The command should have failed due to this violation of using 2 summarize.

    The way to perform aggregations over deduplicated data using materialized views it to create one view with take_any() aggregation for deduplication and then create a materialized view over materialized view for the second aggregation (sum(), count()). In the first view (take_any()) it's better to choose the minimal set of columns for deduplication, and add the rest in the take_any(), than to add all columns in the group by keys.