I am trying to create materialized view on ADX using below query.
I have duplicate data in my source table and need to remove duplicates and load data into my target table using materialized view.
When I apply distinct function query is failing with LOW Memory issue, Even I tried with low Concurrency to 1 and MaxSourceRecordsForSingleIngest from 30 million to 3k records but it is still falling with Low Memory Issue.
When I remove distinct function, processing all records including duplicates, materialized query is running fine. But this is not our requirement, I need to remove row duplicates and process records.
Below is my source table looks like. And achieve target table using materialized view
let MySourceTable = datatable(col1:int, col2:int, Value:int, EventUTCDateTime:datetime)
[
1,2,10,'2022-01-01 01:00:01',
1,2,10,'2022-01-01 01:00:01',
1,2,30,'2022-01-01 01:05:01',
4,5,40,'2022-01-01 01:30:01'
];
MySourceTable
| project col1, col2, Value, EventUTCDateTime
| distinct *
| extend EventUTCDate = bin(EventUTCDateTime, 30m)
| project col1, col2, Value, EventUTCDate
| summarize cnt = count(), Totalvalue = sum(Value) by col1, col2, EventUTCDate
Below is query trying to create Materialized view on top of Source table
.create async materialized-view with (
backfill=true,
MaxSourceRecordsForSingleIngest=3000,
Concurrency=1,
effectiveDateTime=datetime(2022-01-01)
)
MyMaterialisedView on table MySourceTable
{
MySourceTable
| project col1, col2, Value, EventUTCDateTime
| distinct *
| extend EventUTCDate = bin(EventUTCDateTime, 30m)
| project col1, col2, Value, EventUTCDate
| summarize cnt = count(), Totalvalue = sum(Value) by col1, col2, EventUTCDate
}
Below is error while creating materialized view
"State": Failed,
"Status": Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'shard: 4db9aa8d-b8f4-4ad8-82ba-7a0488c28b3a, source: (hr: '2161770503' 'Engine under memory pressure, context: collector datum materialize'): ', details: 'StgError { kind: Generic("shard:xxx), source: Some(StgError { kind: HResult(2161770503, "Engine under memory pressure, context: collector datum materialize"), source: None }) }'). (0th of 2 in an AggregateException with message: One or more errors occurred.)
Below is required output, Is there any features or configuration or query changes or any tips to over come this problem.
distinct *
is a very heavy and memory consuming operation. It's identical to summarize by all-columns
which means you're using 2 summarize operators in a materialized view, and that is not supported. The command should have failed due to this violation of using 2 summarize.
The way to perform aggregations over deduplicated data using materialized views it to create one view with take_any()
aggregation for deduplication and then create a materialized view over materialized view for the second aggregation (sum()
, count()
). In the first view (take_any()
) it's better to choose the minimal set of columns for deduplication, and add the rest in the take_any()
, than to add all columns in the group by keys.