powerbi etl presto amazon-athena incremental-load

Power BI Athena Incremental Refresh

I have been successfully using Power BI’s incremental refresh daily with a MySQL data source. However, I can't get this configured with AWS Athena, because seemingly the latter interprets the values in the required parameters RangeStart and RangeEnd as strings. Since the data source is around 50 million rows I’d rather avoid querying this from scratch every day.

In this video from Guy in a Cube, you can clearly see that the query sent by Power BI to Azure has a convert to datetime2 function - something like this is presumably missing for Athena/Presto, which needs the type constructor TIMESTAMP in order to do datetime comparisons (https://stackoverflow.com/a/38041684/3675679), and of course incremental refresh must be based on datetime fields. I am using the datetime field adv_date for the incremental load.

Here is what the M query looks like in Power Query Editor:

= Table.SelectRows(#"Removed Columns1", each [adv_date] >= RangeStart and [adv_date] < RangeEnd)

And here is the resultant error message in Athena:

Your query has the following errors:SYNTAX_ERROR: line 1:1: Incorrect number of parameters: expected 2 but found 0

Whilst this is how Athena interprets the query:

    select "col1", "col2", "adv_date" 
    from "AwsDataCatalog"."test"."test_table" 
    where "adv_date" >= ? and "adv_date" < ?

I have contacted Power BI support without success. Does anyone have a workaround for this by any chance? Happy to provide more info if needed.

Solution

I think you are trying to fix Filtered Rows step, but might be able to achieve incremental load by fixing Step 1 - Source (running actual direct query to Athena)

Pasting my answer on this from another question thread:

I think I have managed to achieve the "Incremental Load" in Power BI using Athena. This (still) does not allow you to view Native query but you can still make Power BI manipulate the direct query to implement it.

To avoid full scan of S3 data in Athena - you have to enable Partitions in your dataset. Without going off topic, once you partition the S3 data via Athena you can then pin point the datasets with days/months/years without scanning your whole dataset.

Once you do that, you can achieve the Incremental Load by running Direct Queries as mentioned in this video (20:00 onwards) and achieve resource-efficient query execution.

The final query will look something like -

Odbc.Query("dsn=Simba Athena", 
    "SELECT * FROM tablename 
    WHERE year >= " & DateTime.ToText(RangeStart, "yyyy") & "
AND month >= " & DateTime.ToText(RangeStart, "MM") & "
AND day >= " & DateTime.ToText(RangeStart, "dd") & "
AND year <= " & DateTime.ToText(RangeEnd, "yyyy") & "
AND month <= " & DateTime.ToText(RangeEnd, "MM") & "
AND day <= " & DateTime.ToText(RangeEnd, "dd") & "
")

EDIT #1: OR simply

    Odbc.Query("dsn=Simba Athena", 
        "SELECT * FROM tablename 
        WHERE dt >= '" & DateTime.ToText(RangeStart, "yyyy/MM/dd") & "'
        AND dt <= '" & DateTime.ToText(RangeEnd, "yyyy/MM/dd") & "'
    ")