sqlsnowflake-cloud-data-platform

Get identity of row inserted in Snowflake Datawarehouse


If I have a table with an auto-incrementing ID column, I'd like to be able to insert a row into that table, and get the ID of the row I just created. I know that generally, StackOverflow questions need some sort of code that was attempted or research effort, but I'm not sure where to begin with Snowflake. I've dug through their documentation and I've found nothing for this.

The best I could do so far is try result_scan() and last_query_id(), but these don't give me any relevant information about the row that was inserted, just confirmation that a row was inserted.

I believe what I'm asking for is along the lines of MS SQL Server's SCOPE_IDENTITY() function.

Is there a Snowflake equivalent function for MS SQL Server's SCOPE_IDENTITY()?

EDIT: for the sake of having code in here:

CREATE TABLE my_db..my_table
(
    ROWID INT IDENTITY(1,1),
    some_number INT,
    a_time TIMESTAMP_LTZ(9),
    b_time TIMESTAMP_LTZ(9),
    more_data VARCHAR(10)
);
INSERT INTO my_db..my_table
(
    some_number,
    a_time,
    more_data
)
VALUES
(1, my_time_value, some_data);

I want to get to that auto-increment ROWID for this row I just inserted.


Solution

  • NOTE: The answer below can be not 100% correct in some very rare cases, see the UPDATE section below

    Original answer

    Snowflake does not provide the equivalent of SCOPE_IDENTITY today.

    However, you can exploit Snowflake's time travel to retrieve the maximum value of a column right after a given statement is executed.

    Here's an example:

    create or replace table x(rid int identity, num int);
    insert into x(num) values(7);
    insert into x(num) values(9);
    -- you can insert rows in a separate transaction now to test it
    select max(rid) from x AT(statement=>last_query_id());
    ----------+
     MAX(RID) |
    ----------+
     2        |
    ----------+
    

    You can also save the last_query_id() into a variable if you want to access it later, e.g.

    insert into x(num) values(5);
    set qid = last_query_id();
    ...
    select max(rid) from x AT(statement=>$qid);
    

    Note - it will be usually correct, but if the user e.g. inserts a large value into rid manually, it might influence the result of this query.

    UPDATE

    Note, I realized the code above might rarely generate incorrect answer.

    Since the execution order of various phases of a query in a distributed system like Snowflake can be non-deterministic, and Snowflake allows concurrent INSERT statements, the following might happen

    Note that T1 is later than T2. Now, when we try to do SELECT ... AT(statement=>Q1), we will see the state as-of T1, including all changes from statements before, hence including the value 2 from Q2. Which is not what we want.

    The way around it could be to add a unique identifier to each INSERT (e.g. from a separate SEQUENCE object), and then use a MAX.

    Sorry. Distributed transactions are hard :)