sqlsnowflake-cloud-data-platform

Get size of staged files from an external stage in snowflake


I am trying to get the number of files and sum of their sizes in the external stage in snowflake so that I can assign appropriate warehouse while doing copy operation. I could get number of files using

count(distinct METADATA$FILENAME)

select count(distinct METADATA$FILENAME)  from @STAGE_NAME
where metadata$file_last_modified > (select max(end_time) from <table_name>);

But I am not able to get the size of the files by querying the stage files. I see that LIST command gives me the size of the files, but I am not sure how to use it in conjunction with the above query. Can someone help me?


Solution

  • Snowflake external stage metadata does not directly provide the size information.

    Alternative approach:

    You can use LIST to get the size information like you mentioned.

    LIST @mystage
    

    returns

    Note: last_modified is in Varchar, but you want to do date comparison. so here is a UDF to convert the last_modified to a timestamp which can be used to compare date or timestamp.Sample SO answer for the UDF.

    create or replace function LAST_MODIFIED_TO_TIMESTAMP(LAST_MODIFIED string) 
    returns timestamp_tz
    as
    $$
        to_timestamp_tz(left(LAST_MODIFIED, len(LAST_MODIFIED) - 4) || ' ' || '00:00', 'DY, DD MON YYYY HH:MI:SS TZH:TZM')
    $$;
    

    Sample query:

    LIST @my_internal_stage;
    
    SELECT $1 AS file_name,count(DISTINCT $1) AS count_files, SUM(DISTINCT $2 ) AS size
    FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
    WHERE LAST_MODIFIED_TO_TIMESTAMP("last_modified") > '2025-01-01'
    GROUP  BY file_name ;
    

    Make sure to run SELECT just after the LIST, since LAST_QUERY_ID uses the last executed query and result_scan returns its resultset.

    And you can use distinct on count and sum for the files

    Above returns

    enter image description here