Is there any way to preserve the load history when re-creating pipes (using CREATE OR REPLACE
)?
We do a lot of automated CI/CD on Snowflake, and sometimes pipes need to get re-created. When this happens, the load history is lost. Right now, the accepted workaround is a manual process, which doesn't work very well in an automated workflow.
This makes refreshing pipes dangerous, as duplicate data could be loaded. There is also a danger of losing some notifications/files while the pipe is being re-created -- with or without the manual process, automated or not (which is unacceptable, for obvious reasons).
I wish there was a simple parameter to enable this. Something like:
CREATE OR REPLACE PIPE my_pipe
PRESERVE_HISTORY = [ TRUE | FALSE ]
AS <copy_statement>
An alternative to this would be an option/parameter for pipes to share the load history with the table instead. This way, when the pipe is re-created (but the table isn't), the load history is preserved. If the table is dropped/truncated, then the load history for both the table and the pipe would be lost.
Another option would be the ability to modify pipes using an ALTER
command instead, but currently this is very limited. This way, we wouldn't even need to re-create the pipe in the first place.
EDIT: Tried automating the manual process with a procedure, but there's a still chance of losing notifications.
Since preserving the load history doesn't seem possible currently, I explored a few alternatives:
tl;dr: Here is the solution.
Deleting/Removing/Moving the files after ingestion
- Thank you @patrick_at_snowflake for the recommendation! 🙏
- This turned out to be a bit tricky to do with high reliability, because there's no simple way to tie the ingestion of files in Snowflake to their deletion/removal in could storage (i.e. life cycle management policies are not aware of whether or not the files were ingested successfully by Snowpipe).
- It could be possible to monitor the ingestion using a stream or COPY_HISTORY as a trigger for the deletion/removal of the files, but this is not simple (would probably require the use of an external function). Although there is also a native
REMOVE
command which could be used.
Refreshing a subset of the pipe
- Thanks @GregPavlik for the suggestion! 🙏
- The idea here would be to save the timestamp at which the initial pipe is paused/dropped. This timestamp could then be used to refresh the new pipe with a "safe" subset of the staged files (in order to avoid re-ingesting the same files and creating duplicates records).
- I think this is a great idea (my favorite so far), but I also had monitoring in mind and wanted to confirm that this would work, so I continued exploring alternatives for a while.
Replaying the missed notifications
- I asked a separate question about this here.
- The idea would be to simply replay the notifications that were neither processed by the initial pipe or the new pipe.
- However this doesn't seem possible either.
Monitoring the load of every staged file
- Finally, I arrived at this solution.
- This is the one I went with as it not only allows to refresh missing files, but also to monitor the loading of all staged files as a whole (no matter the source of the failure).
- I was already working on monitoring Snowpipe as part of a project, so this solution added another layer of monitoring. 👍