sql-servercdc

How to read data from CDC tables by parts?


I have 2 large tables with CDC. The idea is to transfer data to some new tables. So, I'd like to read data from the CDC tables by parts and do not read repeatedly what already was read.

How it can be done?


Solution

  • The fn_cdc_get_net_changes_* function, (or get all changes if that's your requirement) which is created by SQL Server for a table when you enable it for CDC, takes a from_LSN and to_LSN parameter. This is your sliding window. The function returns a __$start_lsn column. Use that to keep track of where you're up to, and slide your window forward the next time you call it.

    Per your comment - yes, after getting the results from the change function, you should store the largest __$start_lsn value you received. This would then become your from_lsn parameter on the next call...

    ... Well, almost. There are a few "gotchas" when it comes to calling these change functions. It's not terribly complex, but it's too much information to provide in a StackOverflow answer. Fortunately there is Microsoft documentation on the subject. See the section on Validating LSN boundaries.

    Additionally, note that the from and to parameters to the change functions represent a closed interval. In other words, the function will return data greater than or equal to the from_lsn value, and less than or equal to the to_lsn parameter.

    So, suppose your algorithm looks like this:

    1. Read the largest LSN we got last time from persistent storage, and write that value into @variable.
    2. Bounds check @variable (as per the validating boundaries documentation)
    3. Call the change function, using @variable as the from_lsn parameter value.
    4. Get the largest __$start_lsn from the result set and save that back to persistent storage.

    If you do this you will get some of the same rows you got last time - ie, any rows whose LSN is equal to the value in @variable.

    What you should therefore do instead is this:

    1. Read the largest LSN we got last time from persistent storage, and write that value into @variable.
    2. Use fn_cdc_increment_lsn to increment @variable.
    3. Bounds check @variable (as per the validating boundaries documentation)
    4. Call the change function, using @variable as the from_lsn parameter value.
    5. Get the largest __$start_lsn from the result set and save that back to persistent storage.

    To get the value you should use for the to_LSN parameter, you can use fn_cdc_get_max_lsn