sql-serversample-datacopy-database-wizard

Need a 10% Copy of my SQL Server Database for Development work


I would like to take a 10% copy of the data in a production database, keeping its integrity and restore it into a new database.

Is there a method that allows this to be done in SQL Server? I have looked at creating an SSIS that exports the database schema and data, then putting a row sampling task to reduce the amount of data flowing into the new database but I wondered if there is a better method of doing this?


Solution

  • This is a hard topic, and it does take some work. There are a few approaches here, and I'll add something in addition to what HoneyBadger noted.

    First, I do know of Data Bee that subsets databases, but this might not be sufficient for you. You can use a trial of this to see if it works.

    Second, I generally recommend you get a curated data set for development areas (dev, test, UAT, etc) that contains the cases of the problem domains you need to solve. There are two ways of doing this. One is data virtualization software, which is what Redgate uses in SQL Clone, Delphix in their products, and a few others. This essentially copies production once and then shares that out to all devs/qa/etc. It reduces much of the time/storage required for getting copies. That can help.

    The other way is to build a dataset, and here's what I do for some clients.

    If you are working in mortgages, as HoneyBadger notes, you might need to account for fixed loans, variable APR, various terms, etc. Often the easiest way to do this is learn what cases your customers are asking about for new features, or what business analysts use and then copy those items to a new database. These will likely be relative few rows of tranasactional data and all lookup type data.

    This is also an ongoing process as you realize you've missed things. Save that database in VCS or a known location and use it as a source for your system. If your devs, your build system, your QA, all pull from here, you get a consistent set of data. You can augment this with random data, something like Redgate Data Generator, to help fill in some values.

    An additive approach is often much easier than subtractive. With that, also keep in mind that masking/obfuscating sensitive data is important. I would be more careful here as the GDPR and other legislation becomes enforced.

    Disclosure: I work for Redgate Software.