Tags: azure-devops, tfs, azure-pipelines, migration, dacpac

Migrating from TFS to VSTS - Reducing TPC Database Size


We are planning a migration from an on-premises TFS instance to VSTS very shortly. Ahead of the migration, we ran the prerequisite validation task and obtained the following report on the TPC database size:

"The database is currently 191GBs. This is above the recommended size of 150GBs to use the DACPAC import method. The largest table size is currently 172GBs. This is above the recommended size of 20GBs to use the DACPAC import method.

Validation completed 'Validate Collection Database Size' with result Warning, message The largest table size is currently 172GBs. This is above the recommended size of 20GBs to use the DACPAC import method."

We are therefore keen to reduce the size of the TPC database and have two main considerations:

  1. Shrink the Database and generate the DACPAC from the resultant output.

  2. Delete any of the following objects which are unused or redundant:

    a) Older Workspaces
    b) Build Results
    c) Redundant Team Projects
    d) Unused Files
    e) Test Attachments created during test runs
    f) XAML Builds

We would therefore appreciate some advice or feedback on the pros and cons of each approach, and which would be recommended.


Solution

  • The most up-to-date version of this guidance is on my blog: Clean up your Team Project Collection prior to migrating to VSTS. The guidance is still pretty much spot-on for recent versions of Azure DevOps Server.


    Given that you need to reduce your largest table by over 150GB, I wonder whether DACPAC is ever going to be an option. That said, it's always a good idea to clean up your TFS instance. Your first option (shrinking) won't help much until you've managed to strip enough data to actually get any benefit out of a shrink.

    Your identified actions would indeed help; most are already documented here. Queries that can help detect where your space is allocated can also be found in this recent support ticket.

    Delete old workspaces

    Deleting workspaces and shelvesets can reduce your migration and upgrade times considerably. Either use the tf command line or leverage a tool like TFS Sidekicks to identify and delete them; a command-line sketch follows below.
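
    A minimal command-line sketch, assuming a placeholder collection URL and workspace/shelveset names (substitute the ones reported for your instance):

    REM List all workspaces and shelvesets in the collection
    tf workspaces /owner:* /computer:* /format:brief /collection:http://tfs:8080/tfs/DefaultCollection
    tf shelvesets /owner:* /collection:http://tfs:8080/tfs/DefaultCollection

    REM Delete a specific workspace or shelveset ("name;owner")
    tf workspace /delete "OldWorkspace;DOMAIN\jane" /collection:http://tfs:8080/tfs/DefaultCollection
    tf shelve /delete "OldShelveset;DOMAIN\jane" /collection:http://tfs:8080/tfs/DefaultCollection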

    Build results

    Not just the build results: the actual build records, though often overlooked, can take up a considerable amount of data. Use tfsbuild destroy (XAML) to permanently delete the build records. In the past, I've encountered clients who had 1.8 million "hidden" builds in their database; removing them shaved off a considerable amount of data. These records were kept around for the warehouse.
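
    A hypothetical invocation; the collection URL and definition spec are placeholders, and the exact switches vary by TFS version, so check tfsbuild help destroy first:

    REM Permanently destroy the build records for one XAML build definition
    tfsbuild destroy /collection:http://tfs:8080/tfs/DefaultCollection /builddefinition:"MyProject\MyXamlDefinition"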

    Old team projects

    Of course, destroying old team projects can give back a lot of data. Anything you don't need to send to Azure helps. You could also consider splitting the collection and leaving the old projects behind. That gives you the option to detach that collection and store it somewhere, should you ever need that data again.
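
    A sketch using TFSDeleteProject (typically found in the TFS Tools folder on the application tier); the collection URL and project name are placeholders:

    REM Permanently delete a team project and its data from the collection
    TFSDeleteProject /collection:http://tfs:8080/tfs/DefaultCollection ObsoleteProject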

    Redundant files

    Deleted branches are a very common hidden size hog. When you delete things in TFVC, they are not actually deleted, just hidden. Finding deleted files, and especially old development or feature branches, can give you back a lot of data. Use tf destroy to get rid of them (see the sketch after the next paragraph).

    You may also want to look for checked-in NuGet package folders; those can quickly rack up a lot of space as well.
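
    A minimal sketch with placeholder paths and collection URL; tf destroy is irreversible, so run with /preview first:

    REM Find deleted items that are still taking up space
    tf dir $/MyProject /recursive /deleted /collection:http://tfs:8080/tfs/DefaultCollection

    REM Preview, then destroy; /startcleanup requests immediate content pruning
    tf destroy "$/MyProject/Dev/OldFeatureBranch" /preview /collection:http://tfs:8080/tfs/DefaultCollection
    tf destroy "$/MyProject/Dev/OldFeatureBranch" /startcleanup /collection:http://tfs:8080/tfs/DefaultCollection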

    Test Attachments

    Oh yes: test attachments can grow like crazy. Depending on your TFS version, either use the built-in test attachment cleanup features or the Test Attachment Cleaner from the TFS Power Tools.
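
    A sketch of a Test Attachment Cleaner run; the parameter names and settings file are assumptions from memory (sample settings files ship with the tool), so verify against the help output of your installed version:

    REM Preview which test attachments would be removed, then rerun in delete mode
    tcmpt.exe attachmentcleanup /collection:http://tfs:8080/tfs/DefaultCollection /teamproject:MyProject /settingsfile:settings.xml /mode:preview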

    XAML Builds

    The build definitions themselves won't take up much database space, but the build results may; those were covered in the Build results section above.

    Git Repositories

    You may have data in your Git repositories that is no longer accessible due to force pushes or deleted branches, and some data in Git could be packed more efficiently. To clean your repositories you have to clone them locally, clean them up, delete the remote repo from TFS, and push the cleaned copy to a new repository (you can use the same name as the old one). Doing this will break references with existing build definitions, which you will have to fix up. While you're at it, you could also run the BFG Repo-Cleaner and convert the repositories to Git-LFS to handle large binary files in your repositories more elegantly.

    git clone --mirror <<repo>>
    cd <<repo>>.git
    # Optionally run BFG Repo-Cleaner at this point
    # Expire all reflog entries, then garbage-collect and repack aggressively
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git repack -adf
    # Delete and recreate the remote repository with the same name, then:
    git push origin --all
    git push origin --tags
    

    Work item (attachments)

    Work items can gather up a considerable amount of data, especially when people start attaching large files to them. You can use witadmin destroywi to delete work items with unreasonably large attachments. To retain a work item but delete its attachments, remove the attachments from the current work item and then clone it; after cloning, destroy the old work item so the attachments can be cleaned up.
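
    A minimal sketch; the collection URL and work item ID are placeholders, and destroywi is irreversible:

    REM Permanently destroy a work item, including its attachments and history
    witadmin destroywi /collection:http://tfs:8080/tfs/DefaultCollection /id:12345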

    Old work items that you no longer need (say, the sprint items from 6 years ago) can also be deleted. My colleague Rene has a nice tool that allows you to bulk-destroy work items by first creating the appropriate work item query.

    Be sure to run the cleanup jobs

    TFS often doesn't directly prune data from the database; in many cases it just marks stuff as deleted for later processing. To force the cleanup to happen immediately, run the following stored procedures against your Project Collection database:

    EXEC prc_CleanupDeletedFileContent 1
    -- You may have to run the following command multiple times: the last
    -- parameter is the batch size; if there are more items to prune than the
    -- number passed in, you will have to run it again
    EXEC prc_DeleteUnusedFiles 1, 0, 100000
    

    Other useful queries

    To identify how much data is stored in each section, there are a few useful queries you can run. The actual query depends on your TFS version, but since you're preparing for migration I suspect you're on TFS 2017 or 2018 at the moment.

    Find the largest tables:

    SELECT TOP 10
        o.name,
        SUM(p.reserved_page_count) * 8.0 / 1024 AS SizeInMB,
        SUM(CASE
                WHEN p.index_id <= 1 THEN p.row_count
                ELSE 0
            END) AS Row_Count
    FROM sys.dm_db_partition_stats p
    JOIN sys.objects o
        ON p.object_id = o.object_id
    GROUP BY o.name
    ORDER BY SUM(p.reserved_page_count) DESC
    

    Find the largest content contributors:

    SELECT Owner =
        CASE
            WHEN OwnerId = 0   THEN 'Generic'
            WHEN OwnerId = 1   THEN 'VersionControl'
            WHEN OwnerId = 2   THEN 'WorkItemTracking'
            WHEN OwnerId = 3   THEN 'TeamBuild'
            WHEN OwnerId = 4   THEN 'TeamTest'
            WHEN OwnerId = 5   THEN 'Servicing'
            WHEN OwnerId = 6   THEN 'UnitTest'
            WHEN OwnerId = 7   THEN 'WebAccess'
            WHEN OwnerId = 8   THEN 'ProcessTemplate'
            WHEN OwnerId = 9   THEN 'StrongBox'
            WHEN OwnerId = 10  THEN 'FileContainer'
            WHEN OwnerId = 11  THEN 'CodeSense'
            WHEN OwnerId = 12  THEN 'Profile'
            WHEN OwnerId = 13  THEN 'Aad'
            WHEN OwnerId = 14  THEN 'Gallery'
            WHEN OwnerId = 15  THEN 'BlobStore'
            WHEN OwnerId = 255 THEN 'PendingDeletion'
        END,
        SUM(CompressedLength) / 1024.0 / 1024.0 AS BlobSizeInMB
    FROM tbl_FileReference AS r
    JOIN tbl_FileMetadata AS m
        ON r.ResourceId = m.ResourceId
        AND r.PartitionId = m.PartitionId
    WHERE r.PartitionId = 1
    GROUP BY OwnerId
    ORDER BY BlobSizeInMB DESC
    

    If file containers are the issue:

    SELECT CASE
               WHEN c.Container = 'vstfs:///Buil' THEN 'Build'
               WHEN c.Container = 'vstfs:///Git/' THEN 'Git'
               WHEN c.Container = 'vstfs:///Dist' THEN 'DistributedTask'
               ELSE c.Container
           END AS FileContainerOwner,
           SUM(fm.CompressedLength) / 1024.0 / 1024.0 AS TotalSizeInMB
    FROM (SELECT DISTINCT LEFT(c.ArtifactUri, 13) AS Container,
                 fr.ResourceId,
                 ci.PartitionId
          FROM tbl_Container c
          INNER JOIN tbl_ContainerItem ci
              ON c.ContainerId = ci.ContainerId
              AND c.PartitionId = ci.PartitionId
          INNER JOIN tbl_FileReference fr
              ON ci.fileId = fr.fileId
              AND ci.DataspaceId = fr.DataspaceId
              AND ci.PartitionId = fr.PartitionId) c
    INNER JOIN tbl_FileMetadata fm
        ON fm.ResourceId = c.ResourceId
        AND fm.PartitionId = c.PartitionId
    GROUP BY c.Container
    ORDER BY TotalSizeInMB DESC