git, cvs, svn2git, cvs2git

Drop history older than x on cvs2git Migration


We plan to migrate one of our last big CVS repositories to a Git repository.

For the migration we are using svn2git's cvs2git. Because this CVS repository has grown over ~12 years, it holds 31GB of data.

I couldn't find any solution to drop all history older than a specified period of time (2 years for example).

Do you know of any tool, command, or approach for this?

Thanks and greetings, Andreas

Solution as suggested by Dmitry Oksenchuk: After editing the grafts file, I wrote a Bash script to clean up the messed-up tags and branches:

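#!/bin/bash
# Deletes every tag and branch that does not contain the new root commit.
# Usage: pass the new root commit as the first argument.

NEW_ROOT_REF=$1

# Tags that contain the new root commit
git tag --contains "$NEW_ROOT_REF" | sort > TAGS_TO_KEEP.tmp
echo "Keep Tags:"
wc -l < TAGS_TO_KEEP.tmp

# Branches that contain the new root commit. for-each-ref prints bare names,
# without the "* " marker that "git branch" puts in front of the current branch.
git for-each-ref --format='%(refname:lstrip=2)' --contains "$NEW_ROOT_REF" refs/heads | sort > BRANCHES_TO_KEEP.tmp
echo "Keep Branches:"
wc -l < BRANCHES_TO_KEEP.tmp

git tag -l | sort > TAGS_ALL.tmp
echo "All Tags:"
wc -l < TAGS_ALL.tmp

git for-each-ref --format='%(refname:lstrip=2)' refs/heads | sort > BRANCHES_ALL.tmp
echo "All Branches:"
wc -l < BRANCHES_ALL.tmp

# Remove tags that are in the "all" list but not in the "keep" list
COUNTER=0
for drop in $(comm -23 TAGS_ALL.tmp TAGS_TO_KEEP.tmp); do
        git tag -d "$drop"
        COUNTER=$((COUNTER + 1))
done
echo "Dropped tags: $COUNTER"

# Remove branches that are in the "all" list but not in the "keep" list
COUNTER=0
for drop in $(comm -23 BRANCHES_ALL.tmp BRANCHES_TO_KEEP.tmp); do
        git branch -D "$drop"
        COUNTER=$((COUNTER + 1))
done
echo "Dropped branches: $COUNTER"

# Clean up
rm TAGS_ALL.tmp TAGS_TO_KEEP.tmp BRANCHES_ALL.tmp BRANCHES_TO_KEEP.tmp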

Solution

  • In a well-formed Git repo, the depth of the history is usually not an issue. The Linux repo has more than 500k commits and works fine. This year we migrated a ~15-year-old CVS repo (5GB of ,v files) to Git. The Git repo takes ~200MB and contains ~70k commits.

    We faced two major problems: binary files and the number of tags.

    Binary files

    In CVS, all revisions of binary files are stored on the server and only the current revision is transferred on checkout. So storing binary files in CVS is not a problem at all; you just need enough disk space on the server. With Git the situation is different. When you clone a Git repo, all revisions of binary files are transferred to your local clone. Even if a file doesn't exist in the most recent commit, all its historical revisions are in your local repo. We managed to shrink the Git repo from ~700MB to ~200MB by removing unnecessary binary files from the history. The important point here is to base your decision on the size of a file in Git, not in CVS. Git packs objects using zlib compression and delta compression, so the history of the same file can take a totally different amount of disk space in Git than in CVS. You can use the "Find large files" plugin in Git Extensions.
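For those who prefer the command line over the Git Extensions plugin, the following is a minimal sketch of how the largest blobs in the history could be listed by their packed size. The function names rank_blobs and largest_blobs are made up for this illustration, and the sketch assumes it is run inside the migrated repository, ideally after git gc so that objects are packed. The on-disk size of a delta-compressed object depends on how Git happened to pack it, so treat the numbers as a rough guide.

```shell
#!/bin/bash

# Hypothetical helper: reads "<type> <sha> <disk-size> <path>" lines on stdin
# and prints the N largest blobs as "<disk-size><TAB><path>", biggest first.
rank_blobs() {
    local count=${1:-20}
    while read -r type sha size path; do
        # Keep blobs only; commits, trees and tags are skipped.
        if [ "$type" = "blob" ]; then
            printf '%s\t%s\n' "$size" "$path"
        fi
    done | sort -k1,1nr | head -n "$count"
}

# Hypothetical wrapper: feeds every object reachable from any ref to rank_blobs.
# %(objectsize:disk) is the size the object occupies on disk in the repository.
largest_blobs() {
    git rev-list --objects --all \
        | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' \
        | rank_blobs "$1"
}
```

Calling largest_blobs 10 would print the ten largest blobs with their paths, which can then be checked against what is still needed before rewriting history.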

    Number of tags

    We have more than 20k build tags in the CVS repo. With that many tags, both Git Extensions and Source Tree work extremely slowly (especially when they need to load all the tags into a drop-down list). git push with Git 1.9.5 was also very slow because of a performance regression that was fixed in Git 2.3.0. Currently we keep only the build tags of the most recent 2 years (~7k tags) in Git and periodically archive older tags.
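How the old tags get archived is not described here, but finding the candidates could look roughly like the sketch below. The function names tags_older_than and old_tags are invented for this example, the two-year cutoff mirrors the period mentioned above, and date -d is a GNU date feature, so other platforms would need a different way to compute the cutoff.

```shell
#!/bin/bash

# Hypothetical helper: reads "<unix-timestamp> <tag-name>" lines on stdin and
# prints the names of the tags created before the given cutoff timestamp.
tags_older_than() {
    local cutoff=$1
    while read -r created name; do
        # Skip lines without a numeric creation date.
        case $created in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$created" -lt "$cutoff" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Hypothetical wrapper: lists all tags created more than two years ago.
# Assumption: GNU date is available for the "2 years ago" expression.
old_tags() {
    local cutoff
    cutoff=$(date -d '2 years ago' +%s)
    git for-each-ref --format='%(creatordate:unix) %(refname:short)' refs/tags \
        | tags_older_than "$cutoff"
}
```

The resulting list only names candidates. Where they are stored before being deleted with git tag -d (a separate archive repository, for instance) is left open.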

    Dropping old history

    If you still need it, it's much easier and safer to drop old history in Git than in CVS or during migration.

    1. Set the new root commit in the grafts file: echo %commit_hash% > .git/info/grafts (Git 2.18 and later deprecate the grafts file; git replace --graft %commit_hash% has the same effect there)
    2. Remove all the tags and branches that do not contain that commit (see git tag --contains and git branch --contains)
    3. Rewrite the commit graph: git filter-branch --tag-name-filter cat -- --all

    Or, you can also parse the git-dump.dat file (output of cvs2git in git fast-import format) and remove old commits, tags, and branches from there.
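Before editing the dump, it may help to see how many commits a given cutoff would actually remove. The sketch below only counts; it does not rewrite the stream. It assumes that the committer lines in git-dump.dat use the raw date format of git fast-import (a Unix timestamp followed by a UTC offset), and the function name count_commits_before is made up for this illustration. A commit message that happens to contain a line of the same shape would be miscounted.

```shell
#!/bin/bash

# Hypothetical helper: reads a fast-import stream on stdin and reports how many
# commits have a committer timestamp older than the given Unix timestamp.
# Assumption: committer lines end in "<unix-timestamp> <+/-HHMM>".
count_commits_before() {
    awk -v cutoff="$1" '
        /^committer .*> [0-9]+ [+-][0-9][0-9][0-9][0-9]$/ {
            total++
            # The timestamp is the second-to-last field of the line.
            if ($(NF - 1) + 0 < cutoff + 0) old++
        }
        END { printf "%d of %d commits are older than the cutoff\n", old, total }
    '
}
```

With GNU date, a call such as count_commits_before "$(date -d '2 years ago' +%s)" < git-dump.dat would report the share of commits that falls before the cutoff. Actually removing them additionally requires fixing up the parent references of the first commits that are kept, which this sketch does not attempt.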