git · version-control · git-history

Git: How are copies of a file with a shared history handled?


I backup my CSS userstyles to a git repo like so:

❯ fd                                                                                            
stylus-2021-05-18.json
stylus-2021-05-20.json

These backup files are obviously mostly the same; stylus-2021-05-18.json is essentially an earlier snapshot of stylus-2021-05-20.json. How does Git handle this?

Obviously, I could just rename the files to stylus.json and let Git handle the version control completely, but I was wondering whether Git is smart enough to deal with these near-duplicate files automatically.


Solution

  • TL;DR

    Commits always store full file snapshots, but git gc repacks objects into packfiles, which store similar blobs efficiently using delta compression, whether or not the blobs come from the same file.

    Intro

    My understanding that Git stores "diffs" rather than full files was all wrong. After doing some reading and some experiments, I see that it doesn't matter whether you modify a file or create a modified copy of it: when you commit the change, Git creates a brand-new blob every time. (The one exception is a byte-identical copy: blobs are content-addressed, so an exact duplicate reuses the existing blob.)
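    A quick way to see both behaviors is git hash-object, which prints the blob ID Git would assign to a file's contents (a small sketch; the file names are arbitrary):

```shell
# Byte-identical files map to the same blob ID (content addressing)
printf 'some text\n' > a.txt
cp a.txt b.txt
git hash-object a.txt    # same ID...
git hash-object b.txt    # ...as this one
# Any modification, however small, yields a brand-new blob ID
printf '!\n' >> b.txt
git hash-object b.txt    # different ID
```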

    But that's pretty inefficient, because you end up with many near-identical copies of the same text, with only small differences between blobs. That problem is solved when Git creates packs. I don't fully understand how Git chooses which objects to pack together, but inside a pack it stores some blobs whole and others as deltas against similar blobs.

    Experiment

    # in a fresh repository (git init), create a big file and commit it
    seq 1 1000000 | shuf > bigfile
    git add bigfile
    git commit -m'bigfile'
    

    At this point, find .git -ls shows me one big blob (3.5MB) storing this 6.9MB file.
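    You can confirm what that blob holds by asking Git directly. Note that git cat-file -s reports the uncompressed content size (the 6.9MB), while the 3.5MB on disk is the zlib-compressed loose object:

```shell
# Resolve the committed file to its blob ID, then inspect its type and size
blob=$(git rev-parse HEAD:bigfile)
git cat-file -t "$blob"   # prints: blob
git cat-file -s "$blob"   # uncompressed content size in bytes
```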

    # modify the big file and commit the change
    echo change >> bigfile
    git commit -m'modify bigfile' bigfile
    

    At this point, find .git -ls shows me two big blobs, each about 3.5MB. Seems pretty inefficient to me, but read on...

    # Add another big file, similar to the first one, and commit it
    cp bigfile bigfile2
    echo some trivial change >> bigfile2
    git add bigfile2
    git commit -m'bigfile2'
    

    Things don't get better: find .git -ls shows me three big blobs, each about 3.5MB!
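    Instead of eyeballing find .git -ls, git count-objects summarizes the loose (unpacked) objects and their disk usage:

```shell
# Report the number of loose objects and their total on-disk size
git count-objects -vH
```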

    Now, at some point (for example when you push) Git may repack your repository, but we can force that to happen right now by running git gc. That's not just garbage collection, as I had incorrectly thought: it also packs objects into packfiles. After running git gc, find .git -ls reports a single pack of about 3.2MB. So my three big blobs were identified as similar, delta-compressed against each other, and stored efficiently. This is called delta compression.
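    To see exactly which objects ended up as deltas, you can list the pack's contents with git verify-pack; deltified entries carry two extra columns, the delta-chain depth and the ID of the base object they are stored as a delta against:

```shell
# Columns: SHA-1, type, size, size-in-pack, offset [, depth, base-SHA-1]
git verify-pack -v .git/objects/pack/pack-*.idx
```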
