I back up my CSS userstyles to a Git repo like so:
❯ fd
stylus-2021-05-18.json
stylus-2021-05-20.json
These backup files are obviously mostly the same, i.e., stylus-2021-05-18.json is an earlier version of stylus-2021-05-20.json. How is this handled by git?
Obviously, I could just rename the files to stylus.json and let git handle the version control completely, but I was wondering if git is smart enough to work with these files automatically.
Commits are always created as full file snapshots, but garbage collection creates packfiles, which store similar blobs efficiently using delta compression, whether or not they come from the same file.
My understanding that Git stores "diffs" rather than full files was all wrong. After some reading and some experiments, I see that it doesn't matter whether you modify a file or create a slightly different copy of it: when you commit the change or the new file, Git creates a brand new blob for the new content, every time.
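One detail about "a brand new blob": Git names a blob by a hash of its content, not by its file name. Here is a minimal sketch of that, assuming a scratch repository in a temporary directory and made-up file names and contents (none of these come from the question):

```shell
# Scratch repository in a temporary directory (file names are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q .

printf 'body { color: red; }\n'  > stylus-a.json
cp stylus-a.json stylus-b.json            # byte-identical copy
printf 'body { color: blue; }\n' > stylus-c.json

# git hash-object prints the blob ID Git would assign to each file's content
id_a=$(git hash-object stylus-a.json)
id_b=$(git hash-object stylus-b.json)
id_c=$(git hash-object stylus-c.json)

echo "a: $id_a"
echo "b: $id_b"
echo "c: $id_c"
```

The first two IDs should match, because the content is byte-identical, so an exact copy under a new name costs no extra blob. Any change to the content, however small, produces a different ID and therefore a full new blob.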
But that's pretty inefficient, because you end up with many nearly identical copies of the same text, differing only by small changes. That problem gets fixed when Git creates packs. I don't fully understand how Git searches for things to pack, but inside a pack, it stores some blobs whole and others as deltas against other blobs.
# create a big file and commit it
seq 1 1000000 | shuf > bigfile
git add bigfile
git commit -m'bigfile'
At this point, find .git -ls shows me one big blob (3.5MB) storing this 6.9MB file.
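The blob is smaller than the file because Git zlib-compresses each loose object on its own. A rough way to compare the two sizes directly, assuming a fresh scratch repository and a smaller, unshuffled file than the one above (so the ratio will differ from the 3.5MB/6.9MB seen here):

```shell
# Scratch repository; bigfile here is a smaller, unshuffled stand-in
repo=$(mktemp -d)
cd "$repo"
git init -q .

seq 1 100000 > bigfile
git add bigfile                      # staging already writes the loose blob

blob=$(git rev-parse :bigfile)       # ID of the blob staged for bigfile
logical=$(git cat-file -s "$blob")   # uncompressed content size in bytes

# Loose objects live at .git/objects/<first 2 hex chars>/<remaining chars>
objfile=".git/objects/$(echo "$blob" | cut -c1-2)/$(echo "$blob" | cut -c3-)"
ondisk=$(wc -c < "$objfile")         # compressed size on disk in bytes

echo "content: $logical bytes, loose object: $ondisk bytes"
```

This compression works per object, so it does nothing about two blobs that are nearly identical to each other.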
# modify the big file and commit the change
echo change >> bigfile
git commit -m'modify bigfile' bigfile
At this point, find .git -ls shows me two big blobs, each about 3.5MB. Seems pretty inefficient to me, but read on...
# Add another big file, similar to the first one, and commit it
cp bigfile bigfile2
echo some trivial change >> bigfile2
git add bigfile2
git commit -m'bigfile2'
Things don't get better: find .git -ls shows me three big blobs, each about 3.5MB!
Now, at some point when you push, Git might pack your repository, but we can force that to happen right now: run git gc. That's not just garbage collection, as I incorrectly thought; it also creates packs. After running git gc, find .git -ls now reports a single pack of about 3.2MB. So my three big blobs were identified as similar and stored efficiently, with two of them recorded as small deltas. I think this is called "delta compression".