Tags: git, git-lfs

How does git LFS track and store binary data more efficiently than git?


I know that git LFS causes git to store a small text "pointer" file in place of each binary file, and git LFS then downloads the target binary file separately. In this way, git repos are smaller on the remote git server. But git LFS still has to store the binary files somewhere, so it seems to me that the local storage (after a git lfs pull) is no different, and that the combined sum of the remote git LFS server data plus the remote git data would still be about the same.
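(For reference, a Git LFS pointer file is just a few lines of text, something like the following; the hash and size here are made-up examples:)

    version https://git-lfs.github.com/spec/v1
    oid sha256:4665a5ea423c2713d436b5ee50593a9640e0018c1550b5a0002f74190d6caea8
    size 42337752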

What am I missing? How does git LFS efficiently track binary files?


Update (after additional learning since writing this question): don't use git lfs. I now recommend against using it.

See also:

  1. my comments below the answer I accepted
  2. my own answer I just added below
  3. My "Git LFS is evil" rant. Git LFS is slow, inefficient, online (as opposed to Git, which is offline), and evil. Read my full rant in my eRCaGuy_dotfiles repo.

I began with this question because I believed Git LFS was amazing and wonderful, and I wanted to know how it worked. Instead, I ended up realizing that Git LFS was the cause of my daily workflow problems and that I shouldn't use it or recommend it anymore.

Summary:

As I state here:

For personal, free GitHub accounts, it is way too limiting, and for paid, corporate accounts, it makes git checkout go from taking a few seconds to up to 3+ hours, especially for remote workers, which is a total waste of their time. I dealt with that for three years and it was horrible. I wrote a script to do a git lfs fetch once per night to mitigate this, but my employer refused to buy me a bigger SSD to give me enough space to do git lfs fetch --all once per night, so I still ran into the multi-hour-checkout problem frequently. It's also impossible to undo the integration of git lfs into your repo unless you delete your whole GitHub repo and recreate it from scratch.

Details:

I just discovered that the free version of git lfs has such strict limits that it's useless, and I'm now in the process of removing it from all my public free repos. See this answer (Repository size limits for GitHub.com) and search for the "git lfs" parts.

It seems to me that the only benefit of git lfs is that it avoids downloading a ton of data all at once when you clone a repo. That's it! That seems like a pretty minimal, if not useless, benefit for any repo which has a total content size (git repo + would-be git lfs repo) < 2 TB or so. All that using git lfs does is

  1. make git checkout take forever (literally hours) (bad),
  2. turn my normally-fast-and-offline git commands, like git checkout, into slow, online commands (bad), and
  3. act as another GitHub service to pay for (bad).

If you're trying to use git lfs to overcome GitHub's 100 MB max file size limit, like I was, don't! You'll run out of git lfs space almost instantly, in particular if anyone clones or forks your repo, as that counts against your limits, not theirs! Instead, "a tool such as tar plus split, or just split alone, can be used to split a large file into smaller parts, such as 90 MB each" (source), so that you can then commit those binary file chunks to your regular git repo.
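For example, here's a rough sketch of that split/reassemble workflow (my_big_file.bin is a placeholder name):

    # Split a large file into 90 MB chunks named my_big_file.bin.part_aa,
    # my_big_file.bin.part_ab, etc., then commit the chunks:
    split --bytes=90M my_big_file.bin my_big_file.bin.part_
    git add my_big_file.bin.part_*
    git commit -m "Add my_big_file.bin as 90 MB chunks"

    # After cloning, reassemble the original file (the shell glob expands
    # in the same lexicographic order that split used for its suffixes):
    cat my_big_file.bin.part_* > my_big_file.bin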

Lastly, the "solution" on GitHub to stop using git lfs and totally free up that space again is absolutely crazy nuts! You have to delete your entire repo! See this Q&A here: How to delete a file tracked by git-lfs and release the storage quota?

GitHub's official documentation confirms this (emphasis added):

After you remove files from Git LFS, the Git LFS objects still exist on the remote storage and will continue to count toward your Git LFS storage quota.

To remove Git LFS objects from a repository, delete and recreate the repository. When you delete a repository, any associated issues, stars, and forks are also deleted.

I can't believe this is even considered a "solution". I really hope they're working on a better fix for it.

Suggestion to employers and corporations considering using git lfs:

Quick summary: don't use git lfs. Buy your employees bigger SSDs instead. If you do end up using git lfs, buy your employees bigger SSDs anyway, so they can run a script to do git lfs fetch --all once per night while they are sleeping.

Details:

Let's say you're a tech company with a massive mono-repo that is 50 GB in size, plus 4 TB of binary files and data that you'd like to be part of the repo. Rather than giving your employees insufficient 500 GB to 2 TB SSDs and then resorting to git lfs, which makes git checkouts go from seconds to hours when done over home internet connections, get them bigger solid state drives instead! A typical tech employee costs you > $1000/day (5 working days/week x 48 working weeks/year x $1000/day = $240k/year, which is less than their salary + benefits + overhead costs). So, a $1000 8 TB SSD is totally worth it if it saves them hours of waiting and hassle! Examples to buy:

  1. 8TB Sabrent Rocket M.2 SSD, $1100
  2. 8TB Inland M.2 SSD, $900

Now they will hopefully have enough space to run git lfs fetch --all in an automated nightly script to fetch LFS contents for all remote branches to help mitigate (but not solve) this, or at least git lfs fetch origin branch1 branch2 branch3 to fetch the contents for the hashes of their most-used branches.
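Here is a minimal sketch of such a nightly script (the repo path and branch names are placeholders; adjust for your setup):

    #!/usr/bin/env bash
    # nightly_git_lfs_fetch.sh: pre-download Git LFS objects overnight so
    # that tomorrow's `git checkout` doesn't stall on the network.
    cd "$HOME/dev/my_repo" || exit 1   # placeholder repo path
    git fetch origin                   # update remote-tracking branches first
    git lfs fetch --all origin         # LFS objects for all refs (disk-hungry)
    # If you lack the disk space for --all, fetch just your key branches:
    # git lfs fetch origin main branch1 branch2

Then schedule it with cron (crontab -e), e.g. to run at 3 AM every night:

    0 3 * * * /path/to/nightly_git_lfs_fetch.sh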

See also

  1. Really insightful Q&A which also leans towards not using git lfs [even for remote repos]: Do I need Git LFS for local repos?
  2. What is the advantage of git lfs?
  3. My Q&A: How to resume git lfs post-checkout hook after failed git checkout
  4. My answer: How to shrink your .git folder in your git repo
  5. My Q&A: What is the difference between git lfs fetch, git lfs fetch --all, and git lfs pull?

Solution

  • When you clone a Git repository, you have to download a compressed copy of its entire history. Every version of every file is accessible to you.

    With Git LFS, the file data are not stored in the repository, so when you clone the repository it does not have to download the complete history of the files stored in LFS. Only the "current" version of each LFS file is downloaded from the LFS server. Technically, LFS files are downloaded during "checkout" rather than "clone."

    So Git LFS is not so much about storing large files efficiently as it is about avoiding downloading unneeded versions of selected files. That history is often not very interesting anyway, and if you need an older version, Git can connect to the LFS server and fetch it on demand. This is in contrast to regular Git, which lets you check out any commit offline.
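
    A rough sketch of the difference in behavior (the URL and commit hash are placeholders):

        # Cloning downloads the full Git history, but only the LFS objects
        # for the commit being checked out; older versions of LFS-tracked
        # files stay on the LFS server as pointer files in your history:
        git clone https://github.com/user/repo.git

        # Checking out an older commit makes Git LFS contact the server to
        # download that commit's LFS objects; offline, this step fails:
        git checkout <older-commit-hash>

        # To pre-download LFS objects for all refs (at a big disk cost):
        git lfs fetch --all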