Is it possible to get a list of all git object hashes of blobs which have been added to the repository by a given commit hash using the git command line tools?
I already tried achieving this with the git plumbing tool git-diff-tree. Maybe it's the wrong approach. Below is the best result I could get so far, but the (very long) man page didn't help me work out exactly how the output has to be interpreted.
$ git diff-tree --no-commit-id 2b53d04dbb7cd35d030ddc59b13c0836a87daeb7
:100644 100644 03f15b592c7d776da37e3d4372c215b14ff8820f 6e0ed0b1ed56e9a35a3be52a9de261c8ffcccae8 M file1.ts
:100644 100644 b5083bdb9c31005ebd16835a0f49dc848d3f387a 4b7f9e6624a66fec0510d76823303017e224c9d7 M file2.ts
:100644 100644 368d64862e6aa2a0110f201c8a5193d929e2956d 0e51626a9866a8a3896489f497fbd745a5f4a9f2 M file3.ts
:040000 040000 c332b1e576af0dbb93cc875106bc06c3de6b74c8 f7f3478a9b0eaac85719699d97e323563a1b102b M some_folder
Do the first and second git object blob hashes show the old and new objects for the modified file respectively? In the worst case I could fetch that information by parsing the output.
My primary goal was to find a command line which works as below:
$ git <command> <option1> <option2> 368d64862e6aa2a0110f201c8a5193d929e2956d
6e0ed0b1ed56e9a35a3be52a9de261c8ffcccae8
4b7f9e6624a66fec0510d76823303017e224c9d7
0e51626a9866a8a3896489f497fbd745a5f4a9f2
Edit below in response to @torek
In response to @torek's answer, I want to be clearer about my intentions, because he is absolutely right to point out that new isn't necessarily new.
I am planning to use git rev-list --reverse <branch> to get a list of all commits on that branch in commit order. Then I want to visit every commit in this order and collect, per commit, the blob hashes seen for the first time on this branch.
The end result should be something like the following:
C:368d64862e6aa2a0110f201c8a5193d929e2956d
B:03f15b592c7d776da37e3d4372c215b14ff8820f
B:4b7f9e6624a66fec0510d76823303017e224c9d7
B:c332b1e576af0dbb93cc875106bc06c3de6b74c8
C:5521a02ce1bc4f147d0fa39a178512476764dd66
B:e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e
B:adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
C:a3db5c13ff90a36963278c6a39e4ee3c22e2a436
B:4888920a568af4ef2d2f4866e75b4061112a39ea
.
.
.
C: commit
B: blob
If this isn't easily done, it would be OK to do two passes. In the first pass, blobs can be mentioned multiple times in different commits, for the reasons you have pointed out.
I could then do a second pass, piping the file through awk '!x[$0]++', which removes any duplicates. This wouldn't be very efficient but would get the result I want.
I hope I made my intentions clear now. Any thoughts?
Is it possible to get a list of all git object hashes of blobs which have been added to the repository by a given commit hash using the git command line tools?
Yes and/or no: you have to define precisely what you mean by added to the repository.
Suppose, for instance, that I start with a totally empty repository:
$ mkdir foo && cd foo && git init
Initialized empty Git repository in ...
Now I create README.md, git add it, and commit:
$ echo for testing > README.md
$ git add README.md
$ git commit -m initial
[master (root-commit) 19278e9] initial
1 file changed, 1 insertion(+)
create mode 100644 README.md
README.md is a blob and its hash ID is:
$ git rev-parse HEAD:README.md
43b18adf702be62761e3affd85c4c3ee5c396be7
Later, I write a new file:
$ echo for testing > newfile.txt
$ git add newfile.txt
$ git commit -m 'add new file'
[master 5521a02] add new file
1 file changed, 1 insertion(+)
create mode 100644 newfile.txt
If we look at this commit, we'll see the new file. If we look at it with git show --raw, we'll see it in the git diff-tree format:
$ git show --raw
commit 5521a02ce1bc4f147d0fa39a178512476764dd66 (HEAD -> master)
Author: Chris Torek <chris.torek gmail.com>
Date: Fri Oct 18 14:10:55 2019 -0700
add new file
:000000 100644 0000000 43b18ad A newfile.txt
This seems like a blob that's been added to the repository, but wait, there's something awfully familiar about 43b18ad:
$ git rev-parse HEAD:newfile.txt
43b18adf702be62761e3affd85c4c3ee5c396be7
Yes, that's the same hash ID as README.md:
$ git ls-tree -r HEAD
100644 blob 43b18adf702be62761e3affd85c4c3ee5c396be7 README.md
100644 blob 43b18adf702be62761e3affd85c4c3ee5c396be7 newfile.txt
It's one blob, but two files. Is that really newly added?
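To spot this situation in a given commit, something like the following sketch could list blob IDs that appear under more than one path. It assumes a POSIX shell with awk, sort, and uniq available; shared_blobs is a made-up helper name, not a Git command.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): print every blob ID that is
# reachable under two or more paths in the tree of the given commit.
shared_blobs() {
    # ls-tree -r prints "<mode> <type> <hash><TAB><path>" for each entry.
    git ls-tree -r "${1:-HEAD}" |
        awk '$2 == "blob" { print $3 }' |   # keep blob hashes only
        sort |
        uniq -d                             # keep hashes seen more than once
}
```

Run as shared_blobs HEAD in the example repository, this should print the single hash shared by README.md and newfile.txt.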
If your answer to the above is "yes, it's new, even though it's old", that might answer this second question. If your answer is "no, it's not new", what about a commit that reintroduces a blob that was removed in a previous commit? Or, if two commits I and J made in parallel on two branches:
          I   <-- br1
         /
...--G--H
         \
          J   <-- br2
both introduce the same blob, which one actually adds it as all-new, and which one merely duplicates the other?
In general, if you want all new, you'll have to walk the entire commit graph, inspecting each commit's tree (see git ls-tree -r), and select which commits first introduce a blob object ID that is not already in some earlier (parent-wise and/or date-and-time-wise) commit object. If you want "newly added as a new file in this commit", inspect the commit and its parent(s), perhaps using git diff-tree or similar. Note that an all-new file has an all-zero mode and hash in its parent, and a status letter of A (added), while a file modified from its parent has a status letter of M (modified) and two nonzero hashes. A file nominally deleted (one that existed in the parent but no longer does in the child) has a status letter of D (deleted). If you enable rename detection, you'll get R statuses and similarity index values; you may want to disable this, or at least force the similarity testing to 100%.
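As a rough sketch of both approaches, assuming a POSIX shell and awk: new_blobs_in_commit and first_seen_blobs are names invented here for illustration, not Git commands. The first only answers "new relative to the parent", so a hash it prints may still exist elsewhere in history. The second walks the history parents-first and prints each blob the first time it shows up, in the C:/B: layout from the question.

```shell
#!/bin/sh
# Hypothetical helper: new-side blob hashes of entries that a commit adds (A)
# or modifies (M) relative to its parent.
#   -r                recurse into subtrees, so blobs rather than trees are listed
#   --root            show a root commit as a diff against the empty tree
#   --no-renames      keep rename detection off, so no R statuses appear
#   --diff-filter=AM  keep added and modified entries only
# Merge commits need extra care (see -m and -c in the diff-tree manual).
new_blobs_in_commit() {
    git diff-tree -r --root --no-renames --no-commit-id --diff-filter=AM "$1" |
        # Raw line: ":<oldmode> <newmode> <oldhash> <newhash> <status><TAB><path>"
        # Field 4 is the new hash; mode 160000 is a submodule link, not a blob.
        awk '$2 != "160000" { print $4 }'
}

# Hypothetical helper: walk history oldest-first and print "C:<commit>" followed
# by "B:<blob>" for every blob not seen in any commit printed earlier.
first_seen_blobs() {
    # --topo-order with --reverse makes sure parents come before children.
    git rev-list --reverse --topo-order "$1" |
    while read -r commit; do
        echo "C:$commit"
        git ls-tree -r "$commit" | awk '$2 == "blob" { print "B:" $3 }'
    done |
    # Always pass commit lines through; pass a blob line only the first time.
    awk '/^C:/ { print; next } !seen[$0]++'
}
```

Called as first_seen_blobs master, the second function lists every commit, including those that introduce no new blob. It runs git ls-tree once per commit, so expect it to be slow on large histories; that trade-off was chosen here for simplicity.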