I just read some blogs about the G1 algorithm.
The use of the remembered set is confusing to me.
Here is what I think:
Since we can use DFS to walk through every reference from the GC roots, why do we need a remembered set?
All the blogs say the reason we use a remembered set is so that we don't need to check every region to see whether it holds an object that is referenced from the GC roots.
You need to understand what a Card Table is first, IMO. How do you scan only the young generation regions and clean them, if there are references from the old generation back to the young one? You need to "track" exactly where these connections are present, so that while scanning the young generation you can clean it without breaking the heap.
Think about it: you can't mark for removal an Object A that is in the young generation now, if there is a reference B to it from the old generation. But remember that right now you are in a young collection only. So to track these "connections" a Card Table is implemented. Each entry in this card table (in HotSpot, a byte covering a 512-byte card by default) says that a certain portion of the old generation is "dirty", meaning: also scan that portion of the old generation while scanning young.
Why do you need that? The entire point of scanning young is to scan a little piece of the heap, not all of it. This card table achieves that.
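
To make the idea concrete, here is a minimal sketch of a card table, not HotSpot's actual implementation. The class name, the method names, and the tiny 8 KiB "old generation" are all made up for illustration; only the 512-byte card size mirrors the HotSpot default.

```java
import java.util.ArrayList;
import java.util.List;

public class CardTableSketch {
    // Hypothetical sizes, chosen for illustration only.
    static final int CARD_SIZE = 512;          // bytes of old generation covered by one card
    static final int OLD_GEN_SIZE = 8 * 1024;  // pretend old generation of 8 KiB

    // One entry per card; true means "dirty".
    private final boolean[] cards = new boolean[OLD_GEN_SIZE / CARD_SIZE];

    // Stand-in for the write barrier: called when a field located at
    // oldGenOffset is updated to point into the young generation.
    void markDirty(int oldGenOffset) {
        cards[oldGenOffset / CARD_SIZE] = true;
    }

    // During a young collection only these cards of the old generation
    // have to be scanned, instead of the whole old generation.
    List<Integer> dirtyCards() {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < cards.length; i++) {
            if (cards[i]) {
                result.add(i);
            }
        }
        return result;
    }

    int cardCount() {
        return cards.length;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch();
        table.markDirty(700);   // falls into card 1
        table.markDirty(5000);  // falls into card 9
        // prints "Cards to scan: [1, 9] of 16"
        System.out.println("Cards to scan: " + table.dirtyCards() + " of " + table.cardCount());
    }
}
```

In this toy run only 2 of the 16 cards get scanned, which is the "little piece of the heap" point from above.
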
G1 has regions. What if you are scanning regionA and you see that it has pointers to some other regionB? Simply putting this information in the Card Table is not enough. Your card table will only know about regionA, and next time you scan regionB, how do you know you are supposed to scan regionA also? If you don't do that, obviously the heap integrity is broken.
As such: remembered sets. These sets are populated by an asynchronous thread: it scans the card table and, according to that information, it also scans where these "dirty" regions have pointers to. It keeps track of that regionA -> regionB connection. Each region has its own remembered set.
So when you reach the point that GC needs to happen, when scanning regionB you also look at its remembered set and find out that you also need to scan regionA.
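
A rough sketch of that bookkeeping, under heavy simplification: regions are plain strings and a remembered set just lists the regions that point into its owner. Real G1 tracks this at a finer granularity than whole regions; the class and method names here are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RememberedSetSketch {
    // Key: a region. Value: its remembered set, i.e. the regions that point INTO it.
    private final Map<String, Set<String>> rememberedSets = new HashMap<>();

    // What the asynchronous thread would do after finding a pointer
    // from fromRegion to toRegion while processing a dirty card.
    void recordReference(String fromRegion, String toRegion) {
        if (fromRegion.equals(toRegion)) {
            return; // pointers inside one region need no tracking
        }
        rememberedSets.computeIfAbsent(toRegion, k -> new TreeSet<>()).add(fromRegion);
    }

    // Regions that must also be scanned when the given region is collected.
    Set<String> regionsToScanFor(String region) {
        return rememberedSets.getOrDefault(region, Collections.emptySet());
    }

    public static void main(String[] args) {
        RememberedSetSketch heap = new RememberedSetSketch();
        heap.recordReference("regionA", "regionB");
        heap.recordReference("regionC", "regionB");
        // prints "[regionA, regionC]"
        System.out.println(heap.regionsToScanFor("regionB"));
        // prints "[]" because nothing points into regionA
        System.out.println(heap.regionsToScanFor("regionA"));
    }
}
```

Note the direction: the entry lives with regionB (the target), so collecting regionB tells you to look at regionA without scanning the rest of the heap.
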
In practice, this is why G1 became generational: these remembered sets turned out to be huge. If you divide the heap into young and old, there is no need to keep the connections between young regions, since you scan them all at once anyway, thus taking away the burden on the size of these sets. G1 wants to keep that 200ms (default) pause promise. To do that, you need to scan the young generation all at once (because the connections between young regions are not kept in the remembered sets, and otherwise heap integrity is gone), but at the same time, if you make the young generation small, the size of the remembered sets will be big.
As such, tuning these settings is an engineering miracle, IMHO.
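
The saving from the young/old split can be sketched like this. It is a toy rule that only encodes the reasoning above (young regions are scanned together, so young-to-young pointers are not recorded); the enum, the method name, and the sample references are all hypothetical.

```java
public class GenerationalFilterSketch {
    enum Kind { YOUNG, OLD }

    // Assumed rule for illustration: a pointer between two young regions
    // needs no remembered-set entry, everything else is recorded.
    static boolean needsRememberedSetEntry(Kind from, Kind to) {
        return !(from == Kind.YOUNG && to == Kind.YOUNG);
    }

    public static void main(String[] args) {
        // Each pair is {region kind holding the pointer, region kind pointed to}.
        Kind[][] references = {
            {Kind.YOUNG, Kind.YOUNG},
            {Kind.OLD, Kind.YOUNG},
            {Kind.YOUNG, Kind.YOUNG},
            {Kind.OLD, Kind.OLD},
        };
        int recorded = 0;
        for (Kind[] ref : references) {
            if (needsRememberedSetEntry(ref[0], ref[1])) {
                recorded++;
            }
        }
        // prints "Recorded 2 of 4 references"
        System.out.println("Recorded " + recorded + " of " + references.length + " references");
    }
}
```

The more of the heap that counts as young, the more pointers fall into the "not recorded" bucket, but the more there is to scan in one pause. That is the trade-off described above.
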