Why g1 needs both of these data structures ?
My understanding is:
Why can't RS hold all of the information instead of using CT ?
Is it because otherwise there would be too much of duplicate data stored in different RSs ? For example - Young generation has regions A and B, objects in both these regions have same outside reference from old generation. In this case both RSs associated with region A and B will store this information which is redundant, is this explanation correct ?
Let's first put some things in correct order. A Card Table
shows were there might be incoming references into this region. It does sort of a "hash". For a single byte in the card table - there are many bytes in the old region. It's like saying that if (in theory) you have a card table that looks like this (a single card is marked as "dirty")
0 1 0 0 0 0
For that 1
(dirty card) there is a certain mapping in the old region that needs to be scanned also, when scanning the young regions.
0 1 0 .....
0 - 512 512-1024 ...........
So that dirty card corresponds to a certain portion (from 512 to 1024 bytes) in the old generation, that will be scanned also, as part of the young generation scanning.
G1 has regions and now you need to understand how CT
and RS
work in tandem. Let's say GC scans Region1
at this point in time, it takes everything that is alive from this region and copies into Region2
. At the same time Region2
has references to Region3
. If we use CT
, Region2
will be marked as "dirty" (by placing the specific card in the card table).
Next cycle wants to scan Region3
. How does Region3
knows if there are other regions that might point into it? And indeed there are such cases: Region2
has references into Region3
. Well, it could look into the CT
and check every single dirty card (and thus every region that these dirty cards corresponds to) and see if there are references from any of those regions in here. Think about it: in order to clear Region3
, G1
has to look at the entire CT
. In the worst case scenario, it should be scanning the entire heap - for a single region.
Thus: Remembered Sets
. These data structures are populated based on what CT
knows about, by an asynchronous thread. When Region2
is marked as dirty, an asynchronous thread will start computing its RS
. When Region3
needs to be scanned, its RS
will only have an entry with Region2
.
Thus, in order to scan a single region, G1
only needs to look into that particular RS
.