I am experiencing infrequent long GC pauses with G1GC algorithm - 30+ seconds for a month. Once it happens, I restart my service and this delay won't happen again for next 1 month.
I am attaching GC log.
2016-06-15T02:30:17.407-0400: 1702455.276: Total time for which application threads were stopped: 0.2900480 seconds
2016-06-15T02:30:25.509-0400: 1702463.378: Total time for which application threads were stopped: 0.0018900 seconds
2016-06-15T02:30:29.817-0400: 1702467.686: [GC pause (young)
Desired survivor size 7340032 bytes, new threshold 15 (max 15)
- age 1: 1903912 bytes, 1903912 total
- age 2: 641440 bytes, 2545352 total
- age 3: 455856 bytes, 3001208 total
- age 4: 563544 bytes, 3564752 total
- age 5: 1873368 bytes, 5438120 total
- age 6: 326024 bytes, 5764144 total
- age 7: 299144 bytes, 6063288 total
- age 8: 304632 bytes, 6367920 total
- age 9: 273160 bytes, 6641080 total
- age 10: 309152 bytes, 6950232 total
- age 11: 156824 bytes, 7107056 total
- age 12: 135064 bytes, 7242120 total
, 0.1199490 secs]
[Parallel Time: 118.2 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 1702467686.7, Avg: 1702467686.8, Max: 1702467686.9, Diff: 0.2]
[Ext Root Scanning (ms): Min: 18.4, Avg: 20.2, Max: 24.6, Diff: 6.2, Sum: 80.7]
[Update RS (ms): Min: 75.5, Avg: 79.5, Max: 80.8, Diff: 5.3, Sum: 317.8]
[Processed Buffers: Min: 52, Avg: 61.0, Max: 70, Diff: 18, Sum: 244]
[Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
[Object Copy (ms): Min: 17.7, Avg: 18.1, Max: 18.6, Diff: 0.8, Sum: 72.3]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[GC Worker Other (ms): Min: 0.1, Avg: 0.1, Max: 0.1, Diff: 0.0, Sum: 0.4]
[GC Worker Total (ms): Min: 117.8, Avg: 117.9, Max: 118.0, Diff: 0.2, Sum: 471.6]
[GC Worker End (ms): Min: 1702467804.7, Avg: 1702467804.7, Max: 1702467804.7, Diff: 0.0]
[Code Root Fixup: 0.0 ms]
[Clear CT: 0.2 ms]
[Other: 1.6 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.3 ms]
[Ref Enq: 0.0 ms]
[Free CSet: 0.4 ms]
[Eden: 100.0M(100.0M)->0.0B(100.0M) Survivors: 9216.0K->9216.0K Heap: 2164.6M(2198.0M)->2065.5M(2198.0M)]
[Times: user=0.48 sys=0.00, real=37.62 secs]
2016-06-15T02:31:07.438-0400: 1702505.307: Total time for which application threads were stopped: 37.6223210 seconds
2016-06-15T02:31:07.443-0400: 1702505.312: Total time for which application threads were stopped: 0.0034690 seconds
2016-06-15T02:31:07.446-0400: 1702505.315: Total time for which application threads were stopped: 0.0018140 seconds
2016-06-15T02:31:07.451-0400: 1702505.320: Total time for which application threads were stopped: 0.0020130 seconds
2016-06-15T02:31:07.453-0400: 1702505.322: Total time for which application threads were stopped: 0.0015850 seconds
I am using below G1GC flags :
-server -d64 -verbose:gc -Xms2g -Xmx4g -XX:+UseG1GC
-XX:MaxGCPauseMillis=1500 -XX:G1HeapRegionSize=2
-XX:+PrintFlagsFinal -XX:ParallelGCThreads=4 -XX:ConcGCThreads=2
-XX:PermSize=64m -XX:MaxPermSize=256m -XX:+DisableExplicitGC
-XX:+PrintGCDateStamps -XX:+DoEscapeAnalysis -XX:+PrintTenuringDistribution
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=$PP_HOME -XX:+UseCompressedOops -XX:+AggressiveOpts"
Machine configuration:
Red Hat Enterprise Linux Server release 5.11
Dual core CPU
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
How can we find the root cause for long GC from above statements?
Update to the latest version of Java 8 - you should not be using G1 with Java 7 because it was not production ready at that time. Odds are very good you have hit a now-patched bug. Yes, there are ways to get OpenJDK 8 onto a RHEL 5 box. Yes, your application will increase massively in speed.
Also, this:
-XX:MaxGCPauseMillis=1500
and this:
-Xms2g -Xmx4g
Would suggest you aren't running in the sort of large-heap, low-latency environment G1 is designed for. With a JRE that old, you may be better off with the default (ParallelGC) and maybe tuning a few options to reduce pause times (mileage may vary). Or CMS, which was more mature then and with tuning can be quite stable.
And for goodness sakes, if your app is critical enough that a 30 second GC pause every month is a problem then you can probably afford a system with RHEL 7 and more than dual cores.