[case9] Analysis of Frequent GC (Allocation Failure) and young gc with Too Long Time

  case, jvm

Order

This article mainly analyzes a case of frequent GC (Allocation Failure) and young gc with too long time.

Symptoms

  • Gc throughput percent decreased gradually from 99.96% to below 99%, reaching 98% and the lowest point reaching 94%
  • Young gc time is gradually increasing, from the average of more than ten milliseconds, to break through 50 and then 100,150,200,250.
  • In 8.5 days, more than 9,000 gc’s occurred, of which full gc was 4 times, with an average of nearly 8 seconds. most of them were young gc (Allocation failure is the main factor.), averaging more than 270 milliseconds, with a maximum of nearly 7 seconds
  • The average object creation rate is 10.63 mb/sec, the average promotion rate is 2 kb/sec, the cpu utilization rate is normal, and there is no obvious surge.

Jvm parameters

-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:+UseAdaptiveSizePolicy -XX:MaxHeapSize=2147483648 -XX:MaxNewSize=1073741824 -XX:NewSize=1073741824 -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps

Jdk version

java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

full gc

27.066: [Full GC (Metadata GC Threshold) [PSYoungGen: 19211K->0K(917504K)] [ParOldGen: 80K->18440K(1048576K)] 19291K->18440K(1966080K), [Metaspace: 20943K->20943K(1069056K)], 0.5005658 secs] [Times: user=0.24 sys=0.01, real=0.50 secs] 
100.675: [Full GC (Metadata GC Threshold) [PSYoungGen: 14699K->0K(917504K)] [ParOldGen: 18464K->23826K(1048576K)] 33164K->23826K(1966080K), [Metaspace: 34777K->34777K(1081344K)], 0.7937738 secs] [Times: user=0.37 sys=0.01, real=0.79 secs]
195.073: [Full GC (Metadata GC Threshold) [PSYoungGen: 24843K->0K(1022464K)] [ParOldGen: 30048K->44782K(1048576K)] 54892K->44782K(2071040K), [Metaspace: 58220K->58220K(1101824K)], 3.7936515 secs] [Times: user=1.86 sys=0.02, real=3.79 secs] 
242605.669: [Full GC (Ergonomics) [PSYoungGen: 67276K->0K(882688K)] [ParOldGen: 1042358K->117634K(1048576K)] 1109635K->117634K(1931264K), [Metaspace: 91365K->90958K(1132544K)], 22.1573804 secs] [Times: user=2.50 sys=3.51, real=22.16 secs]

Four fullgcs can be found, the first three are caused by Metadata GC Threshold, and only the last one is caused by Ergonomics.

Full GC (Metadata GC Threshold)

Java8 is used here, and the parameters do not specify the size and upper limit of metaspace explicitly. check it out.

jstat -gcmetacapacity 7
   MCMN       MCMX        MC       CCSMN      CCSMX       CCSC     YGC   FGC    FGCT     GCT
       0.0  1136640.0    99456.0        0.0  1048576.0    12160.0 38009    16  275.801 14361.992
  • Ignore the following FGC because the log analyzed is only one quarter of them.
  • It can be seen here that there is more than one g in mcmx (maximum metaspace capacity (kb)) while MC(Metaspace capacity (kB)) is only about 97M. why is Full GC (Metadata GC Threshold) caused?

Related parameters

  • -xx: metaparacesize, the initial space size (also the initial threshold, namely the initial high-water-mark), reaching this value will trigger garbage collection for type unloading, and GC will adjust this value: if a large amount of space is released, the value will be appropriately reduced; If a small amount of space is released, the value should be appropriately increased when the MaxMetaspaceSize is not exceeded.
  • -xx: maxmetaaspacesize, the maximum space, is unlimited by default, depending on the space capacity of the local system.
  • -xx: MinMetaspacefreeratio, the percentage of minimum metaaspace remaining space capacity after GC (That is, the maximum occupancy size of metadata in the current allocation size), if the idle ratio is less than this parameter (I.e. exceeding the maximum occupancy size), meta space will be expanded.
  • -xx: maxMetaspacefreeratio, percentage of maximum metaaspace remaining space capacity after GC (That is, the minimum occupancy size of metadata in the current allocation size), if the idle ratio is greater than this parameter (I.e. less than the minimum occupancy size), meta space will be shrunk.

Since there is no setting, the default value on the machine is:

java -XX:+PrintFlagsFinal | grep Meta
    uintx InitialBootClassLoaderMetaspaceSize       = 4194304                             {product}
    uintx MaxMetaspaceExpansion                     = 5451776                             {product}
    uintx MaxMetaspaceFreeRatio                     = 70                                  {product}
    uintx MaxMetaspaceSize                          = 18446744073709547520                    {product}
    uintx MetaspaceSize                             = 21807104                            {pd product}
    uintx MinMetaspaceExpansion                     = 339968                              {product}
    uintx MinMetaspaceFreeRatio                     = 40                                  {product}
     bool TraceMetadataHumongousAllocation          = false                               {product}
     bool UseLargePagesInMetaspace                  = false                               {product}

It can be seen that MinMetaspaceFreeRatio is 40, MaxMetaspaceFreeRatio is 70, MetaspaceSize is 20M, and Full GC (Metadata GC Threshold) is mainly divided into three times

  • For the first time, [metaspace: 20943k-> 20943k (1069056k)]
  • Second time, [metaspace: 34777k-> 34777k (1081344k)]
  • Third time, [metaspace: 58220k-> 58220k (1101824k)]

It can be seen that the threshold of metaspace is constantly adjusted dynamically. As for the logic of specific adjustment, the official document seems not to mention it, so we will not delve into it for the time being. As long as the Max value is not exceeded, there will be no fatal impact. however, for low-latency applications, gc time consumption caused by dynamic adjustment should be avoided as much as possible, which can be solved by optimizing calculation and setting initial threshold.

Full GC (Ergonomics)

The reason that you can get to full gc here is Ergonomics, because the UseAdaptiveSizePolicy is turned on, and the jvm makes self-adaptive adjustment to cause full gc.

GC (Allocation Failure)

After analyzing full gc, let’s look at young gc. 99% of log are young gc caused by GC (Allocation Failure). Allocation Failure indicates minor gc caused by applying to young generation(eden) for space for a new object, but the appropriate space left by young generation(eden) is not large enough.

-XX:+PrintTenuringDistribution

Desired survivor size 75497472 bytes, new threshold 2 (max 15)
- age   1:   68407384 bytes,   68407384 total
- age   2:   12494576 bytes,   80901960 total
- age   3:      79376 bytes,   80981336 total
- age   4:    2904256 bytes,   83885592 total
  • This Desired survivor size indicates that the maximum space allowed in the survivor area is 75497472 bytes
  • The following list of objects shows the age distribution of survivor after this gc. the total size is 83885592 > 75497472, while the age1 size is 68407384 < 75497472, so the new threshold becomes 2 (Acting on next gc)。 If the object is not released in the next gc, the object that exceeds the threshold will be promoted to old generation.

Age list is empty

59.463: [GC (Allocation Failure) 
Desired survivor size 134217728 bytes, new threshold 7 (max 15)
[PSYoungGen: 786432K->14020K(917504K)] 804872K->32469K(1966080K), 0.1116049 secs] [Times: user=0.10 sys=0.01, real=0.20 secs] 

There is no distribution of age objects under the Desired survivor size line here, which means that there are no surviving objects with age less than max threshold in the current survivor area after this gc. There is no output here, which means that after gc recycles objects this time, no surviving objects can be copied to the new survivor area.

Survivor has examples of objects after gc

jstat -gcutil -h10 7 10000 10000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  99.99  90.38  29.82  97.84  96.99    413  158.501     4   14.597  173.098
 11.60   0.00  76.00  29.83  97.84  96.99    414  158.511     4   14.597  173.109
 11.60   0.00  77.16  29.83  97.84  96.99    414  158.511     4   14.597  173.109
  0.00  13.67  60.04  29.83  97.84  96.99    415  158.578     4   14.597  173.176
  0.00  13.67  61.05  29.83  97.84  96.99    415  158.578     4   14.597  173.176
  • Young generation = eden+S1; before ygc; After ygc, young generation = eden+S0
  • Observation shows that the old generation space has not changed since ygc, indicating that no object has been promoted to old generation in ygc this time.
  • After gc, the surviving object was moved to another survivor area
  • Since the sample is taken every 10 seconds, there is a delay, that is, immediately after gc, new objects are allocated in the eden area, so the eden area seen here is occupied by objects.

real time > usr time + sys time

722914.974: [GC (Allocation Failure) 
Desired survivor size 109576192 bytes, new threshold 15 (max 15)
[PSYoungGen: 876522K->8608K(941568K)] 1526192K->658293K(1990144K), 0.0102709 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
722975.207: [GC (Allocation Failure) 
Desired survivor size 103284736 bytes, new threshold 15 (max 15)
[PSYoungGen: 843168K->39278K(941568K)] 1492853K->688988K(1990144K), 0.3607036 secs] [Times: user=0.17 sys=0.00, real=0.36 secs] 

The real time time of gcs with more than 300 times is greater than usr time+sys time.

  • Real: refers to the WallClock Time elapsed from the start to the end of the operation
  • User: refers to the CPU time consumed in user mode;
  • Sys: Refers to CPU time consumed in kernel state.

Wall clock time includes various non-operational waiting times, such as waiting for disk I/O and waiting for thread blocking, while CPU time does not include these times. However, when the system has multiple CPUs or multi-cores, multi-thread operation will superimpose these CPU times, so it is completely normal to see that user or sys time exceeds real time.

User+sys is the actual time spent by the CPU. note that this value counts the time on all CPUs. if the process works in a multi-threaded environment an d the multi-threaded time is superimposed, this value will exceed the value recorded by real, that is, user+sys >= real.

More than 300 real time times here are greater than usr time+sys time, indicating that there may be two problems, one is IO intensive operation, the other is cpu (Distribution) is not enough.

New Generation Garbage Recovery Mechanism

  • The new object tries to allocate on the stack, and if it cannot try TLAB allocation again, it will consider whether to directly bypass eden area to allocate space in the old generation (-xx: PretensisizeThreshold sets the threshold for large objects to enter the old generation directly. When the object size exceeds this value, it will be allocated directly to the old generation.), if not, finally consider applying for space at eden.

  • Apply to eden for space to create a new object. eden does not have suitable space, thus triggering minor gc
  • Minor gc processes the surviving objects in eden area and from survivor area.

    • If these objects reach the threshold age, they will be directly promoted to the older generation.
    • If the object to be copied is too large, it will not be copied to survivor, but will enter the old generation directly.
    • If there is not enough space in the to survivor area/or there is not enough space in the replication process, survivor overflow will occur and directly enter the elderly generation.
    • Otherwise, if there is enough space in the to survivor area, the surviving object is copied to the to survivor area.
  • At this time, the remaining objects in eden area and from survivor area are garbage objects, directly erasing and recycling, and the released space becomes a new allocatable space
  • After minor gc, if eden space is sufficient, the new object allocates space in eden; If eden space is still insufficient, the new object allocates space directly to the older generation.

Summary

From the above analysis, it can be seen that young generation seems a bit big, ygc takes a long time. In addition, the survivor space is basically empty after each ygc, which shows that the newly-born objects are generated quickly and have a short life cycle. The originally designed survivor space is not used. Therefore, you can consider reducing the size of young generation, or try G1 instead.

There are several points about -XX:+PrintTenuringDistribution, which should be made clear:

  • Which area of this print has an object distribution (survivor)
  • Is it printed before gc or after gc (Print after gc)
  • When a newborn object first arrives at survivor, does its age count as 0 or 1

The age of the object is the number of MinorGCs he has experienced. when the object is first allocated, the age is 0. after the first MinorGC, if it has not been recovered, the age is +1. since it is the first minor gc, it enters the survivor area. Therefore, the age of the subject when entering the survivor area for the first time is 1.

  • New threshold dynamic adjustment

If the total size of the lower age is larger than the Desired survivor size, it means that the survivor space overflows, is filled, and then the threshold is recalculated.

doc