GemFire 7 and later
The purpose of this article is to explain how the eviction-threshold and critical-threshold settings impact the distributed system. There are pros and cons associated with these settings, and while they can provide added value, it is important to know when they put you in danger of negatively impacting your distributed system.
These resource manager settings are used to protect the system from running out of memory. In the perfect environment with sufficient resources, these settings may not come into play at all. If you have sufficient heap capacity in your environment to handle ongoing load, bursts of activity, even the failure of other nodes that could put pressure on the remaining nodes, then the incidents this article is attempting to prevent or remedy may not apply to you.
That said, perhaps you are constantly fighting increases in data or load that raise memory pressure in your environment. This, in turn, may drive eviction, and if eviction is insufficient, you may surpass your critical-threshold. Going above the critical-threshold must be avoided at all costs in the GemFire distributed system. While the threshold can protect you against failing fast due to running out of heap, any sustained length of time above it will likely cause a variety of issues, including the removal of a member from the distributed system.
If you have had a member kicked out of the distributed system due to not being responsive, this article may apply to your environment and help you to eliminate such incidents. If you have seen this log message anywhere in your environment, please consider the topics in this article to proactively manage and eliminate such issues:
[error 2015/08/26 22:22:04.522 UTC gemfire-node-49001 tid=0x49] Member: 10.2.46.202(gemfire-node-49001:1423):53178 above critical heap threshold
A warning sign that you are approaching this danger is when you surpass your eviction-threshold, if your system uses this feature. A similar message exists for surpassing the eviction-threshold:
[info 2015/08/26 17:39:11.030 UTC gemfire-node-49001 tid=0x49] Member: 10.2.46.202(gemfire-node-49001:1423):53178 above eviction threshold
Another indication of surpassing these thresholds, or having memory issues in your environment, is the LowMemoryException:
com.gemstone.gemfire.cache.LowMemoryException: Region: /gemFireRegion cannot process operation on key: XYZ because member [10.2.46.202(gemfire-node-49001:1423):53178] is running low on memory
The key is managing your data, making sure that you have sufficient capacity to handle bursts of activity, including the failure of a member of the distributed system. If you see any evidence of the above messages in your logs at any time, you may be subject to some incidents that could negatively impact your GemFire distributed system.
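As a rough way to check a log for the three warning signs quoted above, the following sketch classifies log lines by simple substring matching. The class and method names (`ThresholdLogScanner`, `classify`, `count`) are hypothetical and not part of GemFire; the matched substrings are taken from the sample messages in this article, and other GemFire versions may word them differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.EnumMap;
import java.util.Map;
import java.util.stream.Stream;

/** Hypothetical helper: counts memory-pressure messages in a GemFire log. */
final class ThresholdLogScanner {

    enum Signal { CRITICAL_THRESHOLD, EVICTION_THRESHOLD, LOW_MEMORY_EXCEPTION, NONE }

    /** Classifies one log line using substrings from the article's sample messages. */
    static Signal classify(String line) {
        if (line == null) {
            return Signal.NONE;
        }
        if (line.contains("above critical heap threshold")) {
            return Signal.CRITICAL_THRESHOLD;
        }
        if (line.contains("above eviction threshold")) {
            return Signal.EVICTION_THRESHOLD;
        }
        if (line.contains("LowMemoryException")) {
            return Signal.LOW_MEMORY_EXCEPTION;
        }
        return Signal.NONE;
    }

    /** Tallies how many lines fall into each category. */
    static Map<Signal, Integer> count(Stream<String> lines) {
        Map<Signal, Integer> counts = new EnumMap<>(Signal.class);
        lines.forEach(line -> counts.merge(classify(line), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ThresholdLogScanner <gemfire-log-file>");
            return;
        }
        // Files.lines streams the file lazily, so large logs are not loaded at once.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            System.out.println(count(lines));
        }
    }
}
```

Any non-zero count for the critical threshold or the LowMemoryException suggests the capacity topics in this article deserve attention.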
In general, most would prefer to have sufficient heap capacity in the cache for all objects that need to be accessed. When this is not possible, due to bursts or whatever else is driving an increase in memory footprint, eviction can kick in for regions so configured. This article does not get into the details of which eviction algorithm to use, as GemFire offers some alternatives here. With any option, the ultimate result is that the objects get destroyed or, with overflow, the values of the objects are overflowed to disk while the keys and entries remain in memory.
The following are some of the issues you may encounter if eviction is not used optimally in your environment:
- CPU spikes due to excessive or prolonged eviction
- Failure to restart due to overflowed capacity exceeding heap capacity
- Heap growth even with eviction, because only the value gets evicted
With too much eviction going on during such bursts, the resulting increase in CPU consumption can degrade response times in the system. If extreme CPU pressure is sustained while performing eviction, the distributed system may be negatively impacted, including the loss of a node.
One other issue, when eviction is combined with persistence, is that you must be aware of the amount of data being evicted. If a node is up and running, and a high percentage of its data has to be overflowed and persisted, you may be subject to some issues upon restart of that member.
The member may have been up and running fine in this case, but if the member gets stopped, an attempt to restart may encounter issues given all of the persisted/evicted data that is only on disk. This is due to how the restart processing handles recovery during GII (get initial image). If you had many entries on disk, from regions configured for eviction, then we may run out of heap during restart if the space consumed by those evicted entries, plus the total of the other regions' entries, is greater than the heap space available in memory.
This is due to the fact that, simply put, we do not handle eviction the same during restart as we do during ongoing operations. We do not "evict" any data during early phases of restart, even if a region is configured for eviction. So, it is possible that we run out of heap.
If you do run into a situation where you are having difficulty during restart because you have evicted and persisted so much data, there are some possible options to help you restart. One is to move all of your largest regions configured for eviction to the end of your cache.xml. The reason this may work is that, although we do not evict values during this early phase of restart, we will stop creating new values in memory once we surpass the configured eviction threshold. By moving such regions to the end, you have a very good chance of starting up successfully. That said, if you run into this situation, you are in desperate need of additional heap resources because you are not sized properly.
While GemFire provides this capability to evict values and potentially relieve some memory pressure, using eviction as part of your basic strategy to store lots of data on disk can have negative side effects as just discussed.
Finally, you could get into perpetual eviction mode if you are not sized appropriately. With overflow, we only evict the value of the object, so the key, the entry, and other overhead continue to consume and even grow the heap. It is therefore possible to put an entry and immediately have its value evicted because the heap is above the eviction-threshold. Yet the memory footprint still grew due to the new key and entry overhead. This can put you in danger of surpassing the critical-threshold, which is discussed next.
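The effect can be illustrated with a toy model. This is only a sketch under loud assumptions: the class `OverflowFootprintModel` is hypothetical, every entry has the same made-up key/entry overhead and value size, evicted values free their heap space instantly (real heaps need a garbage collection first), and nothing here reflects GemFire's actual eviction code.

```java
/** Toy model (not GemFire code): overflow evicts values, never keys or entries. */
final class OverflowFootprintModel {
    private final long evictionLimitBytes;
    private final long overheadPerEntryBytes; // assumed key + entry overhead that stays on heap
    private final long valueBytes;            // assumed size of each value
    private long entries;                     // all entries ever put
    private long valuesInMemory;              // entries whose value is still on heap

    OverflowFootprintModel(long tenuredBytes, int evictionThresholdPercent,
                           long overheadPerEntryBytes, long valueBytes) {
        this.evictionLimitBytes = tenuredBytes * evictionThresholdPercent / 100;
        this.overheadPerEntryBytes = overheadPerEntryBytes;
        this.valueBytes = valueBytes;
    }

    long usedBytes() {
        return entries * overheadPerEntryBytes + valuesInMemory * valueBytes;
    }

    long valuesInMemory() {
        return valuesInMemory;
    }

    /** Adds one entry, then overflows values while usage is above the eviction limit. */
    void put() {
        entries++;
        valuesInMemory++;
        while (usedBytes() > evictionLimitBytes && valuesInMemory > 0) {
            valuesInMemory--; // the value goes to disk; key and entry overhead remain
        }
    }

    public static void main(String[] args) {
        // 1000-byte "heap", 80% eviction threshold, 10 bytes overhead, 90 bytes per value.
        OverflowFootprintModel model = new OverflowFootprintModel(1000, 80, 10, 90);
        for (int i = 1; i <= 100; i++) {
            model.put();
            if (i % 20 == 0) {
                System.out.println(i + " puts: used=" + model.usedBytes()
                        + " valuesInMemory=" + model.valuesInMemory());
            }
        }
    }
}
```

In this model, once the per-entry overhead alone exceeds the eviction limit, every new value is overflowed immediately and usage keeps climbing with each put, which is the perpetual eviction situation described above.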
Please realize that eviction generates garbage, so the heap footprint is reduced by eviction only once garbage collection kicks in. Many use the CMSInitiatingOccupancyFraction JVM setting to control when CMS garbage collection kicks in. This should generally be 5-10% below the eviction-threshold so that we are not unnecessarily evicting data when garbage could be collected instead. Whether to use 5% or 10% depends on how big your heap is. If you are using heaps of 40 GB or more, you can configure your CMSInitiatingOccupancyFraction to be only 5% lower than your eviction-threshold. We always recommend testing these settings in your environment to reach optimal performance.
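The rule of thumb above can be written down as a small calculation. This is a sketch, not an official formula: the class `CmsFractionAdvisor` is hypothetical, and using a 10-point gap for every heap below 40 GB is an assumption, since the article only says the choice between 5% and 10% depends on heap size. Treat the result as a starting point for testing.

```java
/** Hypothetical helper applying the 5-10% rule of thumb from this article. */
final class CmsFractionAdvisor {

    // The article's "40 GB or more" boundary for using the smaller gap.
    static final long LARGE_HEAP_BYTES = 40L * 1024 * 1024 * 1024;

    /**
     * Suggests a CMSInitiatingOccupancyFraction below the eviction-threshold.
     * Assumption: 5 points below for heaps of 40 GB or more, 10 points otherwise.
     */
    static int suggestedOccupancyFraction(int evictionThresholdPercent, long heapBytes) {
        int gap = heapBytes >= LARGE_HEAP_BYTES ? 5 : 10;
        if (evictionThresholdPercent <= gap || evictionThresholdPercent > 100) {
            throw new IllegalArgumentException(
                    "eviction-threshold out of range: " + evictionThresholdPercent);
        }
        return evictionThresholdPercent - gap;
    }

    public static void main(String[] args) {
        long heapBytes = 16L * 1024 * 1024 * 1024; // example: 16 GB heap
        int fraction = suggestedOccupancyFraction(80, heapBytes);
        // -XX:CMSInitiatingOccupancyFraction is the HotSpot flag for this setting.
        System.out.println("-XX:CMSInitiatingOccupancyFraction=" + fraction);
    }
}
```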
The critical-threshold setting is a last effort to protect your heap from running out of memory. For whatever reason, your heap is in danger, and so we provide a mechanism to essentially stop GemFire in its tracks so that ongoing GemFire processing does not continue to consume even more resources when the system is so low. The hope is that by pausing GemFire processing for a very short duration, we will give garbage collection a chance to collect all garbage and bring us back to much lower levels of heap consumption.
This is very dangerous territory, however, and must be avoided in your environment if you want to avoid negative symptoms in your GemFire cluster. Members can appear unresponsive and get kicked out by other members if the memory footprint stays above the critical-threshold for any sustained period of time.
The likely issue here is that you are simply consuming much more heap than you had planned, and even eviction is unable to stop the growing memory footprint. Perhaps a node failed, and the remaining nodes took over some additional burden related to partitioned regions (if recovery-delay is set > 0ms). One way or another, you need to add capacity if you start seeing the messages above. Some options are as follows:
- Increase heap via -Xmx and -Xms
- Increase critical-threshold to a higher percentage
- Lower eviction-threshold
One factor to weigh is how much memory exists between your critical-threshold setting and your total tenured heap capacity. The critical-threshold and eviction-threshold are percentages of the old generation (tenured) heap. If you are configured to use a 90% critical-threshold, for example, and your tenured heap is 30 GB, that leaves 3 GB of memory that is only used to protect you from running out of memory. It is potentially wasted in your GemFire environment and is perhaps more space than you need for that protection. Perhaps a 95% critical-threshold would suffice, providing 1.5 GB of headroom protection from running out of memory, while also providing 1.5 GB more room in your tenured heap before GemFire hits the critical-threshold and starts blocking activity in the system.
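The headroom arithmetic in the example above is simply the tenured heap multiplied by the fraction left above the critical-threshold. A minimal sketch follows; the class and method names are hypothetical, and the inputs are the article's example figures.

```java
/** Hypothetical helper: memory reserved above the critical-threshold. */
final class CriticalHeadroom {

    /** Returns the tenured heap, in GB, that lies above the critical-threshold. */
    static double headroomGb(double tenuredGb, double criticalThresholdPercent) {
        return tenuredGb * (100.0 - criticalThresholdPercent) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(headroomGb(30.0, 90.0)); // prints 3.0
        System.out.println(headroomGb(30.0, 95.0)); // prints 1.5
    }
}
```

Moving from 90% to 95% in this example returns 1.5 GB of tenured heap to normal use while keeping 1.5 GB in reserve.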
Lowering the eviction-threshold would also serve to add capacity by evicting more values earlier, and may protect you from hitting the critical-threshold. Again, make sure to have the CMSInitiatingOccupancyFraction set appropriately so that garbage collection has enough time to eliminate garbage in the system before either threshold is hit.