GemFire 7 and later
This article explains why you may be getting a LowMemoryException in your environment. It also outlines steps for avoiding such exceptions so that you can proactively manage your GemFire environment.
A LowMemoryException indicates that a member's heap usage is above the critical-threshold. Any sustained period above this threshold is likely to have a negative impact on the distributed system, including the removal of one or more members. That, in turn, places a heavy burden on the remaining nodes, which must handle the load. Here is an example from a member's log:
[severe 2015/08/26 22:22:31.308 UTC gemfire2 tid=0xd6b] UnExpected exception during function execution on local node Distributed Region
com.gemstone.gemfire.cache.LowMemoryException: Region: /MyRegion cannot process operation on key: 5518157 because member [10.0.0.20(gemfire2:1423):53178] is running low on memory
    at com.gemstone.gemfire.internal.cache.LocalRegion.checkIfAboveThreshold(LocalRegion.java:5726)
    at com.gemstone.gemfire.internal.cache.LocalRegion.checkIfAboveThreshold(LocalRegion.java:5707)
    at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5647)
    at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
    at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
    at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5034)
    at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1731)
    at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1713)
    at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
Another place you may experience this in your environment is during query execution. The message and stack trace are slightly different, but the root cause is the same: you are over the critical-threshold, and it is affecting your operations and your entire distributed system. Here is a small piece of this exception:
com.gemstone.gemfire.cache.query.QueryExecutionLowMemoryException: Query execution canceled due to memory threshold crossed in system, memory used: 44,016,607,584 bytes.
    at com.gemstone.gemfire.cache.query.internal.DefaultQueryService.newQuery(DefaultQueryService.java:105)
The operation being performed is largely irrelevant to the issue at hand. What matters is that memory consumption has surpassed the critical-threshold. The focus then needs to turn to the cause, and to how to avoid hitting this threshold in your environment.
If you surpass this configured threshold in your environment, all activity that might add data to the cache is refused, to give eviction and garbage collection time to reduce the memory footprint. This JVM, other JVMs in the distributed system, and all clients in the system receive a LowMemoryException for operations that would add to the critical member's heap consumption. Activities that fetch data or reduce data are allowed to proceed. For the list of refused operations, see the Javadocs for the ResourceManager method setCriticalHeapPercentage.
For a more detailed discussion of the critical-threshold, and of how it works with eviction to protect the system from running out of memory, you may also want to read this article related to using those thresholds in your environment.
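The behavior described above can be illustrated with a small, hedged sketch. The Java below is not GemFire's implementation. It only models the rule that, once heap usage crosses the critical-threshold, operations that add data are refused while operations that fetch or reduce data proceed. The class name HeapThresholdSketch, the State enum, the example percentages (75 and 90), and the choice to treat a reading exactly at a threshold as having crossed it are all assumptions made for illustration. The main method reads the JVM's own heap pools through the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Illustrative model only; names and boundary handling are assumptions,
// not GemFire internals.
class HeapThresholdSketch {
    enum State { NORMAL, EVICTION, CRITICAL }

    // Classify a heap reading against the two thresholds (percent of max).
    static State classify(long usedBytes, long maxBytes,
                          float evictionPct, float criticalPct) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        double usedPct = 100.0 * usedBytes / maxBytes;
        if (usedPct >= criticalPct) return State.CRITICAL;
        if (usedPct >= evictionPct) return State.EVICTION;
        return State.NORMAL;
    }

    // Above critical, only operations that add data are refused;
    // fetches and removals are still allowed.
    static boolean isAllowed(State state, boolean addsData) {
        return !(state == State.CRITICAL && addsData);
    }

    public static void main(String[] args) {
        // Example thresholds (assumed values, not recommendations).
        float evictionPct = 75f;
        float criticalPct = 90f;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            MemoryUsage usage = pool.getUsage();
            // Skip pools that are invalid or have no defined maximum.
            if (usage == null || usage.getMax() <= 0) continue;
            State state = classify(usage.getUsed(), usage.getMax(),
                                   evictionPct, criticalPct);
            System.out.println(pool.getName() + ": " + state
                    + ", puts allowed: " + isAllowed(state, true));
        }
    }
}
```

Running this against a live JVM prints one line per heap pool. The tenured (old generation) pool is the one most relevant to the thresholds discussed in this article.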
If you see either of these exceptions, the cause may be increased load of one kind or another. Perhaps you are running a long query. Perhaps another node failed and this node has taken on additional data. Whatever the reason, you need to assess whether you are having a temporary issue or whether you need to resize based on changes in your environment.
Has your data changed in any way that could lead to consuming additional capacity? Has your load increased over time, and has the capacity of the overall system increased correspondingly? Is your critical-threshold set too low? Do you have additional capacity available in the JVM to increase the critical-threshold or, better yet, to increase the total heap space? In general, hitting either of these LowMemoryExceptions means that these questions need to be asked and answered.
Here is a brief list of ways to resolve this issue in the short term. In the long term, you should size your data and your expected memory footprint, build in the overhead needed to handle bursts of activity, an increase in clients, and failures of other nodes, and then establish a heap size large enough to avoid hitting the critical-threshold no matter what burst of activity occurs in the distributed system (DS).
- Increase Tenured Heap by increasing Total Heap
- Increase critical-threshold
- Decrease eviction-threshold
- Use smaller queries that don't generate so much garbage
- Lower CMS Initiating Occupancy Fraction if garbage is not collected quickly enough
- Is there a potential memory leak? Gather a heap dump
- Is load balanced across all the members evenly? Perhaps simulate a rebalance.
Increasing total heap provides additional capacity, which may keep you from surpassing the critical-threshold. If physical memory on the server prevents you from increasing total heap, you could instead lower NewSize to leave more room for the tenured generation. Use caution when changing NewSize, as that can have a negative impact as well.
Another option is to increase your critical-threshold. If it is currently set to 90%, for example, you may be able to increase it to 95%, which makes an additional 5% of your tenured space usable. For larger heaps, this is likely fine and still leaves enough headroom before the JVM runs out of memory. For very small heaps, a 95% threshold may not leave enough headroom, and you could get an OutOfMemoryError (OOME) before garbage collection brings you back below the threshold.
Regarding garbage collection, if you are passing through the critical-threshold because collection is initiated too late, you can lower the CMS initiating occupancy fraction so that garbage has been collected before you reach critical. If you do ever hit the critical-threshold, it should not be due to garbage in the system.
The one exception to that statement may be related to longer running queries that generate a lot of garbage. If the memory pressure increases while the query is running and passes the threshold, you need to consider adding capacity or altering the query to limit the increase in pressure on the memory footprint.
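The threshold arithmetic above can be made concrete with a short sketch. It computes how much tenured space remains above a given critical-threshold, and it checks that the CMS initiating occupancy fraction sits below the eviction-threshold, which in turn sits below the critical-threshold. The class and method names are hypothetical, the 30 GiB tenured size and the percentages are arbitrary examples, and requiring the CMS fraction to be below the eviction-threshold is a rule of thumb assumed here rather than something this article states.

```java
// Illustrative arithmetic only; names and example values are assumptions.
class CriticalHeadroomSketch {

    // Bytes of tenured space left between the critical-threshold
    // and a completely full tenured generation.
    static long headroomBytes(long tenuredBytes, int criticalPct) {
        if (tenuredBytes < 0 || criticalPct < 0 || criticalPct > 100) {
            throw new IllegalArgumentException("invalid size or percentage");
        }
        return tenuredBytes - tenuredBytes * criticalPct / 100;
    }

    // True when collection starts before eviction, and eviction
    // starts before the critical-threshold is reached.
    static boolean orderedSensibly(int cmsInitiatingPct, int evictionPct,
                                   int criticalPct) {
        return 0 < cmsInitiatingPct
                && cmsInitiatingPct < evictionPct
                && evictionPct < criticalPct
                && criticalPct <= 100;
    }

    public static void main(String[] args) {
        long tenured = 30L * 1024 * 1024 * 1024; // example: 30 GiB tenured
        System.out.println("Headroom at 90%: "
                + headroomBytes(tenured, 90) + " bytes");
        System.out.println("Headroom at 95%: "
                + headroomBytes(tenured, 95) + " bytes");
        System.out.println("65 < 80 < 95 ordered sensibly: "
                + orderedSensibly(65, 80, 95));
    }
}
```

Moving from 90% to 95% halves the headroom above the threshold, which is why the change is safer on a large heap than on a very small one.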