GemFire 7 and later
This article discusses potential distributed deadlock scenarios, how to handle them, and, in some cases, what you can do to limit the impact on your GemFire distributed system (DS). It does not cover every possible scenario, particularly because such deadlocks are not expected in normal operation of the product. The article may be expanded over time if additional scenarios are found in which customers can limit the business impact themselves.
Your distributed system is unresponsive in some way. Perhaps a gfsh command is not returning. Perhaps you just started a new node that joined the cluster, and you began encountering issues that cascade across the DS. Whatever the scenario, the general goal is to get out of the situation with minimal business impact. Sometimes it may seem that all nodes are affected and that restarting the entire cluster is the only alternative. In some cases, this is true. In others, you may be able to determine a course of action that lets the bulk of the DS keep running and return to a stable state. Perhaps you have log messages indicating that a node, or a subset of nodes, is not responding. This article is intended to help you move your system to a better state.
Once your symptoms suggest that a deadlock may be possible, the first and most important step is to gather thread dumps across the entire cluster. It may not seem like the most urgent step, but it is essential if you want a full understanding of what happened and a more stable cluster in the long term. Take 2 to 3 thread dumps per member, 30 seconds to 1 minute apart. With only a single thread dump as a snapshot in time, it is impossible to establish whether a thread is BLOCKED briefly during the natural course of processing, which is normal, or whether it remains blocked for a minute or more, which is not normal and indicates that something is causing issues in the system. Please add this step to any run book you have, so that whoever is managing the system at any given time understands that gathering the dumps is vital to determining root cause.
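To illustrate why several dumps per member matter, the following is a minimal sketch that reports threads appearing as BLOCKED in every dump taken from one member. It assumes the dumps are plain text in the usual jstack layout (a quoted thread name on the header line, followed by a `java.lang.Thread.State:` line). The class and method names are hypothetical, and matching by thread name alone is a simplification: a thread listed here is only a candidate for closer inspection, not proof of a deadlock.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds threads that are BLOCKED in every supplied dump.
class PersistentlyBlocked {
    // Header line of a thread entry, e.g. "Thread-1" #12 prio=5 ...
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\"");
    // State line, e.g. java.lang.Thread.State: BLOCKED (on object monitor)
    private static final Pattern STATE =
        Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    // Names of the threads reported as BLOCKED in a single dump.
    static Set<String> blockedThreads(String dump) {
        Set<String> blocked = new TreeSet<>();
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                current = header.group(1);
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (current != null && state.find()) {
                if ("BLOCKED".equals(state.group(1))) {
                    blocked.add(current);
                }
                current = null; // only the first state line belongs to this thread
            }
        }
        return blocked;
    }

    // Names of the threads that are BLOCKED in all of the dumps.
    static Set<String> blockedInAll(List<String> dumps) {
        Set<String> result = null;
        for (String dump : dumps) {
            Set<String> blocked = blockedThreads(dump);
            if (result == null) {
                result = blocked;
            } else {
                result.retainAll(blocked);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }
}
```

A thread that shows up in the result for dumps taken a minute apart is worth examining in the full dumps, together with the lock it is waiting on and the thread that holds it.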
A number of symptoms can give the impression of a blockage in the system. As stated earlier, a gfsh command might not return, which certainly warrants taking thread dumps across the system as described. Certain log messages are also a general indication of trouble. The following examples are log messages that should prompt you to examine the DS closely to determine whether an issue exists in the cluster.
- [warning 2015/09/02 08:51:57.846 UTC gemfire-data-node-gfirep05-49002 tid=0x3ee3] 15 seconds have elapsed while waiting for replies: :50621]> on 220.127.116.11(gemfire-data-node-gfirep05-49002:18683):55692 whose current membership list is: [[...]]
- [warning 2015/09/02 07:13:16.049 UTC gemfire-data-node-gfirep05-49002 tid=0x324] 15 seconds have elapsed while waiting for replies: <com.gemstone.gemfire.internal.cache.InitialImageOperation$ImageProcessor 3188 waiting for 1 replies from [18.104.22.168(gemfire-data-node-gfirep06-49002:2283):52424]; waiting for 0 messages in-flight; region=/__PR/_B__XYZ_106; abort=false> on 22.214.171.124(gemfire-data-node-gfirep05-49002:18683):55692 whose current membership list is: [[...]]
- [warning 2015/09/02 07:55:38.388 UTC gemfire-data-node-gfirep04-49002 tid=0xa2e44] Rejected connection from /126.96.36.199 because current connection count of 800 is greater than or equal to the configured max of 800
There are a few different flavors of the "15 seconds have elapsed while waiting for replies" message. The 15 is the value of the ack-wait-threshold property, which defaults to 15 seconds. If you have changed this setting to 10, you will see "10 seconds have elapsed...". If you see such messages repeatedly and the replies are not coming back, gather thread dumps and try to take action before the issue cascades and impacts other members of the cluster. The same holds true if you are repeatedly seeing DLock requests outstanding.
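As a rough way to spot members that log these warnings repeatedly, the sketch below counts "seconds have elapsed while waiting for replies" warnings per member name. The log header layout it expects (level, date, time, time zone, member name, tid) is inferred from the sample messages above and is an assumption; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts reply-wait warnings per logging member.
class ReplyWaitScanner {
    // Assumed header layout: [warning <date> <time> <zone> <member> tid=<id>]
    private static final Pattern WAITING = Pattern.compile(
        "\\[warning \\S+ \\S+ \\S+ (\\S+) tid=\\S+\\] "
        + "\\d+ seconds have elapsed while waiting for replies");

    // Maps each member name to the number of matching warnings it logged.
    static Map<String, Integer> countByMember(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WAITING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The pattern matches any number of seconds, so it still applies if ack-wait-threshold has been changed from its default.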
Another common symptom is hitting the max-connections limit. If you do not normally encounter this in your environment, your system may be feeling the impact of recent events such as a member starting, the loss of a member, or repeated client failures driving up the number of connections in use. The CacheServerMXBean interface provides a getThreadQueueSize() method that you can monitor. If it returns values above your normal range, you may be experiencing an issue. You could set an alert for when the value surpasses some threshold, such as half your max-connections limit, and potentially resolve the problem before it has any cascading impact on the system. See the CacheServerMXBean javadocs for details.
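A minimal sketch of such an alert over JMX follows. It assumes a JMX manager is reachable at the host and port given on the command line, that cache server MBeans are registered under names matching `GemFire:service=CacheServer,*`, and that getThreadQueueSize() is exposed as the attribute `ThreadQueueSize`. The class name, the half-of-max threshold, and the console output are illustrative only.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical monitor: flags cache servers whose thread queue is unusually deep.
class CacheServerQueueCheck {
    // Alert once the queue exceeds half of max-connections (threshold from the text above).
    static boolean exceedsThreshold(int threadQueueSize, int maxConnections) {
        return threadQueueSize > maxConnections / 2;
    }

    // Returns the number of cache server MBeans currently over the threshold.
    static int check(MBeanServerConnection conn, int maxConnections) throws Exception {
        // Assumed ObjectName pattern for CacheServerMXBean instances.
        ObjectName pattern = new ObjectName("GemFire:service=CacheServer,*");
        int alerts = 0;
        for (ObjectName name : conn.queryNames(pattern, null)) {
            Number queued = (Number) conn.getAttribute(name, "ThreadQueueSize");
            if (exceedsThreshold(queued.intValue(), maxConnections)) {
                System.out.println("ALERT " + name + " ThreadQueueSize=" + queued);
                alerts++;
            }
        }
        return alerts;
    }

    // Usage: java CacheServerQueueCheck <jmx-host:port> <max-connections>
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            check(connector.getMBeanServerConnection(), Integer.parseInt(args[1]));
        }
    }
}
```

In practice you would run a check like this on a schedule and route the alert to your monitoring system rather than to standard output.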
Sometimes things are blocked but not deadlocked, and it is often difficult to tell the difference. One area where customers often reach out for assistance is when startup appears hung and persistence is part of the configuration. In such cases, region initialization may be blocked waiting for the member holding the most recent data to come online. The following message indicates that you may need to make sure all necessary nodes have been started in order to unblock initialization and complete the startup.
- [info 2015/09/02 08:51:43.212 UTC gemfire-data-node-cer-lx-gfirep05-49002 tid=0x308] Region State (and any colocated sub-regions) has potentially stale data. Buckets [4, 5 ] are waiting for another offline member to recover the latest data. My persistent id is: DiskStore ID: XYZ Name: gemfire-data-node-gfirep05-49002 Location: /188.8.131.52:/path Offline members with potentially new data: [ DiskStore ID: ABC Location: /184.108.40.206:/otherpath Buckets: [4, 5] ] Use the "gemfire list-missing-disk-stores" command to see all disk stores that are being waited on by other members.
In such cases, read the log message carefully, determine which members need to be started to unblock the recovery of this persistent disk store, and proceed until all such messages have been eliminated from the logs.
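When many members log this message, it can help to pull out just the offline disk stores being waited on. The sketch below extracts the DiskStore ID and Location pairs that follow the "Offline members with potentially new data:" marker. It assumes the field labels appear as in the sample message above; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the offline disk stores named in a stale-data message.
class OfflineDiskStores {
    private static final String MARKER = "Offline members with potentially new data:";
    // Assumed field layout: DiskStore ID: <id> Location: <location>
    private static final Pattern STORE =
        Pattern.compile("DiskStore ID:\\s*(\\S+)\\s+Location:\\s*(\\S+)");

    // Returns "<disk store id> @ <location>" for each offline member listed.
    static List<String> parse(String message) {
        List<String> result = new ArrayList<>();
        int start = message.indexOf(MARKER);
        if (start < 0) {
            return result; // not a stale-data message
        }
        // Only look after the marker, so the member's own persistent id is skipped.
        Matcher m = STORE.matcher(message.substring(start + MARKER.length()));
        while (m.find()) {
            result.add(m.group(1) + " @ " + m.group(2));
        }
        return result;
    }
}
```

The resulting list gives the locations of the disk stores whose members need to be started first.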