When a distributed dead-lock occurs among GemFire members, you will have to restart some and, in the worst case, all of the affected members in the cluster. However, with careful implementation, your GemFire applications can generally avoid such situations.
One common source of distributed dead-locks is code that calls region operations (such as get or put) from within a callback method of a synchronous listener (e.g. a CacheListener). To minimize the possibility of a distributed dead-lock, the best practice is not to call region operations from within synchronous listeners. If you must perform region operations in response to such callbacks, observe the following rules:
- Use the asynchronous listener (AsyncEventListener) or call region operations asynchronously using a SerialExecutor pattern, as introduced at the following Oracle site:
- Avoid connection-level dead-locks by setting conserve-sockets=false in your gemfire.properties file.
- If you are running a GemFire version earlier than 7, please read "Deadlocks occur between query and cache operations when enabling eviction and persistence to disk in GemFire 6.x".
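The SerialExecutor pattern mentioned above can be sketched as follows. This is a minimal illustration using only java.util.concurrent, not GemFire code: the auditRegion map is a hypothetical stand-in for a GemFire Region, and onEntryCreated is a hypothetical stand-in for a synchronous callback such as a CacheListener method. The point is that the callback only queues the follow-up operation and returns, while the queued operations still run one at a time in submission order.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Runs submitted tasks one at a time, in submission order, on a backing executor. */
class SerialExecutor implements Executor {
    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private final Executor backing;
    private Runnable active;

    SerialExecutor(Executor backing) {
        this.backing = backing;
    }

    @Override
    public synchronized void execute(Runnable task) {
        // Wrap the task so that the next queued task is scheduled when it finishes.
        tasks.add(() -> {
            try {
                task.run();
            } finally {
                scheduleNext();
            }
        });
        if (active == null) {
            scheduleNext();
        }
    }

    private synchronized void scheduleNext() {
        active = tasks.poll();
        if (active != null) {
            backing.execute(active);
        }
    }
}

class DeferredRegionOps {
    // Hypothetical stand-in for a GemFire Region (assumption for this sketch).
    static final Map<String, String> auditRegion = new ConcurrentHashMap<>();

    // Hypothetical stand-in for a synchronous listener callback:
    // it queues the follow-up put and returns without waiting for it.
    static void onEntryCreated(Executor serial, String key, String value) {
        serial.execute(() -> auditRegion.put(key, value));
    }

    /** Fires two callbacks for the same key and returns the value stored last. */
    static String runDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Executor serial = new SerialExecutor(pool);
        CountDownLatch done = new CountDownLatch(1);
        try {
            onEntryCreated(serial, "k1", "v1");
            onEntryCreated(serial, "k1", "v2");
            serial.execute(done::countDown); // runs after both puts
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return auditRegion.get("k1");
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Note the trade-off: because the operation is deferred, it has not necessarily completed when the callback returns, so this approach only suits follow-up work that does not need to finish before the original operation does.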
If, despite careful implementation, a distributed dead-lock does happen, causing some or all members of your distributed system to stop responding, you will need to determine which members are dead-locked. To start, review the cache server logs of each member, looking particularly for messages of the form "xx seconds have elapsed while waiting for replies", like the following:
[warning 2013/12/13 14:40:41.921 EST <main> tid=0x1] 15 seconds have elapsed while waiting for replies: <com.gemstone.gemfire.internal.cache.InitialImageOperation$ImageProcessor 15 waiting for 1 replies from [machine1(13345)<v625>:8122/53172]; waiting for 1 messages in-flight; region=/FooRegion; abort=false> on machine2(43338)<v968>:8123/53622 whose current membership list is: [[(13345)<v625>:8122/53172, machine2(43338)<v968>:8123/53622]]
From this example, you can see that the member running on machine2 (pid=43338) is waiting for a reply from another member running on machine1 (pid=13345) regarding /FooRegion. In the case of a distributed dead-lock, you would likely also see that the member on machine1 was, in turn, waiting for the member on machine2, which would give you a solid starting place for further investigation.
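When there are many members and many log files, it may help to extract these "who is waiting for whom" pairs mechanically. The following is a rough sketch only: the regular expression is inferred from the single sample warning above and may need adjusting for other GemFire versions, the comma separator for multiple waited-on members is an assumption, and the class and method names (ReplyWaitParser, parse, hasMutualWait) are hypothetical helpers, not part of GemFire.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplyWaitParser {
    // Inferred from the sample warning only (assumption): captures the elapsed
    // seconds, the bracketed list of members being waited for, and the waiting member.
    private static final Pattern WAIT = Pattern.compile(
        "(\\d+) seconds have elapsed while waiting for replies:"
        + ".*? replies from \\[([^\\]]*)\\]"
        + ".*?> on (\\S+) whose current membership list");

    /** One "waiter is waiting for target" relationship. */
    static final class WaitEdge {
        final String waiter;
        final String target;
        final int seconds;

        WaitEdge(String waiter, String target, int seconds) {
            this.waiter = waiter;
            this.target = target;
            this.seconds = seconds;
        }

        @Override
        public String toString() {
            return waiter + " has waited " + seconds + "s for " + target;
        }
    }

    /** Parses one log line; returns an empty list if it is not a reply-wait warning. */
    static List<WaitEdge> parse(String logLine) {
        List<WaitEdge> edges = new ArrayList<>();
        Matcher m = WAIT.matcher(logLine);
        if (m.find()) {
            int seconds = Integer.parseInt(m.group(1));
            // Assumption: multiple members are separated by commas.
            for (String target : m.group(2).split(",\\s*")) {
                if (!target.isEmpty()) {
                    edges.add(new WaitEdge(m.group(3), target, seconds));
                }
            }
        }
        return edges;
    }

    /** True if at least one pair of members is waiting on each other. */
    static boolean hasMutualWait(List<WaitEdge> edges) {
        for (WaitEdge a : edges) {
            for (WaitEdge b : edges) {
                if (a.waiter.equals(b.target) && a.target.equals(b.waiter)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Pass log lines as arguments; each recognized warning is printed as a pair.
        List<WaitEdge> all = new ArrayList<>();
        for (String line : args) {
            all.addAll(parse(line));
        }
        for (WaitEdge e : all) {
            System.out.println(e);
        }
        System.out.println("mutual wait found: " + hasMutualWait(all));
    }
}
```

A mutual wait reported by this sketch is only a hint about where to look. A long wait can also have other causes, such as a slow or overloaded member, so the thread dumps remain the deciding evidence.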
Further analysis generally requires that you take thread dumps of the dead-locked members. This can be done using the 'jstack' command (included in the JDK), by sending a SIGQUIT signal to the process (UNIX), or by pressing Ctrl-Break in the process's console window (Windows). In the case of jstack, the command line is of the following form:
jstack -l <pid>
From there, you will need to determine which threads are dead-locked. Typically, the dead-locked threads will have stacks calling methods of the ReplyProcessor21 class, like the following:
"Distributed system shutdown hook" prio=6 tid=0x03482000 nid=0x6e0 waiting on
In this case, it appears that the "Distributed system shutdown hook" could be dead-locked, waiting on a reply from another member (which would, in turn, be waiting on a reply from this member).
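For large thread dumps, the search for threads with ReplyProcessor21 frames can be automated. The sketch below assumes the usual jstack text layout, in which each thread block starts with a line beginning with the quoted thread name, followed by stack frames beginning with "at". The class and method names (ReplyWaitThreads, threadsWithFrame) are hypothetical, and matching on the simple class name "ReplyProcessor21" is a deliberately loose heuristic.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

class ReplyWaitThreads {
    /** Returns the names of threads whose stack contains a frame mentioning className. */
    static Set<String> threadsWithFrame(String dump, String className) {
        Set<String> names = new LinkedHashSet<>();
        String current = null;
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (line.startsWith("\"")) {
                // Thread header line: take the text between the first pair of quotes.
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : line;
            } else if (current != null
                    && trimmed.startsWith("at ")
                    && trimmed.contains(className)) {
                names.add(current);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage (assumption): save the output of "jstack -l <pid>" to a file
        // and pass the file path as the first argument.
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        for (String name : threadsWithFrame(dump, "ReplyProcessor21")) {
            System.out.println(name);
        }
    }
}
```

The threads listed this way are candidates only. A thread waiting in ReplyProcessor21 may simply be waiting on a slow member, so compare the results across dumps taken from all of the suspect members.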