GemfireXD 1.4.0 to 1.4.x
This document describes workarounds and a solution for a distributed deadlock that can occur during AsyncEventListener queue recovery when GemfireXD servers are restarting.
GemfireXD servers hang and fail to restart with the following logs:
Server1 log snippet:
[info 2015/05/26 18:22:56.977 CST <CacheServerLauncher#serverConnector> tid=0xd] Region /AsyncEventQueue_Listener2_SERIAL_GATEWAY_SENDER_QUEUE has potentially stale data. It is waiting for another member to recover the latest data. My persistent id: DiskStore ID: 33fe3f15-dae9-4433-8297-b2993b492914 Name: Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver01/./Listener2_DS Members with potentially new data: [ DiskStore ID: 3bfecd04-1cad-43a9-9187-68179bd6307e Name: Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver02/./Listener2_DS ] Use the "gfxd list-missing-disk-stores" command to see all disk stores that are being waited on by other members.
Server2 log snippet:
[info 2015/05/26 18:22:49.850 CST <CacheServerLauncher#serverConnector> tid=0xd] Region AsyncEventQueue_Listener1_SERIAL_GATEWAY_SENDER_QUEUE requesting initial image from 172.16.43.142(106501)<v2>:26755
[warning 2015/05/26 18:23:04.851 CST <CacheServerLauncher#serverConnector> tid=0xd] 15 seconds have elapsed while waiting for replies: <com.gemstone.gemfire.internal.cache.InitialImageOperation$ImageProcessor 51 waiting for 1 replies from [172.16.43.142(106501)<v2>:26755]; waiting for 0 messages in-flight; region=/AsyncEventQueue_Listener1_SERIAL_GATEWAY_SENDER_QUEUE; abort=false> on 172.16.43.142(106671)<v3>:5021 whose current membership list is: [[pivhdsne(106355)<v1>:41338, 172.16.43.142(106501)<v2>:26755, pivhdsne(106229)<v0>:38577, 172.16.43.142(106671)<v3>:5021]]
The second node waits for asynchronous recovery while iterating, and the first node in turn waits for the second node's disk store to recover the latest data. The result is a distributed deadlock.
Workaround 1: Set the system property RECOVER_VALUES_SYNC to TRUE so that AsyncEventListener queue data is recovered synchronously and in order.
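As a minimal sketch of Workaround 1, the property can be passed to the server JVM at startup. This assumes that RECOVER_VALUES_SYNC corresponds to the system property name gemfire.disk.recoverValuesSync (the name used in the underlying GemFire code base; confirm it against your release) and that "gfxd server start" forwards -J arguments to the server JVM. The locator address below is a hypothetical placeholder; the working directory is the one shown in the Server1 log.

```shell
# Working directory of the server being restarted (taken from the Server1 log).
SERVER_DIR=/home/gpadmin/gemfirexdpnp/nodeserver01

# Hypothetical locator address; substitute the locator used by your cluster.
LOCATORS='localhost[10334]'

# Assumption: RECOVER_VALUES_SYNC maps to -Dgemfire.disk.recoverValuesSync.
# The -J prefix hands the option to the server JVM.
gfxd server start -dir="$SERVER_DIR" -locators="$LOCATORS" \
  -J-Dgemfire.disk.recoverValuesSync=true
```

The same -J option would need to be supplied on every server that hosts the AsyncEventListener queues, since each member recovers its own disk store.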
Workaround 2: Use a single AsyncEventListener instead of multiple AsyncEventListeners.
GemfireXD 126.96.36.199 and above include the fix for this issue:
#52317 Do not wait for async recovery while iterating. It was causing a distributed deadlock.