Pivotal Knowledge Base


Distributed Deadlock when restarting the GemfireXD Servers

Applies to

GemfireXD 1.4.0 to 1.4.x

Purpose

This document describes workarounds and a solution for a distributed deadlock that occurs during AsyncEventListener queue recovery when GemfireXD servers are restarting.

Symptom

GemfireXD servers hang and fail to restart, and log the following messages:

Server1 log snippet:

[info 2015/05/26 18:22:56.977 CST <CacheServerLauncher#serverConnector> tid=0xd] Region /AsyncEventQueue_Listener2_SERIAL_GATEWAY_SENDER_QUEUE has potentially stale data. It is waiting for another member to recover the latest data.
 My persistent id:
 
 DiskStore ID: 33fe3f15-dae9-4433-8297-b2993b492914
 Name: 
 Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver01/./Listener2_DS
 
 Members with potentially new data:
 [
 DiskStore ID: 3bfecd04-1cad-43a9-9187-68179bd6307e
 Name: 
 Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver02/./Listener2_DS
 ]
 Use the "gfxd list-missing-disk-stores" command to see all disk stores that are being waited on by other members.

Server2 log snippet:

[info 2015/05/26 18:22:49.850 CST <CacheServerLauncher#serverConnector> tid=0xd] Region AsyncEventQueue_Listener1_SERIAL_GATEWAY_SENDER_QUEUE requesting initial image from 172.16.43.142(106501)<v2>:26755

[warning 2015/05/26 18:23:04.851 CST <CacheServerLauncher#serverConnector> tid=0xd] 15 seconds have elapsed while waiting for replies: <com.gemstone.gemfire.internal.cache.InitialImageOperation$ImageProcessor 51 waiting for 1 replies from [172.16.43.142(106501)<v2>:26755]; waiting for 0 messages in-flight; region=/AsyncEventQueue_Listener1_SERIAL_GATEWAY_SENDER_QUEUE; abort=false> on 172.16.43.142(106671)<v3>:5021 whose current membership list is: [[pivhdsne(106355)<v1>:41338, 172.16.43.142(106501)<v2>:26755, pivhdsne(106229)<v0>:38577, 172.16.43.142(106671)<v3>:5021]]
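
To check whether a server log contains the two messages shown above, a simple text search is usually enough. The following is a minimal sketch, not a Pivotal-provided tool: the function name is hypothetical, and the patterns are taken from the snippets above, so they may need adjusting for other releases.

```shell
# Hypothetical helper: print log lines that match either symptom shown above.
# Arguments: one or more server log files (paths are up to you).
find_deadlock_symptoms() {
  # First pattern: the "potentially stale data" wait reported by Server1.
  # Second pattern: the initial-image reply timeout reported by Server2.
  grep -E 'has potentially stale data|seconds have elapsed while waiting for replies' "$@"
}
```

Run it against each server's log file. If one server reports the stale-data wait while another reports the reply timeout for an AsyncEventQueue region, the hang is probably the one described in this article.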

Root Cause

The second node waits for asynchronous recovery while iterating, which causes a distributed deadlock.

Workaround

  • Workaround 1: Set the system property RECOVER_VALUES_SYNC to true so that AsyncEventListener queue data is recovered synchronously and in order, i.e.:

    -J-Dgemfire.disk.recoverValuesSync=true
  • Workaround 2: Use a single AsyncEventListener instead of multiple AsyncEventListeners.
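
As an illustration of Workaround 1, the JVM option can be passed on the gfxd command line when each server is started. The sketch below wraps this in a shell function. The function name and the directory argument are hypothetical, and any other startup options your deployment needs (locators, ports and so on) are assumed to be supplied by you.

```shell
# Hypothetical wrapper: start a GemfireXD server with synchronous value recovery.
# $1: the server working directory (use your own).
# Any remaining arguments are passed through to gfxd unchanged.
start_server_sync_recovery() {
  dir=$1
  shift
  gfxd server start -dir="$dir" -J-Dgemfire.disk.recoverValuesSync=true "$@"
}
```

For example, start_server_sync_recovery ./nodeserver01 followed by your usual options. The property should be set on every server that hosts the AsyncEventListener queues, since recovery happens on each restarting member.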

Solution

GemfireXD 1.4.1.1 and above include the fix for this issue:

#52317 Do not wait for async recovery while iterating. It was causing a distributed deadlock.
