EOFExceptions or SocketTimeoutExceptions Unexpectedly Occur when Starting Gateway Senders

Environment

Pivotal GemFire 8.2.x and 9.x releases prior to 9.4

Symptom

When starting Gateway Sender nodes, EOFExceptions or SocketTimeoutExceptions occur, accompanied by warning log messages like the following, despite the absence of network or data stream issues. (The exact messages and stack traces vary with the GemFire version.)

[warning 2018/04/16 10:34:30.856 UTC SCache1 <Event Processor for GatewaySender_GwSender1.2> tid=0x5a] Could not connect to: 192.168.100.2:41144
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.geode.cache.client.internal.ClientSideHandshakeImpl.handshakeWithServer(ClientSideHandshakeImpl.java:267)
        at org.apache.geode.cache.client.internal.ConnectionImpl.connect(ConnectionImpl.java:118)
        at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:137)
        at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:259)
        at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:242)
        at org.apache.geode.cache.client.internal.PoolImpl.acquireConnection(PoolImpl.java:910)
  :
[warning 2018/04/16 10:35:14.171 UTC SCache1 <Event Processor for GatewaySender_GwSender1.3> tid=0x5b] Could not connect to: 192.168.100.2:41132
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.net.SocketInputStream.read(SocketInputStream.java:224)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.geode.internal.cache.tier.sockets.HandShake.handshakeWithServer(HandShake.java:1278)
  :
[warning 2018/04/16 18:28:41.122 UTC SCache1 <Event Processor for GatewaySender_sender_1> tid=0x47] Could not connect to: 192.168.100.2:11538
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.net.SocketInputStream.read(SocketInputStream.java:224)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at com.gemstone.gemfire.internal.cache.tier.sockets.HandShake.greet(HandShake.java:1345)
        at com.gemstone.gemfire.cache.client.internal.ConnectionImpl.connect(ConnectionImpl.java:111)

Cause

The log messages above generally indicate network connection issues. If network problems have been ruled out, another possible cause is a problem with the PDX configuration on the Gateway Receiver side. During the connection handshake, the Gateway Sender Event Processor threads attempt to read the PDX registry size from the Gateway Receiver. A misconfiguration in the PDX settings on the Gateway Receiver can prevent this exchange from completing properly, and messages like those above are logged as a result.

Resolution

This issue can be resolved by correcting the misconfiguration on the Gateway Receiver. A common case is a misspelled PDX disk store name in cache.xml, so that the PDX configuration references a non-existent disk store. For example:

<disk-store name="pdxDataStore">
  <disk-dirs>
    <disk-dir>/path/to/pdx_data_store</disk-dir>
  </disk-dirs>
</disk-store>
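<!-- NOTE: the disk-store-name below ("pdxxDataStttore") does not match the disk store declared above ("pdxDataStore") -->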
<pdx read-serialized="true" persistent="true" disk-store-name="pdxxDataStttore" ...

In this case, simply correcting the disk store name in the PDX configuration resolves the issue.
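
For reference, the corrected configuration would look like the following sketch, reusing the hypothetical names from the example above (any pdx attributes elided by "..." in the original would remain unchanged):

<disk-store name="pdxDataStore">
  <disk-dirs>
    <disk-dir>/path/to/pdx_data_store</disk-dir>
  </disk-dirs>
</disk-store>
<!-- disk-store-name now matches the disk store declared above -->
<pdx read-serialized="true" persistent="true" disk-store-name="pdxDataStore" />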

Beginning with GemFire 9.4, this situation is handled by GemFire, and a simple PDX misconfiguration will no longer produce this behavior.

Additional Information

This behavior can continue even after upgrading to GemFire 9.4. In that case, clear the locator's metadata from the locator directories before restarting the clusters.
