Pivotal Knowledge Base

Follow

Node fails to start due to EOFException when requesting configuration from locator

Environment

Pivotal GemFire 8+

Symptom

A GemFire peer may exit immediately on startup with the following exception in its logs, standard output, or, when running under Tomcat, the catalina.out file:

java.io.EOFException: Locator at 10.0.0.11(locator2:25967):10334 did not respond. This is normal if the locator was shutdown. If it wasn't check its log for exceptions.
  at com.gemstone.org.jgroups.stack.tcpserver.TcpClient.requestToServer(TcpClient.java:125)
  at com.gemstone.org.jgroups.stack.tcpserver.TcpClient.requestToServer(TcpClient.java:78)
  at com.gemstone.gemfire.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:171)
  at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.requestAndApplySharedConfiguration(GemFireCacheImpl.java:868)
  at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.init(GemFireCacheImpl.java:1019)
  at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:684)
  at com.gemstone.gemfire.cache.CacheFactory.create(CacheFactory.java:182)
  at com.gemstone.gemfire.cache.CacheFactory.create(CacheFactory.java:229)

Investigating the relevant locator logs reveals a message like:

[error 2017/12/07 22:09:48.877 UTC   tid=0x3f] Error occurred while initializing cluster configuration
java.lang.Exception: Error occurred while initializing cluster configuration

This indicates that the actual issue is with the cluster configuration service on the locator.

Cause

There are a number of possible reasons for the cluster configuration service to fail to start, but we'll focus on one particular case. The cluster configuration service relies on a shared region called the _ConfigurationRegion Any failure to create this region will cause the configuration service to not come up. In particular, issues with the disk store, such as a ConflictingPersistentDataException:

[info 2017/12/07 22:09:48.861 UTC   tid=0x3f] Initializing region _ConfigurationRegion

[warning 2017/12/07 22:09:48.867 UTC   tid=0x3f] Initialization failed for Region /_ConfigurationRegion
com.gemstone.gemfire.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion remote member 10.0.0.10(16303:locator):29734 with persistent data /10.0.0.10:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1512684585365 version 0 diskStoreId 3aba183676bc4730-99cf92cee56428a9 name  was not part of the same distributed system as the local data from /10.0.0.11:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1477323218275 version 0 diskStoreId d73a8b008ea14ed1-8f53d86370cc5ab5 name 
  at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:714)
  at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:786)
  at com.gemstone.gemfire.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:47)
  at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1381)
  at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1209)
  at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:2994)
  at com.gemstone.gemfire.distributed.internal.SharedConfiguration.getConfigurationRegion(SharedConfiguration.java:635)
  at com.gemstone.gemfire.distributed.internal.SharedConfiguration.initSharedConfiguration(SharedConfiguration.java:272)
  at com.gemstone.gemfire.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:726)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:693)
  at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:964)
  at java.lang.Thread.run(Thread.java:745)

[error 2017/12/07 22:09:48.877 UTC   tid=0x3f] Error occurred while initializing cluster configuration
java.lang.Exception: Error occurred while initializing cluster configuration
  at com.gemstone.gemfire.distributed.internal.SharedConfiguration.getConfigurationRegion(SharedConfiguration.java:646)
  at com.gemstone.gemfire.distributed.internal.SharedConfiguration.initSharedConfiguration(SharedConfiguration.java:272)
  at com.gemstone.gemfire.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:726)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:693)
  at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:964)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.gemstone.gemfire.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion remote member 10.0.0.10(16303:locator):29734 with persistent data /10.0.0.10:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1512684585365 version 0 diskStoreId 3aba183676bc4730-99cf92cee56428a9 name  was not part of the same distributed system as the local data from /10.0.0.11:/opt/gemfire-locator/ConfigDiskDir_ created at timestamp 1477323218275 version 0 diskStoreId d73a8b008ea14ed1-8f53d86370cc5ab5 name 
  at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:714)
  at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:786)
  at com.gemstone.gemfire.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:47)
  at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1381)
  at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1209)
  at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:2994)
  at com.gemstone.gemfire.distributed.internal.SharedConfiguration.getConfigurationRegion(SharedConfiguration.java:635)
  ... 9 more
  

Resolution

In this case, the other locator, running locator, 10.0.0.10(16303:locator):29734, will have more current data, which this locator will recover from. Hence, the solution is to shutdown the problematic locator, remove its persistence files, restart it, and allow it to recover the _ConfigurationRegion from the healthy locator.

After doing so, the logs of the restarted locator should indicate that the region was successfully recovered from the other locater and the cluster configuration service started. Something like:

[info 2017/12/07 22:09:45.350 UTC  tid=0x7a] Region /_ConfigurationRegion requesting initial image from 10.0.0.10(16303:locator):29734

[info 2017/12/07 22:09:45.360 UTC  tid=0x7a] /_ConfigurationRegion is done getting image from 10.0.0.10(16303:locator):29734. isDeltaGII is false

[info 2017/12/07 22:09:45.375 UTC   tid=0x7a] Initialization of region _ConfigurationRegion completed

[info 2017/12/07 22:09:45.391 UTC   tid=0x7a] Completed writing the shared configuration to 'cluster_config' directory

[info 2017/12/07 22:09:45.391 UTC   tid=0x7a] Getting Jar files from other locators

[info 2017/12/07 22:09:45.411 UTC   tid=0x7a] No deployed jars found on other locators.

[info 2017/12/07 22:09:45.415 UTC   tid=0x7a] ConfigRequestHandler installed

[info 2017/12/07 22:09:45.416 UTC   tid=0x7a] Cluster configuration service start up completed successfully and is now running ....

Additional Information

For additional information on troubleshooting configuration service issues, see the relevant section in the User's Guide.

Comments

Powered by Zendesk