Pivotal Knowledge Base

Failed to start PHD cluster with icm_client after re-deployment

Environment

  • PHD 1.1.1 and PHD 2.x with NameNode HA

Symptom
After a PHD cluster is uninstalled with "icm_client uninstall" and deployed again with "icm_client deploy", starting the cluster with "icm_client start" sometimes fails with the following errors.

[gpadmin@admin ~]$ icm_client start -l phd1
Starting services
Starting cluster
[================= ] 17%
[ERROR] Failed to start the cluster. Reason: Server error:
Return Code : 5000
Message : Cluster Start Error
Details :
Admin Host :
Operation Error : Error while calling start for role namenode. null
massh /tmp/tmp.ghrXxgP5CP verbose 'sudo /etc/init.d/hadoop-hdfs-namenode status || sudo /etc/init.d/hadoop-hdfs-namenode start'
{}
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
ERROR_RESPONSE{"RESOLUTION": "Please check component log file for more details.", "OPERATION_CODE": "COMPONENT_START_ERROR", "LOG_FILE": "/var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-hdm1.hadoop.local.log, /var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-hdm1.hadoop.local.out", "OPERATION_ERROR": "Failed to start component namenode", "FAILED_HOSTS": "hdm1.hadoop.local"}

Log File : /var/log/gphd/gphdmgr/gphdmgr-webservices.log

The NameNode log on the failed host shows the following error messages.

2014-10-06 17:09:02,249 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2014-10-06 17:09:02,250 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2014-10-06 17:09:02,250 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2014-10-06 17:09:02,251 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /data/nn/dfs/name state: NOT_FORMATTED
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:285)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:200)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2014-10-06 17:09:02,255 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-10-06 17:09:02,258 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
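
To confirm that a failed start matches this symptom, the NameNode log can be searched for the NOT_FORMATTED message. The following is a minimal sketch; the helper name unformatted_name_dirs is hypothetical and the pattern is based only on the message format shown in the excerpt above.

```shell
# Hypothetical helper: print each storage directory that a NameNode log
# reports as NOT_FORMATTED. $1 is the path to the NameNode log file.
unformatted_name_dirs() {
    # Keep only the path between "Storage Directory" and "state: NOT_FORMATTED",
    # then drop duplicates in case the start was attempted more than once.
    sed -n 's/.*Storage Directory \(.*\) state: NOT_FORMATTED.*/\1/p' "$1" | sort -u
}
```

On the failed host it could be called with the log file named in the error response, for example unformatted_name_dirs /var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-hdm1.hadoop.local.log.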

Cause
The issue can have either of the following causes.
1) After uninstalling the cluster, the user cleans the data under dfs.namenode.name.dir on the NameNode but forgets to clean the data under dfs.journalnode.edits.dir on the JournalNodes. In this case icm_client does not format HDFS, because data is still present in dfs.journalnode.edits.dir on the JournalNodes.
2) The user realizes that "icm_client start" failed because dfs.journalnode.edits.dir was not cleaned on the JournalNodes and then performs the cleanup, but "icm_client start" fails again with the same error. This happens because the first "icm_client start" already made changes in the Command Center database, and as a result later "icm_client start" runs do not format the NameNode.
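
Both causes come down to data left behind in a storage directory, so it can help to check each directory before redeploying. The sketch below is only an illustration; the helper name dir_has_data is hypothetical, and it has to be run on each NameNode and JournalNode host against the paths configured for that cluster.

```shell
# Hypothetical helper: succeed (exit status 0) if the directory $1 exists
# and contains at least one entry, including hidden files.
dir_has_data() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}
```

For example, on the NameNode from the log above, dir_has_data /data/nn/dfs/name && echo "leftover data" would flag a name directory that was not cleaned. The JournalNode path is site specific and must be taken from dfs.journalnode.edits.dir in the cluster configuration.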

Fix
Both cases can be resolved with the following steps.
a. Run "icm_client stop -l <cluster>" to stop the cluster.
b. Run "icm_client uninstall -l <cluster>" to uninstall it.
c. Remove all data under dfs.namenode.name.dir on the NameNode and under dfs.journalnode.edits.dir on the JournalNodes.
d. Run "icm_client deploy -c <configDir>" to deploy the cluster again.
e. Run "icm_client start -l <cluster>" to start the cluster.
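
The steps above can be sketched as a small wrapper script. This is a sketch under assumptions, not a supported tool: the function names run, teardown_cluster and redeploy_cluster and the DRY_RUN variable are hypothetical, and only the icm_client invocations quoted in the steps are used. Step c is deliberately left manual because it deletes data on several hosts.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps a and b: stop the cluster, then uninstall it. $1 is the cluster name.
# Assumption: a failed stop should prevent the uninstall from running.
teardown_cluster() {
    run icm_client stop -l "$1" &&
    run icm_client uninstall -l "$1"
}

# Step c is not automated here: remove the data under dfs.namenode.name.dir
# on the NameNode and under dfs.journalnode.edits.dir on every JournalNode
# before calling redeploy_cluster.

# Steps d and e: deploy from the config directory, then start the cluster.
# $1 is the cluster name, $2 is the configuration directory.
redeploy_cluster() {
    run icm_client deploy -c "$2" &&
    run icm_client start -l "$1"
}
```

Setting DRY_RUN=1 before calling teardown_cluster phd1 prints the two commands without running them, which is a cheap way to review the sequence first.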
