Pivotal Knowledge Base


Starting a PHD cluster fails with "namespaceID is incompatible with others"

Environment

Product: Pivotal Hadoop
Version: 2.x
OS: RHEL 6.x

Problem

Starting the PHD cluster with icm_client failed while trying to start the NameNode service.

[dev-adn01:gpadmin/1016]$icm_client start -l devsipo
Starting services
Starting cluster
[========= ] 9%
[ERROR] Failed to start the cluster. Reason: Server error:
Return Code : 5000
Message : Cluster Start Error
Details :
Admin Host :
Operation Error : Error while calling start for role namenode. null
massh /tmp/tmp.LAqssd0uO4 verbose 'sudo /etc/init.d/hadoop-hdfs-namenode status || sudo /etc/init.d/hadoop-hdfs-namenode start'
{}
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
ERROR_RESPONSE{"RESOLUTION": "Please check component log file for more details.", "OPERATION_CODE": "COMPONENT_START_ERROR", "LOG_FILE": "/var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-dev-nmn01.ccl.local.log, /var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-dev-nmn01.ccl.local.out", "OPERATION_ERROR": "Failed to start component namenode", "FAILED_HOSTS": "dev-nmn01.ccl.local"}
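The ERROR_RESPONSE above lists the NameNode log files under LOG_FILE. One way to locate the root cause is to search them for FATAL entries, for example (the log path is taken from the error output; the host name will differ on other clusters):

grep -B 1 -A 15 FATAL /var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-dev-nmn01.ccl.local.log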

Messages in the NameNode log file show that the NameNode service failed to start due to the error "Directory /data/nfs_nn/dfs/name is in an inconsistent state: namespaceID is incompatible with others."

2015-02-16 09:59:56,701 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data/nfs_nn/dfs/name is in an inconsistent state: namespaceID is incompatible with others.
at org.apache.hadoop.hdfs.server.common.Storage.setNamespaceID(Storage.java:1093)
at org.apache.hadoop.hdfs.server.common.Storage.setFieldsFromProperties(Storage.java:891)
at org.apache.hadoop.hdfs.server.namenode.NNStorage.setFieldsFromProperties(NNStorage.java:585)
at org.apache.hadoop.hdfs.server.common.Storage.readProperties(Storage.java:921)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:304)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:200)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)

Cause

Two directories are specified in hdfs-site.xml for the parameter dfs.namenode.name.dir on the NameNode.

<property>
<name>dfs.namenode.name.dir</name>
<value>/data/nn/dfs/name,/data/nfs_nn/dfs/name</value>
</property>
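To confirm which name directories are in effect, the value can be checked on the NameNode host, for example (this assumes the usual PHD configuration directory /etc/gphd/hadoop/conf; adjust the path if your installation differs):

hdfs getconf -confKey dfs.namenode.name.dir
grep -A 2 'dfs.namenode.name.dir' /etc/gphd/hadoop/conf/hdfs-site.xml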

However, the VERSION files in the two directories contain different values for namespaceID, as well as for some other entries (such as clusterID and blockpoolID).

[dev-nmn01:root/1042]#cat /data/nfs_nn/dfs/name/current/VERSION
#Tue Oct 28 18:02:39 JST 2014
namespaceID=887761746
clusterID=CID-5cd55986-9e64-47e8-9b2e-d2e005108bf3
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1395184149-10.22.242.41-1414486959034
layoutVersion=-47
[dev-nmn01:root/1043]#cat /data/nn/dfs/name/current/VERSION
#Sat Dec 20 12:38:51 JST 2014
namespaceID=725749472
clusterID=CID-d80ce747-584a-47fe-acf8-67fb34520b4b
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1789713236-10.22.242.41-1414764251434
layoutVersion=-47
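A quick way to spot the mismatch is to compare the two VERSION files directly:

diff /data/nn/dfs/name/current/VERSION /data/nfs_nn/dfs/name/current/VERSION

Here the namespaceID, clusterID, and blockpoolID values all differ, which is why the NameNode refuses to start.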

It was eventually found that the directory /data/nfs_nn/dfs/name contained out-of-date data, most likely left over from a disk replacement in which the new drive had previously been used in another Hadoop cluster. The cause of this issue was therefore the invalid data in /data/nfs_nn/dfs/name.

Fix

As the data in /data/nn/dfs/name is valid, empty /data/nfs_nn/dfs/name first and then start the NameNode service; the NameNode will automatically replicate its metadata from /data/nn/dfs/name to /data/nfs_nn/dfs/name.

The steps are as follows (a command example is shown after the list):

1. Rename or remove /data/nfs_nn/dfs/name/current

2. Start the cluster again with "icm_client start"
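For reference, a minimal command sketch of these steps, using the paths, hosts, and cluster name shown above (renaming rather than deleting the stale directory, so it can be restored if ever needed):

# On the NameNode host (dev-nmn01), as root:
mv /data/nfs_nn/dfs/name/current /data/nfs_nn/dfs/name/current.bad

# On the admin host, as gpadmin:
icm_client start -l devsipo

Once the NameNode starts, it repopulates /data/nfs_nn/dfs/name from the valid copy in /data/nn/dfs/name, as described above.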
