Pivotal Knowledge Base

NameNode is Unresponsive and Streaming "Recovering Lease" Messages

Environment

Product         Version
Pivotal HD      2.x
Pivotal HAWQ    1.2

Symptom 

Pivotal HAWQ queries may fail because the Pivotal HD NameNode has become unresponsive (it does not respond to basic commands such as hdfs dfs -ls /).

The NameNode log in /var/log/gpdb/hadoop-hdfs/ will be streaming a large number of messages like the following:

2017-01-05 11:33:45,923 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: libhdfs3_client_random_511975289_count_1_pid_776687_tid_140094310148128, pendingcreates: 1], src=/hawq_data/gpseg14/16385/76836054/79884472.1
2017-01-05 11:33:45,923 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: libhdfs3_client_random_511975289_count_1_pid_776687_tid_140094310148128, pendingcreates: 1] has expired hard limit
2017-01-05 11:33:45,923 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: libhdfs3_client_random_511975289_count_1_pid_776687_tid_140094310148128, pendingcreates: 1], src=/hawq_data/gpseg14/16385/76836054/79884472.1
2017-01-05 11:33:45,923 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: libhdfs3_client_random_511975289_count_1_pid_776687_tid_140094310148128, pendingcreates: 1] has expired hard limit
2017-01-05 11:33:45,923 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: libhdfs3_client_random_511975289_count_1_pid_776687_tid_140094310148128, pendingcreates: 1], src=/hawq_data/gpseg14/16385/76836054/79884472.1
2017-01-05 11:33:45,923 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: libhdfs3_client_random_511975289_count_1_pid_776687_tid_140094310148128, pendingcreates: 1] has expired hard limit

Cause

This is due to a software defect (https://issues.apache.org/jira/browse/HDFS-4882), in which the NameNode can become unresponsive when a large number of leases expire on files in HDFS.

Resolution

1. Attempt to shut down the NameNode using service hadoop-hdfs-namenode stop.

2. If the service command is unable to stop the NameNode, it may be necessary to kill the NameNode process directly with the kill command, as shown in the sketch below.
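
The following is a minimal sketch of steps 1 and 2, assuming the standard packaged init script, a single NameNode process on the host, and that the NameNode JVM can be identified by its main class name on the process command line:

# Step 1: attempt a clean shutdown through the init script
service hadoop-hdfs-namenode stop

# Step 2: if the process is still running, kill it directly
NN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode)
if [ -n "$NN_PID" ]; then
    kill "$NN_PID"                                        # request a normal JVM exit
    sleep 30
    kill -0 "$NN_PID" 2>/dev/null && kill -9 "$NN_PID"    # force-kill only if still alive
fi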

3. Restart the NameNode: start the hadoop-hdfs-namenode service, but do NOT allow any applications to write to HDFS yet.
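
A sketch of step 3; the hdfs dfs -ls / check is the same basic command mentioned in the Symptom section and confirms that the NameNode is responding again before any applications are allowed to write:

# Step 3: start the NameNode again (keep HAWQ and any other HDFS writers stopped)
service hadoop-hdfs-namenode start

# Confirm the NameNode now responds to a basic metadata operation
hdfs dfs -ls /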

4. Once the NameNode has started, wait for one hour and then run fsck with the -openforwrite option to check whether any files are stuck in the OPENFORWRITE state:

hdfs fsck / -files -blocks -locations -openforwrite | grep OPENFORWRITE

5. If there are no files stuck in OPENFORWRITE, the system can be used normally. 

6. If there are files stuck in OPENFORWRITE, they must be cleaned up; otherwise, the NameNode will become unresponsive again after a few days.
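
One way to clean up the stuck files is to force lease recovery on each path reported by fsck. The sketch below assumes that the hdfs debug recoverLease command is available in your HDFS release (it was added in later Apache Hadoop 2.x versions) and that the file path is the first field of each matching fsck output line:

# Force lease recovery on every file still reported as OPENFORWRITE
hdfs fsck / -files -blocks -locations -openforwrite \
    | grep OPENFORWRITE \
    | awk '{print $1}' \
    | while read -r path; do
          hdfs debug recoverLease -path "$path" -retries 3
      done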
