10 cluster nodes
A hive job which needs to process 40TB of data set failed with the following error
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544) at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157) ... 8 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-svcckp/hive_2014-08-12_21-58-23_567_8729876893578629726-1/_task_tmp.-ext-10002/country_code=US/base_div_nbr=1/retail_channel_code=1/visit_date=2011-12-22/load_timestamp=20140810001530/_tmp.002881_0: File does not exist. Holder DFSClient_attempt_1407874976831_0233_m_000178_0_-1560055874_1 does not have any open files. at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2932) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2738) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2646) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555) .....
Also, the customer reported that the same type job worked well before.
Root cause and Solution
Based on our observation, the problem may be triggered by the following issue:
The job needs to open more files which exceeded the maximum number of open files specified in the ulimit. For the customer who reported this issue, the problem was triggered by a fact that half of their Hadoop cluser nodes were dead because of power supply failure. Therefor, the rest of the nodes need to handle more workflow which requires more files to be open. The maximum number of open files on each node was not able to handle the extra workloads.
To confirm the issue, you may run the following command on a few nodemanager hosts while the problem is happening.
lsof |wc -l
When you see this problem, consider increasing the ulimit maximum number of open files for the user runs the job on all nodemanager hosts.
Note: for a secured cluster, the job owner is the user who launched the job. For non-secure, yarn user owns the job.
An example of setting maximum number of open files to 3 million for gpadmin user:
Add the following line in /etc/security/limit.conf gpadmin nofile 3000000
Then login as gpadmin and run ulimit -n 3000000
To make the permanent change, add the following line in /etc/security/limit.d/gpadmin. Create this file if not exist.
gpadmin - nofile 3000000
To confirm the change, run following command as gpadmin
To confirm the change for nologin users, run the following command ( user yarn user as an example)
sudo -u yarn sh -c "ulimit -a && exec su -u yarn"