Pivotal Knowledge Base


Large Hive query fails with "Container killed on request"

Environment

  • PHD 2.1.x
  • PHD 3.x

Symptom

In this case, the ApplicationMaster suddenly reports around 33,000 "Container killed on request" messages to stdout:

Container killed on request. Exit code is 143

Container killed on request. Exit code is 143

Job failed as tasks failed. failedMaps:0 failedReduces:1

The container logs are misleading and report seemingly unrelated error conditions, such as "File does not exist":

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-svcckp/hive_2015-04-03_22-17-06_046_7422088098942342020-1/_task_tmp.-ext-10000/base_div_nbr=1/retail_channel_code=1/year_nbr=2013/qtr_nbr=2/visit_date=2013-07-05/_tmp.001367_0: File does not exist. Holder DFSClient_attempt_1426272186088_146151_r_001367_0_-2103711889_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2932)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2978)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:611)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:434)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:63013)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

......

The reducer reports errors while processing a row; a stack trace follows (note: the key/value message is truncated):

2015-04-03 21:09:18,484 FATAL [main] ExecReducer: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{....},"value":{...}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:258)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
:

        at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:311)
        ... 6 more
        

Cause

The user was running a job that normally succeeds, but in this case the input dataset had grown by a factor of three. As a result, the ApplicationMaster launched over 30,000 containers while still running with its default memory allocation of 2 GB. The ApplicationMaster spent much of its time in GC pauses and simply could not keep up with that number of containers.
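To check what the ApplicationMaster is currently allocated, the value in effect can be printed from the Hive CLI; this is a quick check under the assumption that the Hive CLI is available on the client node, not a step from the original report:

hive -e "set yarn.app.mapreduce.am.resource.mb;"

GC pressure in the ApplicationMaster can also be confirmed by adding standard JVM GC logging flags (for example -verbose:gc) to yarn.app.mapreduce.am.command-opts and re-running the job.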

Fix

Use the MapReduce parameter "yarn.app.mapreduce.am.resource.mb" to increase the ApplicationMaster memory to a higher value. You can review this KB article to determine the best value for this parameter. An example of applying the setting is shown below.
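As a rough sketch, the parameter can be raised for a single Hive session before the query is submitted, or cluster-wide in mapred-site.xml. The 4096 MB value and the matching yarn.app.mapreduce.am.command-opts heap setting below are illustrative assumptions only, not recommendations; use the sizing guidance referenced above to pick the actual value.

Per-session, from the Hive CLI before running the query (example values):

SET yarn.app.mapreduce.am.resource.mb=4096;
SET yarn.app.mapreduce.am.command-opts=-Xmx3276m;

Or cluster-wide, in mapred-site.xml on the client/gateway node (example values):

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx3276m</value>
</property>

The heap passed via yarn.app.mapreduce.am.command-opts is commonly set to roughly 80% of the container size so that the JVM fits inside the allocated container.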
