Pivotal Knowledge Base

Oozie logs report "Unknown hadoop job" and history server UI not populated

Symptom

An Oozie workflow fails, and its logs report "Unknown hadoop job":

oozie job -oozie http://pccadmin.phd.local:11000/oozie -log 0000000-140612202641089-oozie-oozi-W

2014-06-12 20:27:41,298 WARN MapReduceActionExecutor:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000000-140612202641089-oozie-oozi-W] ACTION[0000000-140612202641089-oozie-oozi-W@mr-node] Exception in check(). Message[JA017: Unknown hadoop job [job_1402629939350_0001] associated with action [0000000-140612202641089-oozie-oozi-W@mr-node]. Failing this action!]
org.apache.oozie.action.ActionExecutorException: JA017: Unknown hadoop job [job_1402629939350_0001] associated with action [0000000-140612202641089-oozie-oozi-W@mr-node]. Failing this action!
at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1009)

Cause

(Note: This article covers only the logging interaction between the History Server and Oozie. For more on YARN logs, please refer to the article https://support.gopivotal.com/hc/en-us/articles/201925118-How-to-find-and-review-logs-for-yarn-mapreduce-jobs.)

When an Oozie workflow is executed, Oozie polls the status of the launched job. While the job is running, Oozie reports its status as running. After the job completes, Oozie queries the History Server for the final status; if the job ID is not found there, Oozie cannot retrieve the status and marks the workflow as failed. However, the workflow may have finished successfully and its output will be available; the Resource Manager will also report the application's status as FINISHED / SUCCEEDED.

After a MapReduce job completes, its logs are written to HDFS under the directory specified by the mapreduce.jobhistory.intermediate-done-dir parameter in /etc/gphd/hadoop/conf/mapred-site.xml.

The History Server continuously scans the intermediate directory, picks up any new logs, and copies them to the directory specified by the mapreduce.jobhistory.done-dir parameter in mapred-site.xml. From this location, the History Server reads the logs and displays them in the History Server UI.
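To see which directories a given daemon will actually use, it can help to read the relevant properties straight out of the configuration file. Below is a minimal illustrative sketch (not a Hadoop tool; the helper name `jobhistory_dirs` is made up for this example) that parses a Hadoop *-site.xml blob and extracts the two JobHistory directory settings:

```python
import xml.etree.ElementTree as ET

def jobhistory_dirs(xml_text):
    """Extract the two JobHistory directory settings from a Hadoop *-site.xml blob."""
    dirs = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name in ("mapreduce.jobhistory.intermediate-done-dir",
                    "mapreduce.jobhistory.done-dir"):
            dirs[name] = prop.findtext("value")
    return dirs

# Sample config with the default-style layout described in this article.
sample = """<configuration>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/user/history/done_intermediate</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/user/history/done</value>
  </property>
</configuration>"""

print(jobhistory_dirs(sample))
```

Running this against the actual file on each node (e.g. /etc/gphd/hadoop/conf/mapred-site.xml) shows at a glance where jobs will drop their history files and where the History Server will look for them.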

If the NodeManagers and the History Server do not use the same location for the intermediate directory, application logs will not be found by the History Server; they will therefore never be moved to mapreduce.jobhistory.done-dir and will not appear in the History Server UI.

Below is the snippet from JavaActionExecutor.java that bails out with this error when the job is not found (null is returned):

RunningJob runningJob = getRunningJob(context, action, jobClient);
if (runningJob == null) {
    context.setExternalStatus(FAILED);
    context.setExecutionData(FAILED, null);
    throw new ActionExecutorException(ActionExecutorException.ErrorType.FAILED, "JA017",
        "Unknown hadoop job [{0}] associated with action [{1}]. Failing this action!",
        action.getExternalId(), action.getId());
}

In this particular case, yarn-site.xml and mapred-site.xml specified different values for the parameter mapreduce.jobhistory.intermediate-done-dir on all nodes of the cluster.

Incorrect Parameters:

yarn-site.xml
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/user/history/done</value>
</property>

mapred-site.xml
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/user/history/done_intermediate</value>
</property>

The mapreduce.jobhistory.intermediate-done-dir value from yarn-site.xml was being picked up on the NodeManagers, so the MapReduce jobs pushed their logs to the HDFS directory /user/history/done/${username}. The History Server, however, was looking for the logs at /user/history/done_intermediate/${username}, resulting in this behavior.

In this case, the History Server reported no error, since from its point of view no logs had been written to the directory it was scanning.

Fix

1. Ensure that the following two parameters have the same values across all the nodes:

  1. mapreduce.jobhistory.intermediate-done-dir
  2. mapreduce.jobhistory.done-dir

2. Restart the YARN services and the History Server.
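As an illustrative way to verify step 1, the configuration files collected from each node (and from both mapred-site.xml and yarn-site.xml, if the properties appear in both) can be compared mechanically. The sketch below uses made-up helper names and inline sample data; gathering each node's actual files (e.g. with scp) is left to the operator. The sample reproduces the mismatch described in this article:

```python
import xml.etree.ElementTree as ET

# The two parameters that must be consistent everywhere, per the Fix above.
CHECKED = ("mapreduce.jobhistory.intermediate-done-dir",
           "mapreduce.jobhistory.done-dir")

def checked_values(xml_text):
    """Pull the checked parameters out of one Hadoop *-site.xml blob."""
    out = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name in CHECKED:
            out[name] = prop.findtext("value")
    return out

def find_mismatches(configs):
    """configs maps a source label (host or file name) to its XML text.
    Returns the parameters whose values disagree, grouped by value."""
    per_param = {}
    for source, xml_text in configs.items():
        for name, value in checked_values(xml_text).items():
            per_param.setdefault(name, {}).setdefault(value, []).append(source)
    return {name: by_value for name, by_value in per_param.items()
            if len(by_value) > 1}

# Sample data mirroring the misconfiguration from this article.
yarn_site = """<configuration><property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/user/history/done</value>
</property></configuration>"""
mapred_site = """<configuration><property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/user/history/done_intermediate</value>
</property></configuration>"""

print(find_mismatches({"yarn-site.xml": yarn_site, "mapred-site.xml": mapred_site}))
```

An empty result means every source agrees on both parameters; any non-empty result pinpoints which file or host disagrees and on which value.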

 

 
 

Comments

  • Pranab Dash

    Hey Bhuvnesh, this is a little confusing WRT which file both the params should be configured in. I have a PHD 2.1 installation (with name node and secondary name node in automatic HA). When I checked the files, I found both the params "mapreduce.jobhistory.intermediate-done-dir" and "mapreduce.jobhistory.done-dir" configured in the "mapred-site.xml" file as follows.

    Host : Namenode :
    path : /etc/gphd/hadoop/conf/mapred-site.xml
    <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/user/history/done_intermediate</value>
    </property>

    <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/user/history/done</value>
    </property>

    I did not find either of the above two params in the "yarn-site.xml" file though.

    As per the resolution provided, I have changed the value of the "mapreduce.jobhistory.intermediate-done-dir" param to "/user/history/done" as follows (on all the nodes, including the name node, secondary name node, and the datanodes) and restarted the whole cluster using icm_client.

    Host : Namenode
    path : /etc/gphd/hadoop/conf/mapred-site.xml

    <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/user/history/done</value>
    </property>

    <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/user/history/done</value>
    </property>

    Still no luck, and the MR jobs created through Oozie are failing with the following error. Could you please clarify whether the above two property entries also need to be inside the "yarn-site.xml" file?

    -===============================================================================================================================================================================
    [root@liacm209 oozie-example]# su - gpadmin
    [gpadmin@liacm209 ~]$ cd /tmp/oozie-example/
    [gpadmin@liacm209 oozie-example]$ oozie job -oozie http://10.63.33.210:11000/oozie -config ./examples/apps/map-reduce/job.properties -run
    job: 0000009-150114055646397-oozie-oozi-W
    [gpadmin@liacm209 oozie-example]$ oozie job -oozie http://10.63.33.210:11000/oozie -log 0000009-150114055646397-oozie-oozi-W
    2015-01-15 05:17:05,431 INFO ActionStartXCommand:539 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@:start:] Start action [0000009-150114055646397-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
    2015-01-15 05:17:05,431 WARN ActionStartXCommand:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@:start:] [0000009-150114055646397-oozie-oozi-W@:start:]Action status=DONE
    2015-01-15 05:17:05,431 WARN ActionStartXCommand:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@:start:] [0000009-150114055646397-oozie-oozi-W@:start:]Action updated in DB!
    2015-01-15 05:17:05,517 INFO ActionStartXCommand:539 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] Start action [0000009-150114055646397-oozie-oozi-W@mr-node] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
    2015-01-15 05:17:05,733 WARN MapReduceActionExecutor:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] credentials is null for the action
    2015-01-15 05:17:06,089 WARN MapReduceActionExecutor:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] Exception in check(). Message[JA017: Unknown hadoop job [job_local287356551_0009] associated with action [0000009-150114055646397-oozie-oozi-W@mr-node]. Failing this action!]
    org.apache.oozie.action.ActionExecutorException: JA017: Unknown hadoop job [job_local287356551_0009] associated with action [0000009-150114055646397-oozie-oozi-W@mr-node]. Failing this action!
    at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1009)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:947)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:215)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:60)
    at org.apache.oozie.command.XCommand.call(XCommand.java:280)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    2015-01-15 05:17:06,090 WARN ActionStartXCommand:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] Error starting action [mr-node]. ErrorType [FAILED], ErrorCode [JA017], Message [JA017: Unknown hadoop job [job_local287356551_0009] associated with action [0000009-150114055646397-oozie-oozi-W@mr-node]. Failing this action!]
    org.apache.oozie.action.ActionExecutorException: JA017: Unknown hadoop job [job_local287356551_0009] associated with action [0000009-150114055646397-oozie-oozi-W@mr-node]. Failing this action!
    at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1009)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:947)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:215)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:60)
    at org.apache.oozie.command.XCommand.call(XCommand.java:280)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    2015-01-15 05:17:06,090 WARN ActionStartXCommand:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] Failing Job due to failed action [mr-node]
    2015-01-15 05:17:06,091 WARN LiteWorkflowInstance:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] Workflow Failed. Failing node [mr-node]
    2015-01-15 05:17:06,237 INFO CallbackServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] callback for action [0000009-150114055646397-oozie-oozi-W@mr-node]
    2015-01-15 05:17:06,243 ERROR CompletedActionXCommand:536 - USER[-] GROUP[-] TOKEN[] APP[-] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] XException,
    org.apache.oozie.command.CommandException: E0800: Action it is not running its in [PREP] state, action [0000009-150114055646397-oozie-oozi-W@mr-node]
    at org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85)
    at org.apache.oozie.command.XCommand.call(XCommand.java:251)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    2015-01-15 05:17:07,111 WARN CoordActionUpdateXCommand:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[0000009-150114055646397-oozie-oozi-W@mr-node] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
    2015-01-15 05:17:07,153 INFO KillXCommand:539 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[-] STARTED WorkflowKillXCommand for jobId=0000009-150114055646397-oozie-oozi-W
    2015-01-15 05:17:07,167 WARN CoordActionUpdateXCommand:542 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
    2015-01-15 05:17:07,168 INFO KillXCommand:539 - USER[gpadmin] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000009-150114055646397-oozie-oozi-W] ACTION[-] ENDED WorkflowKillXCommand for jobId=0000009-150114055646397-oozie-oozi-W
    [gpadmin@liacm209 oozie-example]$
    Connection closed by foreign host.

    =====================================================================================================================================================================================

  • Pranab Dash

    Further to the above, I would like to add the following information:

    • The MR jobs created by Oozie are not listed in the job history server

    • When I pass the following command (as gpadmin), I don't see any .jhist file

      hdfs dfs -ls /user/history/done/2015/01/14/000000

    • When I run any other MR job (not through Oozie), it completes successfully, is listed in the history server, and the corresponding .jhist file is created under the /user/history/done folder

    If you can please help me fix the issue, it would be really nice. Please let me know if you need any further information.

  • Bhuvnesh

    Pranab, each of these parameters should have consistent values on all the cluster nodes, not only on the namenode. However, the value for mapreduce.jobhistory.done-dir should be different from mapreduce.jobhistory.intermediate-done-dir. I will update the article with an example to make it clearer.

    The intention is that the value of mapreduce.jobhistory.intermediate-done-dir should not differ between any two nodes; the same applies to mapreduce.jobhistory.done-dir.
    They are required only in mapred-site.xml. [But in case you have them in both files, ensure they are consistent.]

    You can set these parameters in mapred-site.xml alone on all the nodes. Example below; please correct them and restart:
    <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/user/history/done</value>
    </property>
    <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/user/history/done_intermediate</value>
    </property>
    Did you see any errors in the history server logs?
