Pivotal Knowledge Base

Fair scheduler stops submitting jobs

Environment

PHD 1.1.1

PHD 2.0.1

The customer uses the YARN Fair Scheduler to submit MapReduce jobs, and the workload is consistently high.

Observations

In the ResourceManager log, a job that the Fair Scheduler accepts successfully produces the following trace:

2014-08-11 07:18:04,017 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rsampat  IP=10.224.135.98        OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1407533287136_3229
2014-08-11 07:18:04,017 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1407533287136_3229
2014-08-11 07:18:04,017 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1407533287136_3229 State change from NEW to NEW_SAVING
2014-08-11 07:18:04,017 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1407533287136_3229
2014-08-11 07:18:04,018 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1407533287136_3229 State change from NEW_SAVING to SUBMITTED
2014-08-11 07:18:04,018 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1407533287136_3229_000001
2014-08-11 07:18:04,018 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1407533287136_3229_000001 State change from NEW to SUBMITTED
2014-08-11 07:18:04,018 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application Submission: appattempt_1407533287136_3229_000001, user: rsampat, currently active: 16

Here is the trace for a failed job. The submission ends with an ERROR from FairScheduler instead of the "Application Submission" message:

2014-08-11 09:26:42,596 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=abhatt1 IP=10.224.137.114 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1407533287136_3564
2014-08-11 09:26:42,596 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1407533287136_3564
2014-08-11 09:26:42,596 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1407533287136_3564 State change from NEW to NEW_SAVING
2014-08-11 09:26:42,597 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1407533287136_3564
2014-08-11 09:26:42,597 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1407533287136_3564 State change from NEW_SAVING to SUBMITTED
2014-08-11 09:26:42,597 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1407533287136_3564_000001
2014-08-11 09:26:42,597 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1407533287136_3564_000001 State change from NEW to SUBMITTED
2014-08-11 09:26:42,597 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1407533287136_3564_000001
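
To check how many submissions are affected, the ResourceManager log can be scanned for the error signature shown in the failed trace. The following is a minimal sketch, not a supported tool: the function name find_unknown_attempts and the regular expression are illustrative assumptions based only on the log line above, and the path to the ResourceManager log depends on the installation.

```python
import re

# Signature of the failure as it appears in the failed-job trace above.
# The message may have no space between "attempt" and the attempt ID,
# so the whitespace between them is optional in the pattern.
ERROR_PATTERN = re.compile(
    r"ERROR \S*FairScheduler: Request for appInfo of unknown attempt\s*"
    r"(appattempt_\d+_\d+_\d+)"
)


def find_unknown_attempts(lines):
    """Return the app attempt IDs that hit the 'unknown attempt' error.

    `lines` is any iterable of log lines, for example an open file object.
    """
    attempts = []
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            attempts.append(match.group(1))
    return attempts
```

Passing an open ResourceManager log file to find_unknown_attempts returns one attempt ID per failed submission, which can then be matched against the APPID values in the RMAuditLogger lines.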

Root cause

This is a known YARN bug; see YARN-1458 for details. The Pivotal internal bug ID is HD-11280.

You may see this issue when the Fair Scheduler workload is high and the YARN parameter yarn.scheduler.fair.sizebasedweight is set to true.

Solution and workaround

Pivotal will include the patch for YARN-1458 once it is committed in the open source community.

To work around the issue, set the parameter yarn.scheduler.fair.sizebasedweight to false in yarn-site.xml on all YARN nodes and restart the cluster.
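
The workaround corresponds to the property entry below in yarn-site.xml. This is a minimal sketch of that single entry; it assumes the entry sits inside the file's existing configuration element alongside the other settings, which are not shown here.

```xml
<!-- Disable size-based weighting in the Fair Scheduler (workaround for YARN-1458) -->
<property>
  <name>yarn.scheduler.fair.sizebasedweight</name>
  <value>false</value>
</property>
```

If the property is already present with a value of true, change the existing entry rather than adding a second one.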
