Pivotal Knowledge Base

About PHD default capacity scheduler data locality behavior

Environment

  • PHD 2.x
  • PHD 1.x
  • YARN / HDFS Rack Awareness implemented

About PHD default capacity scheduler data locality behavior

After installing PHD, the capacity scheduler has data locality disabled by default. In large clusters with 40 or more nodes, users will see all the containers for a given application run on only 2 or 3 nodes.

How does the capacity scheduler determine which NodeManager gets the next containers when data locality is disabled?

The capacity scheduler assigns resources to a NodeManager after that NodeManager has checked in with its heartbeat. The NodeManager whose heartbeat was received last when the job is submitted is the one considered for container allocation. Assuming that NodeManager reports 48GB free and the container memory allocation is set to 2GB, the capacity scheduler will allocate 24 containers to this single node when data locality is disabled.
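
For illustration, here is a minimal Python sketch of that arithmetic. It is a simplified model of the behavior described above, not actual CapacityScheduler code, and it reuses the 48GB/2GB figures from the example:

# Simplified model: with data locality disabled, the scheduler keeps handing
# containers to the NodeManager that just heartbeated until its reported
# memory is exhausted.
node_free_mb = 48 * 1024   # the NodeManager reports 48GB free
container_mb = 2 * 1024    # each container requests 2GB

containers_on_this_node = node_free_mb // container_mb
print(containers_on_this_node)   # 24 -- all of them land on one node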

What happens when data locality is enabled on a small cluster with only 4 NodeManagers and a single application needs to run 32 2GB containers?

Example Environment

  • 4 NodeManagers, each with 48GB of memory
  • MapReduce application with 32 map containers
  • Each map container needs an allocation of 2GB
  • A single NodeManager can accept 24 2GB map containers

The capacity scheduler assigns resources to a NodeManager after that NodeManager has checked in with its heartbeat. The last heartbeat received when the job is submitted determines which NodeManager is considered for container allocation. In this example the NodeManager will report 48GB free and the scheduler will assign 24 containers to that single node. Once this NodeManager is full, the capacity scheduler repeats the same process to select the next NodeManager for the remaining 8 containers.
In a four-node cluster with a replication factor of 3, there is a good chance that the data will be local to the selected NodeManager for most of the map tasks. As long as the NodeManagers' memory and CPU resources are configured properly, this behavior should not cause much of an impact, because the scheduler is designed to consume 100% of the resources on each node.
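
The walk-through can be modeled with a short Python sketch that fills whichever NodeManager heartbeats first. This is a simplified greedy model of the behavior described above, not CapacityScheduler code, and the node names are made up for the example:

# Greedy model of the example: 32 map containers of 2GB each on 4 NodeManagers,
# each reporting 48GB free. The node that heartbeats first is filled before the
# scheduler moves on to the next one.
container_mb = 2 * 1024
free_mb = {"nm1": 48 * 1024, "nm2": 48 * 1024, "nm3": 48 * 1024, "nm4": 48 * 1024}
assigned = {name: 0 for name in free_mb}

pending = 32
for name in free_mb:                 # heartbeat order assumed to be nm1..nm4
    while pending and free_mb[name] >= container_mb:
        free_mb[name] -= container_mb
        assigned[name] += 1
        pending -= 1

print(assigned)   # {'nm1': 24, 'nm2': 8, 'nm3': 0, 'nm4': 0}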

Disable/Enable data locality 

By default, data locality is disabled: the parameter yarn.scheduler.capacity.node-locality-delay is set to "-1". To enable it, set this value to the number of racks in the cluster. This assumes YARN and HDFS are configured with rack awareness.

/etc/gphd/hadoop/conf/capacity-scheduler.xml

<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>-1</value>
  <description>
    Number of missed scheduling opportunities after which the CapacityScheduler
    attempts to schedule rack-local containers.
    Typically this should be set to number of racks in the cluster, this
    feature is disabled by default, set to -1.
  </description>
</property>

Restart the ResourceManager after the changes are made.
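
If you prefer to script the change instead of editing the file by hand, the following Python sketch updates the property using only the standard library. The file path is the one referenced above; the example value of 1 assumes a single-rack cluster and should be replaced with your actual rack count:

# Hedged sketch: set yarn.scheduler.capacity.node-locality-delay in
# capacity-scheduler.xml. The value "1" assumes a single-rack cluster.
import xml.etree.ElementTree as ET

conf_path = "/etc/gphd/hadoop/conf/capacity-scheduler.xml"
target = "yarn.scheduler.capacity.node-locality-delay"
new_value = "1"

tree = ET.parse(conf_path)
for prop in tree.getroot().findall("property"):
    if prop.findtext("name") == target:
        prop.find("value").text = new_value
        break

tree.write(conf_path)
# Restart the ResourceManager afterwards so the scheduler picks up the change.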

Comments

  • Gurmukh Singh

    Hi Daniel,

    Did you mean the number of nodes in a rack? Because setting "yarn.scheduler.capacity.node-locality-delay" equal to the number of racks will not help much compared to the number of nodes in a rack.
