Pivotal Knowledge Base

Running TeraSort MapReduce Benchmark

Environment

Product: Pivotal HD
Version: All Versions

Purpose

The purpose of TeraSort is to test the CPU and memory capacity of the cluster by sorting 1TB of data on a 10-byte ASCII key in the shortest time possible. Benchmark results will vary depending on the available cluster resources.

Basic Workflow

Running TeraGen

Use the following command to run TeraGen:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 10000000000 /teraInput

Argument 1: Number of 100-byte rows to generate (10,000,000,000 in this example, which is 1TB).

Argument 2: HDFS path where the generated data is written.
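
For other dataset sizes, the row count follows from the fixed 100-byte row length. The sketch below is a hypothetical helper (the name teragen_rows is ours and is not part of Hadoop) that converts a target size in bytes into the first TeraGen argument:

```shell
# Hypothetical helper: convert a target dataset size in bytes into
# the row count TeraGen expects (each generated row is 100 bytes).
teragen_rows() {
    echo $(( $1 / 100 ))
}

# 1TB (10^12 bytes) gives the row count used in the command above.
teragen_rows 1000000000000   # prints 10000000000
```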

TeraGen runs map tasks to generate the data and does not run any reduce tasks. The default number of map tasks is defined by the "mapreduce.job.maps=2" parameter. Its only purpose here is to generate the 1TB of random data, with each row in the following format: "10 bytes key | 2 bytes break | 32 bytes ASCII/hex | 4 bytes break | 48 bytes filler | 4 bytes break", which adds up to the 100 bytes per row.
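
With only the default 2 map tasks, generating 1TB can take a long time, so the map count is usually raised with a -D switch placed before the row count. The sketch below assumes the jar path used in this article; the helper name teragen_cmd and the value of 100 map tasks are illustrative only. It prints the command so it can be reviewed before being run:

```shell
# Jar path as used elsewhere in this article.
EXAMPLES_JAR=/usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar

# Hypothetical helper: print a TeraGen command with an explicit map count.
#   $1 = number of map tasks, $2 = row count, $3 = HDFS output path
teragen_cmd() {
    echo "hadoop jar $EXAMPLES_JAR teragen -D mapreduce.job.maps=$1 $2 $3"
}

# 100 map tasks is an arbitrary example value; size it to your cluster.
teragen_cmd 100 10000000000 /teraInput
```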

Running TeraSort

Use the following command to run TeraSort: 

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
13/09/23 21:30:21 INFO mapreduce.Job: Running job: job_1379996975669_0001
.
.
13/09/23 22:25:54 INFO terasort.TeraSort: done

This creates a series of map tasks that sort the data by the ASCII key. There is one map task for each HDFS block of input data. By default there is one reduce task, as defined by "mapreduce.job.reduces=1".

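To run more reducers, pass the switch "-D mapred.reduce.tasks=8" (8 reducers in this case) before the input path; the TeraValidate command below uses the same switch. "mapred.reduce.tasks" is the older name for "mapreduce.job.reduces". Tune this value to the number of nodes in the cluster so that the job uses the cluster's full capacity.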
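
As a sketch, the switch can be applied to the TeraSort command from this article as follows. The helper name terasort_cmd is hypothetical, and it only prints the command so it can be reviewed before being run; the jar path and HDFS paths are the ones used above.

```shell
# Jar path as used elsewhere in this article.
EXAMPLES_JAR=/usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar

# Hypothetical helper: print a TeraSort command with an explicit reducer count.
#   $1 = number of reduce tasks, $2 = HDFS input path, $3 = HDFS output path
terasort_cmd() {
    echo "hadoop jar $EXAMPLES_JAR terasort -D mapred.reduce.tasks=$1 $2 $3"
}

# The -D switch must come before the input and output paths.
terasort_cmd 8 /teraInput /teraOutput
```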

The data is partitioned according to the number of reduce tasks at a 1:1 ratio: one partition for every reduce task.
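
To illustrate the one-partition-per-reducer idea, the sketch below assigns keys to partitions by comparing them against sorted split points, so that N split points yield N+1 partitions. It is a simplified local illustration and not TeraSort's own code: TeraSort chooses its split points by sampling the input, whereas the helper name partition_keys and the split points "h" and "q" here are hand-picked assumptions.

```shell
# Hypothetical helper: read keys (one per line) on stdin and print
# "<partition> <key>", where the partition index is the number of
# split points that are less than or equal to the key.
#   $1 = space-separated, sorted split points
partition_keys() {
    LC_ALL=C awk -v splits="$1" '
        BEGIN { n = split(splits, s, " ") }
        {
            p = 0
            for (i = 1; i <= n; i++)
                if (($0 "") >= (s[i] "")) p = i   # force string comparison
            print p, $0
        }'
}

# Two split points give three partitions, i.e. three reduce tasks.
printf 'apple\nmango\nzebra\n' | partition_keys "h q"
# prints:
# 0 apple
# 1 mango
# 2 zebra
```

Because every key in partition 0 sorts before every key in partition 1, and so on, the reducers' sorted output files can be read in partition order as one globally sorted dataset.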

NOTE: Yahoo runs its TeraSort benchmark after changing the replication factor to 1 on the parent directory.

Running TeraValidate

Use the following command to run TeraValidate:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate -D mapred.reduce.tasks=8 /teraOutput /teraValidate

The command above reads the sorted output data and checks that each key is less than the next key across the entire dataset.
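
The same kind of ordering check can be illustrated locally on a small text sample with the standard sort utility. This is only an analogy under stated assumptions: the helper name is_sorted is hypothetical, the sample keys are made up, and real TeraSort output consists of 100-byte records rather than text lines. Note that "sort -c" also accepts equal adjacent keys.

```shell
# Hypothetical helper: succeed if the lines of file $1 are in
# byte-wise (C locale) sorted order, fail otherwise.
is_sorted() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}

# Small made-up sample of keys, one per line.
sample=$(mktemp)
printf 'AAAA\nBBBB\nCCCC\n' > "$sample"

if is_sorted "$sample"; then
    echo "sorted"
else
    echo "out of order"
fi

rm -f "$sample"
```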
