Pivotal Knowledge Base


How to manually set the number of mappers in a TEZ Hive job?

Environment

Product       Version
Pivotal HD    3.x
HIVE          0.14

Purpose 

While troubleshooting HIVE performance issues with the TEZ execution engine, there may be a need to increase the number of mappers used during a query. Because the mappers are what read the data, adding mappers can increase the speed at which the data is read.

For example, in the following query TEZ decided that only one mapper was needed, and reading the data with that single mapper took over three minutes (log extract from `yarn logs -applicationId <ApplicationID>`):

2016-06-21 11:12:46,100 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezInitializeOperators start=1466503964475 end=1466503966100 duration=1625 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2016-06-21 11:12:46,107 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1
2016-06-21 11:12:46,122 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 10
2016-06-21 11:12:46,208 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 100
2016-06-21 11:12:46,461 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1000
2016-06-21 11:12:47,569 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 10000
2016-06-21 11:12:52,257 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 100000
2016-06-21 11:13:36,467 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1000000
2016-06-21 11:13:47,485 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:13:47,518 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00001 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:13:47,535 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:14:31,069 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:14:31,099 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00002 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:14:31,115 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:15:15,038 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:15:15,068 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00003 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:15:15,082 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: 5 finished. closing...
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: DESERIALIZE_ERRORS:0
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:4882272
2016-06-21 11:15:58,673 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezRunProcessor start=1466503964235 end=1466504158673 duration=194438 from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
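
The extract above comes from the aggregated YARN application logs. As a minimal sketch, assuming log aggregation is enabled and the real application ID is substituted for the placeholder, the relevant lines can be pulled out as follows:

# Fetch the aggregated container logs for the finished application and
# filter for the mapper record counters and the TezRunProcessor summary.
yarn logs -applicationId <ApplicationID> | grep -E 'records read|PERFLOG method=TezRunProcessor'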

By setting 4 mappers to read the data, we were able to get the mapper phase down from roughly 194 seconds (duration=194438 in the PerfLogger line above) to 76 seconds.

Procedure

In order to manually set the number of mappers in a Hive query when TEZ is the execution engine, use the configuration property `tez.grouping.split-count`. It can be set in either of two ways:

  • Set it for the current session when logged into the HIVE CLI: `set tez.grouping.split-count=4;` will create 4 mappers (see the session sketch below).
  • Add an entry to `hive-site.xml` via Ambari. If set via `hive-site.xml`, HIVE will need to be restarted for the change to take effect.
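
As a minimal sketch of the session-level approach, assuming a hypothetical ORC-backed table named `workorder_mow` (the property only controls how TEZ groups the input splits, so any query against the table will do):

-- Request 4 input split groups, i.e. 4 mappers, for queries in this session only.
set tez.grouping.split-count=4;

-- Subsequent queries in this session now read the data with 4 mappers, for example:
select count(*) from workorder_mow;

For the cluster-wide approach, the corresponding `hive-site.xml` entry (added as a custom property in Ambari and followed by a HIVE restart) would look along these lines:

<property>
  <name>tez.grouping.split-count</name>
  <value>4</value>
</property>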

Additional information

Further information on HIVE performance troubleshooting is available from our partner Hortonworks.

