Pivotal Knowledge Base

Hive UDAF SUM function returns NULL

Environment

  • PHD 2.0.1.0
  • HIVE 0.12.0

Symptom

When executing SUM against a decimal column, Hive returns NULL as the value:

hive> select sum(retail_price) from retail_repro;
Job 0: Map: 10  Reduce: 1   Cumulative CPU: 530.03 sec   HDFS Read: 2600574562 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 50 seconds 30 msec
OK
NULL

But if you cast the column to double or bigint, a value is returned:

hive> select sum(double(retail_price)) from retail_repro;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 16.48 sec   HDFS Read: 30644180 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 480 msec
OK
5.1994505839339055E7

hive> select sum(bigint(retail_price)) from retail_repro;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 14.31 sec   HDFS Read: 30644180 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 310 msec
OK
48073762

Cause

We found that some values in the retail_price column were 32 digits long (31 of them after the decimal point), for example:

0.0000000000000008881784197001252

Given that the maximum precision of HiveDecimal in 0.12.0 is 38 digits, the SUM function returns NULL whenever HiveDecimal throws a NumberFormatException. Once the running sum accumulates digits to the left of the decimal point, adding a value with 31 fractional digits requires more than 38 significant digits, so the add overflows. The problem in this case is that the mapper process does not log any errors when the merge function receives the NFE. As shown below, the code simply sets sum = null and returns:

./ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java

public void merge(AggregationBuffer agg, Object partial) throws HiveException {
  ...
  try {
    // add() throws NumberFormatException when the result would exceed
    // HiveDecimal's maximum precision of 38 digits.
    myagg.sum = myagg.sum.add(
        PrimitiveObjectInspectorUtils.getHiveDecimal(partial, inputOI));
  } catch (NumberFormatException e) {
    // The exception is swallowed silently: the running sum is discarded
    // and nothing is written to the task log.
    myagg.sum = null;
  }
  ...
}
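
To see why the add overflows, the arithmetic can be reproduced with java.math.BigDecimal, which backs HiveDecimal in 0.12.0. This is a minimal sketch, not the actual HiveDecimal code: MAX_PRECISION and the enforce() helper below are simplified stand-ins for its truncation check, and the running-sum value is illustrative.

import java.math.BigDecimal;

public class PrecisionDemo {
    // HiveDecimal in 0.12.0 caps values at 38 significant digits;
    // enforce() is a simplified stand-in for that truncation check.
    static final int MAX_PRECISION = 38;

    static BigDecimal enforce(BigDecimal d) {
        if (d.precision() > MAX_PRECISION) {
            throw new NumberFormatException("Assignment would result in truncation");
        }
        return d;
    }

    public static void main(String[] args) {
        // A running sum with 8 digits left of the decimal point (illustrative)...
        BigDecimal sum = new BigDecimal("48073762.25");
        // ...plus a value with 31 digits right of it: the exact result needs
        // 8 + 31 = 39 significant digits, one more than the 38 allowed.
        BigDecimal tiny = new BigDecimal("0.0000000000000008881784197001252");
        System.out.println(enforce(sum.add(tiny)));  // throws NumberFormatException
    }
}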

We had to insert debug statements in order to capture the stack trace and confirm the exception.
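
The snippet below is a sketch of the kind of logging we added, not the exact patch: LOG is assumed to be the Commons Logging instance already defined on GenericUDAFSum, and org.apache.hadoop.util.StringUtils.stringifyException() renders the exception and its stack trace into the message.

try {
  myagg.sum = myagg.sum.add(
      PrimitiveObjectInspectorUtils.getHiveDecimal(partial, inputOI));
} catch (NumberFormatException e) {
  // Log the NFE and its full stack trace before discarding the sum
  // (assumption: LOG is the class's existing Commons Logging logger).
  LOG.error("NFE encountered during merge:"
      + org.apache.hadoop.util.StringUtils.stringifyException(e));
  myagg.sum = null;
}

With this instrumentation in place, the mapper log showed: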

  2014-11-30 22:14:40,654 ERROR [main] org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum: NFE encountered during merge:java.lang.NumberFormatException: Assignment would result in truncation
	at org.apache.hadoop.hive.common.type.HiveDecimal.<init>(HiveDecimal.java:53)
	at org.apache.hadoop.hive.common.type.HiveDecimal.<init>(HiveDecimal.java:47)
	at org.apache.hadoop.hive.common.type.HiveDecimal.add(HiveDecimal.java:178)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumHiveDecimal.merge(GenericUDAFSum.java:159)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumHiveDecimal.iterate(GenericUDAFSum.java:127)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:183)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:629)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:826)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:723)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:791)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Workaround

Cast the column to double (or bigint if the fractional digits can be discarded), as shown in the Symptom section above.

Fix

This behavior changed in Hive 0.13.x, as described in Hive_Decimal_Precision_Scale_Support. By tuning precision and scale in the table definition (for example, declaring the column as DECIMAL(20,2)), we can control how HiveDecimal rounds off the digits after the decimal point.
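
As a rough illustration of the new behavior (a sketch using java.math.BigDecimal rather than Hive's actual code path): once the result must fit a declared precision, an overflowing sum is rounded rather than rejected. The running-sum value is illustrative.

import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class RoundingDemo {
    public static void main(String[] args) {
        BigDecimal sum  = new BigDecimal("48073762.25");   // illustrative running sum
        BigDecimal tiny = new BigDecimal("0.0000000000000008881784197001252");
        // Rounding the exact 39-digit result to 38 significant digits
        // drops the trailing digit instead of throwing.
        System.out.println(sum.add(tiny, new MathContext(38, RoundingMode.HALF_UP)));
        // -> 48073762.250000000000000888178419700125
    }
}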

In addition, the error handling has been improved so that the NumberFormatException is caught and printed to the mapper container's syslog.

See internal JIRA HD-12052
