Pivotal Knowledge Base

Pivotal HDB PXF may return stale data when reading external Hive Avro tables

Environment

  • Pivotal HDB 1.3.0.x
  • Pivotal HDB 1.2.x

Symptom

HAWQ PXF may return stale data when reading external Hive Avro tables that use a "bytes" column type:

hawq> select * from my_avro_table;
100000001 | [1,2] | 1.1 | [0.1,0.2,0.3] | testing all supported types in AvroResolver 1 | [and,also,arrays] | 1.1 | [0.1,0.2,0.3333] | 100000000001 | [200000000002,400000000004] | 12 | [\\062\\063,\\063] | t | [true,false,false]
-200000002 | [-1,-2] | -2.2 | [-0.1,-0.2,-0.3] | testing all supported types in AvroResolver 2 | [and,arrays,too] | -2.2 | [-0.1,0.2,-0.3333] | -200000000002 | [-300000000003,-500000000005] | 32 | [\\063\\063,\\062] | f | [false,false,true]
-200000002 | [-1,-2] | -2.2 | [-0.1,-0.2,-0.3] | testing all supported types in AvroResolver 2 | [and,arrays,too] | -2.2 | [-0.1,0.2,-0.3333] | -200000000002 | [-300000000003,-500000000005] | 44 | [\\064\\065,\\062] | f | [false,false,true]
hive> select * from my_avro_table;
OK
100000001    [1,2]    1.1    [0.1,0.2,0.3]    testing all supported types in AvroResolver 1    ["and","also","arrays"]    1.1    [0.1,0.2,0.3333]    100000000001    [200000000002,400000000004]    12    [2,3]    true    [true,false,false]
-200000002    [-1,-2]    -2.2    [-0.1,-0.2,-0.3]    testing all supported types in AvroResolver 2    ["and","arrays","too"]    -2.2    [-0.1,0.2,-0.3333]    -200000000002    [-300000000003,-500000000005]    3    [3,2]    false    [false,false,true]
-200000002    [-1,-2]    -2.2    [-0.1,-0.2,-0.3]    testing all supported types in AvroResolver 2    ["and","arrays","too"]    -2.2    [-0.1,0.2,-0.3333]    -200000000002    [-300000000003,-500000000005]    44    [3,2]    false    [false,false,true]

Reproduction steps

  1. Create Hive table
    hive> CREATE external TABLE my_avro_table
        >   ROW FORMAT SERDE
        >   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        >   WITH SERDEPROPERTIES (
        >     'avro.schema.url'='file:///tmp/array.avsc')
        >   STORED as INPUTFORMAT
        >   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        >   OUTPUTFORMAT
        >   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
    OK
    Time taken: 0.675 seconds
    
    hive> desc my_avro_table;
    OK
    type_int                int                     from deserializer
    type_int_array          array<int>              from deserializer
    type_double             double                  from deserializer
    type_double_array       array<double>           from deserializer
    type_string             string                  from deserializer
    type_string_array       array<string>           from deserializer
    type_float              float                   from deserializer
    type_float_array        array<float>            from deserializer
    type_long               bigint                  from deserializer
    type_long_array         array<bigint>           from deserializer
    type_bytes              binary                  from deserializer
    type_bytes_array        array<binary>           from deserializer
    type_boolean            boolean                 from deserializer
    type_boolean_array      array<boolean>          from deserializer
    Time taken: 0.2 seconds, Fetched: 14 row(s)
  2. Load data
    hive> LOAD DATA LOCAL INPATH 'file:///tmp/array.avro' INTO TABLE my_avro_table;
    Loading data to table default.my_avro_table
    Table default.my_avro_table stats: [numFiles=2, numRows=0, totalSize=4922, rawDataSize=0]
    OK
    Time taken: 0.611 seconds
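The actual contents of the referenced /tmp/array.avsc are not shown in this article. For reference, the following sketch builds a hypothetical Avro schema matching a few of the columns from the `desc` output above (the field list is an assumption, truncated for brevity):

```python
import json

# Hypothetical sketch of an Avro schema for my_avro_table. The real
# /tmp/array.avsc is not shown in this article; field names follow the
# `desc my_avro_table` output, truncated to a few representative fields.
schema = {
    "type": "record",
    "name": "my_avro_table",
    "fields": [
        {"name": "type_int",         "type": "int"},
        {"name": "type_int_array",   "type": {"type": "array", "items": "int"}},
        {"name": "type_bytes",       "type": "bytes"},
        {"name": "type_bytes_array", "type": {"type": "array", "items": "bytes"}},
        {"name": "type_boolean",     "type": "boolean"},
    ],
}

# Serialize to JSON; this string could be saved as /tmp/array.avsc.
schema_json = json.dumps(schema, indent=2)
```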

Cause

Stale data from the previous row is carried over (as padding) within the column when the previous value contained extra bytes. The result is that leftover data from earlier rows leaks into the "bytes" column.
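The mechanism can be illustrated with a short sketch. This is plain Python, not the actual PXF resolver code, and the buffer-reuse pattern is an assumption; it shows how a decode buffer reused across rows returns stale trailing bytes when a shorter value follows a longer one:

```python
# Illustration only -- not actual PXF code. A buffer reused across rows
# leaks stale bytes when a shorter value follows a longer one, which
# mirrors the symptom above: row 2's bytes column picks up leftover
# bytes from row 1.
buf = bytearray(4)

def decode_bad(value: bytes) -> bytes:
    buf[:len(value)] = value
    return bytes(buf)               # BUG: returns stale tail bytes too

def decode_good(value: bytes) -> bytes:
    buf[:len(value)] = value
    return bytes(buf[:len(value)])  # correct: trim to the new length
```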

Workaround

This can be worked around by enforcing that fields with the 'bytes' data type use the same number of bytes across all rows.
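On the data-producing side, one way to enforce this is to pad every bytes value to a fixed width before writing the Avro file. The sketch below is a hypothetical helper; the pad width and null-byte fill are assumptions that must match how your consumers interpret the values:

```python
def pad_bytes(value: bytes, width: int, fill: bytes = b"\x00") -> bytes:
    """Right-pad a bytes value to a fixed width so every row's 'bytes'
    field occupies the same number of bytes (hypothetical helper for
    the workaround above). Raises if the value is already too long."""
    if len(value) > width:
        raise ValueError(f"value longer than {width} bytes: {value!r}")
    return value + fill * (width - len(value))

# e.g. pad each row's bytes field to the longest value in the dataset
rows = [b"\x02\x03", b"\x03"]
width = max(len(v) for v in rows)
padded = [pad_bytes(v, width) for v in rows]
```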

Fix

This will be fixed in Pivotal HDB 1.3.1.0.
