Pivotal Knowledge Base

Follow

Greenplum: How to Convert Timestamp Data Type in Hive Parquet File back to Timestamp Format

Environment

 Product  Version
 Pivotal Greenplum (GPDB)  4.3.x
 OS  RHEL 6.x

Purpose

This article is to provide a way to convert timestamp data type in Hive generated Parquet schema back to timestamp format in Greenplum via a gphdfs protocol.

Cause

As shown by the following description in the Greenplum doc, timestamp data type in Hive generated Parquet schema will be converted to bytea data type by the gphdfs protocol in Greenplum.

  • For the Hive timestamp data type, the Hive generated a Parquet schema for the data type specifies that the data is stored as data type int96. The gphdfs protocol converts the int96 data type to the Greenplum Database bytea data type.

So the converted data of timestamp format in Hive generated a Parquet schema that is not shown in a timestamp format if directly queried out via gphdfs protocol in Greenplum, which is illustrated as follows.

1. A Parquet table in Hive containing timestamp data type:

0: jdbc:hive2://localhost:10000> select * from parquet3a;
+-------------------------+---------------------------+---------------------+-----------------------+--+
| parquet3a.test_varchar | parquet3a.test_timestamp | parquet3a.test_int | parquet3a.test_float |
+-------------------------+---------------------------+---------------------+-----------------------+--+
| test | 2016-12-07 00:01:00.0 | 1 | 0.10000000149011612 |
| test2 | 2016-12-07 00:02:00.0 | 2 | 0.20000000298023224 |
+-------------------------+---------------------------+---------------------+-----------------------+--+

2. The Hive Parquet table data is loaded with gphdfs protocol in Greenplum and shows unrecognizable characters for columns of timestamp data type:

gpadmin=# \d parquet3_ext1
External table "public.parquet3_ext1"
Column | Type | Modifiers
----------------+-----------------------+-----------
test_varchar | character varying(20) |
test_timestamp | bytea |
test_int | integer |
test_float | real |
Type: readable
Encoding: UTF8
Format type: parquet
Format options: formatter 'gphdfs_import'
External location: gphdfs://hdp24a/apps/hive/warehouse/parquet3a gpadmin=# select * from parquet3_ext1;
test_varchar | test_timestamp | test_int | test_float
--------------+-----------------------------------------+----------+------------
test2 | \000\260\030\374~4\000\000\201\200%\000 | 2 | 0.2
test | \000X\321\003q4\000\000\201\200%\000 | 1 | 0.1
(2 rows)

Procedure

As mentioned above, the gphdfs protocol converts Hive timestamp data type to bytea data type, which is actually an array of 12 bytes made of 2 parts:

  • byte 7~byte 0: Time elapsed in the day in nanoseconds.
  • byte 11~byte8: Julian date.

Based on this information, a UDF could be created in Greenplum to convert the 12-byte's timestamp data to Epoch date.

C Function

The attached C source file con_bytea.c contains a function to implement the conversion. It is built into a shared object library which could be used to create a UDF in Greenplum. 

1. Put the shared object library file to $GPHOME/lib/postgresql

2. Create a UDF in Greenplum referring to C function in the shared object library. Note that function bytea2ts will return an Epoch date in milliseconds:

gpadmin=# CREATE FUNCTION bytea2ts(bytea) RETURNS bigint
gpadmin-# AS '$libdir/gp_covhpts.so', 'bytea2ts'
gpadmin-# LANGUAGE C STRICT;
CREATE FUNCTION
gpadmin=#

3. Use the defined UDF to query Hive Parquet table data with gpdhfs protocol again and the same value for the test_timestamp column is shown as in Hive is returned. Keep in mind that the timezone might need to be adjusted for your own location:

gpadmin=# select test_varchar, to_timestamp(bytea2ts(test_timestamp)/1000) AT TIME ZONE '-8:00' as test_timestamp from parquet3_ext1;
test_varchar | test_timestamp
--------------+---------------------
test2 | 2016-12-07 00:02:00
test | 2016-12-07 00:01:00
(2 rows)

PL/Python Function

1. Make sure the language plpythonu has already been created in the database. Refer to the Greenplum doc for how to create the language

2. Create a Python function as shown below:

drop function if exists py_bytea2ts(bytea);
CREATE FUNCTION py_bytea2ts (ts bytea)
RETURNS bigint
AS $$ JULIAN_EPOCH_OFFSET_DAYS = 2440588
MILLIS_IN_DAY = 24*3600*1000

timeOfDayNanos = [ts[7], ts[6], ts[5], ts[4], ts[3], ts[2], ts[1], ts[0]]
julianDay = [ts[11], ts[10], ts[9], ts[8]] timeOfDayNanos_ba = bytearray(timeOfDayNanos)
julianDay_ba = bytearray(julianDay) timeOfDayNanos_encoded = str(timeOfDayNanos_ba).encode('hex')
julianDay_encoded = str(julianDay_ba).encode('hex') timeOfDayNanos_l = long(timeOfDayNanos_encoded, 16)
julianDay_l = long(julianDay_encoded, 16) con_time = (julianDay_l - JULIAN_EPOCH_OFFSET_DAYS)*MILLIS_IN_DAY + timeOfDayNanos_l/1000000; return con_time $$ LANGUAGE plpythonu;

3. Run the same query as being used to verify the C function and you could get the same result:

gpadmin=# select test_varchar, to_timestamp(py_bytea2ts(test_timestamp)/1000) AT TIME ZONE '-8:00' as test_timestamp from parquet3_ext1;
test_varchar |test_timestamp
--------------+---------------------
test2 | 2016-12-07 00:02:00
test | 2016-12-07 00:01:00
(2 rows) 


Additional Information

Comments

  • Avatar
    Sangdon Shin

    Excellent article!

Powered by Zendesk