Differences in the Compression Ratio for Same Data on Different Clusters

Environment

Pivotal Greenplum Database (GPDB) 4.2.x and 4.3.x

Introduction

This article explains why the compression ratio for compressed tables with the same data can differ between systems.

Description

To get the compression ratio for a compressed append-only (AO) table, use the following SQL command:

select get_ao_compression_ratio('schema.table');

The compression algorithms do not differ between GPDB 4.2.x and 4.3.x, so the compression ratio is the same across these versions.

The two main reasons for differing compression ratios are:

  1. Number of segments
    A difference in the total number of segments will most likely change the compression ratio, as there is a different amount of data to be compressed in each segment (a query to check the segment count follows this list).
  2. How the data is loaded
    Whether data is loaded one row at a time, in batches, or all at once affects the resulting compression ratio for the table.
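
For the first factor, the number of primary segments in a cluster can be checked against the gp_segment_configuration catalog. The query below is a minimal sketch:

-- Count the primary segments (content >= 0 excludes the master; role = 'p' selects primaries)
select count(*) as primary_segments from gp_segment_configuration where content >= 0 and role = 'p';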

Example

The following table definition was used in all the examples below.
The table was also loaded with the same data from a text file.

create table z_test (id int, time timestamp, data text) with (appendonly=true, compresstype=zlib) distributed by (id);

The difference in the number of segments:
On 4.2.x and 4.3.x GPDB systems with 16 segments:

test4361_16seg=# copy z_test from '/home/gpadmin/compression/z_test.data';
COPY 182680
test4361_16seg=# select get_ao_compression_ratio('z_test');
 get_ao_compression_ratio 
--------------------------
                     3.98
(1 row)

On 4.2.x and 4.3.x GPDB systems with 8 segments:

test4361=# copy z_test from '/home/gpadmin/compression/z_test.data';
COPY 182680
test4361=# select get_ao_compression_ratio('z_test');
 get_ao_compression_ratio 
--------------------------
                     4.29
(1 row)

The difference in the compression ratio is due to the number of segments in the GPDB system. With a greater number of segments, there is less data stored in each segment, and so the compression ratio is lower.
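
To see how the rows of z_test are distributed across the segments, and therefore how much data each segment has to compress, the gp_segment_id system column can be queried. The query below is a minimal sketch:

-- Show how many rows of z_test are stored on each segment;
-- fewer rows per segment means less data for each segment's compressor to work on
select gp_segment_id, count(*) as rows_on_segment from z_test group by gp_segment_id order by gp_segment_id;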

The difference in the way the data is loaded:
If the data file is split into 10 separate files with:

split -l 18268 z_test.data

Then the 10 files are loaded separately into the 8-segment database:

test4361=# truncate z_test;
TRUNCATE TABLE
test4361=# copy z_test from '/home/gpadmin/compression/xaa';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xab';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xac';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xad';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xae';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xaf';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xag';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xah';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xai';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xaj';
COPY 18268
test4361=# select get_ao_compression_ratio('z_test');
 get_ao_compression_ratio 
--------------------------
                     4.28
(1 row)

The compression ratio is slightly different because the data was loaded in batches rather than all at once: each load is compressed separately, so the compressor works over smaller chunks of data.
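
Besides get_ao_compression_ratio, the compressed on-disk footprint of the table can be compared after each load method with pg_relation_size. The query below is a minimal sketch; pg_size_pretty is only used to make the output readable:

-- Report the on-disk (compressed) size of z_test
select pg_size_pretty(pg_relation_size('z_test')) as table_size;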

Hence, the conclusion is that the database version does not affect the compression ratio; the number of segments and the way the data is loaded do.
