Differences in compression ratio for same data on different clusters

Environment

Product: Pivotal Greenplum (GPDB)
Version: 4.2.x and 4.3.x

Purpose

This article explains why the compression ratio for compressed tables containing the same data can differ between systems.

Cause

To get the compression ratio for a compressed append-only (AO) table, use the following SQL command:

select get_ao_compression_ratio('schema.table');
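
To check several tables at once, a query along the following lines can be used. This is only a sketch: it assumes the pg_appendonly catalog exposes a compresstype column in these versions and that the tables can be addressed by a plain schema-qualified name.

-- List the compression ratio of every compressed AO table in the database
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       get_ao_compression_ratio(n.nspname || '.' || c.relname) AS ratio
FROM pg_appendonly ao
JOIN pg_class c ON c.oid = ao.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE COALESCE(ao.compresstype, '') NOT IN ('', 'none');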

The compression algorithms do not differ between GPDB 4.2.x and 4.3.x, so the compression ratio is the same across these versions.

The 2 main reasons for differing compression ratios are:

  1. Number of segments
    A difference in the total number of segments will most likely change the compression ratio, as each segment then has a different amount of data to compress.
  2. How the data is loaded
    Whether the data is loaded one row at a time or in batches affects the resulting compression ratio for the table.

Example

The following table definition was used in all the examples below.
The table was also loaded with the same data from a text file.

CREATE TABLE z_test (id int, time timestamp, data text) WITH (appendonly=true, compresstype=ZLIB) DISTRIBUTED BY (id);
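
When comparing the same table on two clusters, it can also be worth confirming that both were created with identical storage options before comparing ratios. The query below is a sketch; it assumes the pg_appendonly catalog exposes the compresstype, compresslevel and blocksize columns in these versions.

-- Show the append-only storage options actually in effect for z_test
SELECT c.relname, a.compresstype, a.compresslevel, a.blocksize
FROM pg_appendonly a
JOIN pg_class c ON c.oid = a.relid
WHERE c.relname = 'z_test';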

Difference in number of segments:
On 4.2.x and 4.3.x GPDB systems with 16 segments:

test4361_16seg=# copy z_test from '/home/gpadmin/compression/z_test.data';
COPY 182680
test4361_16seg=# select get_ao_compression_ratio('z_test');
 get_ao_compression_ratio 
--------------------------
                     3.98
(1 row)

On 4.2.x and 4.3.x GPDB systems with 8 segments:

test4361=# copy z_test from '/home/gpadmin/compression/z_test.data';
COPY 182680
test4361=# select get_ao_compression_ratio('z_test');
 get_ao_compression_ratio 
--------------------------
                     4.29
(1 row)

The difference in the compression ratio is due to the number of segments in the GPDB system. With a greater number of segments, there is less data stored in each segment and so the compression ratio is lower.
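
How the rows are spread across the segments, and therefore how much data each segment has to compress, can be inspected with a query like the following (gp_segment_id is the system column that identifies the segment holding each row):

-- Row count per segment for z_test
SELECT gp_segment_id, count(*) AS row_count
FROM z_test
GROUP BY gp_segment_id
ORDER BY gp_segment_id;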

Difference in the way the data is loaded:
If the data file is split into 10 separate files with:

split -l 18268 z_test.data

Then load the 10 files separately into the 8-segment database:

test4361=# truncate z_test;
TRUNCATE TABLE
test4361=# copy z_test from '/home/gpadmin/compression/xaa';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xab';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xac';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xad';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xae';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xaf';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xag';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xah';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xai';
COPY 18268
test4361=# copy z_test from '/home/gpadmin/compression/xaj';
COPY 18268
test4361=# select get_ao_compression_ratio('z_test');
 get_ao_compression_ratio 
--------------------------
                     4.28
(1 row)

The compression ratio is slightly different (4.28 instead of 4.29) because the data was loaded in batches rather than all at once.
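
To confirm that the batching alone accounts for the difference, the full file can be reloaded in a single COPY and the ratio rechecked. This is only a sketch; based on the earlier single-load result on the 8-segment system, the ratio should come back as roughly 4.29.

-- Reload the complete file in one COPY and recheck the ratio
TRUNCATE z_test;
COPY z_test FROM '/home/gpadmin/compression/z_test.data';
SELECT get_ao_compression_ratio('z_test');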

Hence, the conclusion is that the database version does not affect the compression ratio; what matters is the number of segments and the way the data is loaded.
