Pivotal Knowledge Base

Follow

Data skew in randomly distributed Pivotal HD tables

Environment

Product Version
Pivotal HD (HAWQ) 1.2.x, 1.3.x

Symptom

Inserting or redistributing data on a randomly distributed table with the Pivotal Query Optimizer (ORCA) enabled results in data skew.

[gpadmin@hdw1 ~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# show optimizer;
 optimizer
-----------
 on
(1 row)

gpadmin=# create table test_columns (id int,  testdata char(39)) with (appendonly=true, orientation=parquet) distributed randomly;
CREATE TABLE
gpadmin=# create external table ext_test_columns (like test_columns) LOCATION ('pxf://localhost:8020/tmp/hawq_testdata/test_data.csv?Profile=HdfsTextSimple') FORMAT 'CSV';
CREATE EXTERNAL TABLE
gpadmin=# explain analyze insert into test_columns SELECT * from ext_test_columns;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Insert  (cost=0.00..21281.23 rows=166667 width=16)
   Rows out:  Avg 2500000.0 rows x 2 workers.  Max 3222116 rows (seg0) with 166 ms to first row, 13583 ms to end, start offset by 249 ms.
   Executor memory:  1K bytes avg, 1K bytes max (seg0).
   ->  Result  (cost=0.00..447.89 rows=166667 width=32)
         Rows out:  Avg 2500000.0 rows x 2 workers.  Max 3222116 rows (seg0) with 159 ms to first row, 8947 ms to end, start offset by 249 ms.
         ->  External Scan on ext_test_columns  (cost=0.00..437.60 rows=166667 width=16)
               Rows out:  Avg 2500000.0 rows x 2 workers.  Max 3222116 rows (seg0) with 159 ms to first row, 7440 ms to end, start offset by 249 ms.
 Slice statistics:
   (slice0)    Executor memory: 3251K bytes avg x 6 workers, 9449K bytes max (seg0).
 Statement statistics:
   Memory used: 128000K bytes
 Total runtime: 14419.207 ms
(12 rows)

gpadmin=# select gp_segment_id, count(*) from test_columns group by 1;
 gp_segment_id |  count
---------------+---------
             0 | 3222116
             2 | 1777884
(2 rows)

gpadmin=# alter table test_columns set  WITH (reorganize=true) distributed randomly;
ALTER TABLE
gpadmin=# select gp_segment_id, count(*) from test_columns group by 1;
 gp_segment_id |  count
---------------+---------
             0 | 3222116
             2 | 1777884
(2 rows)

Cause

The query optimizer does not perform a redistribute motion when adding or redistributing data in randomly distributed HAWQ tables.

Resolution

Disable the query optimizer before inserting data into a randomly distributed table.
 
If the data has already been inserted into the table, disable the optimizer and alter the table as shown below:
gpadmin=# set optimizer=off;
SET
gpadmin=# alter table test_columns set  WITH (reorganize=true) distributed randomly;
ALTER TABLE
gpadmin=# select gp_segment_id, count(*) from test_columns group by 1;
 gp_segment_id | count
---------------+--------
             1 | 833354
             4 | 833354
             2 | 833358
             0 | 833288
             3 | 833277
             5 | 833369
(6 rows)

Comments

Powered by Zendesk