Pivotal Knowledge Base

Follow

Reindex causes losing stats and performance issues

Environment

Product Version
Pivotal Greenplum (GPDB) 4.3.x
OS RHEL 6.x
Others  

Symptom

  • Planning phase of a query is taking a long time
  • Index missing stats in pg_class

See the examples below:

- Create a table

flightdata=# create table test11 as select * from pg_attribute;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'attrelid' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 3706

- Create an index

flightdata=# create index test11_idx on test11 using bitmap (attrelid);
CREATE INDEX

- Analyze the table

flightdata=# analyze test11;
ANALYZE

- Both table and index have valid statistics

flightdata=# select relname,relfilenode,relkind,relpages,reltuples from pg_class where relname like 'test11%';
  relname   | relfilenode | relkind | relpages | reltuples 
------------+-------------+---------+----------+-----------
 test11     |       17158 | r       |       13 |      3001
 test11_idx |       17184 | i       |       48 |      3001
(2 rows)

- Reindex the table

flightdata=# reindex table test11;
REINDEX

- Index is recreated and index statistics are reset

flightdata=# select relname,relfilenode,relkind,relpages,reltuples from pg_class where relname like 'test11%';
  relname   | relfilenode | relkind | relpages | reltuples 
------------+-------------+---------+----------+-----------
 test11     |       17158 | r       |       13 |      3001
 test11_idx |       17194 | i       |        0 |         0
(2 rows)

Now when a query is running and object related to the query does not have statistics, the database will attempt to collect an approximate data about the object size by running pg_relation_size(). This command is dispatched and will take some time. The segment logs report the below statement with "log_duration" and "log_min_duration_statement" is set to "on" and "-1" respectively.

2016-01-19 05:56:26.998951 CST,"gpadmin","flightdata",p14106,th512821024,"127.0.0.1","63772",2016-01-19 05:56:26 CST,0,con13,cmd328,seg1,,,,,"LOG","00000","duration: 1.104 ms",,,,,,"select pg_relation_size('public.test11_idx')",0,,"postgres.c",1814,

Cause 

Query planning is done at the master and then it's dispatched to the segments for execution. While preparing the plan, the optimizer notices an object without statistics, it attempts to collect approximate statistics or size by running the following code:

select pg_relation_size(<relation_name>) ; 

This is an expected behaviour. REINDEX command resets index statistics and they are not collected until the next ANALYZE command is running on the object.

Resolution

  • Analyze the table immediately after an index is rebuilt in order to populate the index statistics.
  • Ensure when running maintenance on the table the below sequence of steps are followed.
vacuum <table-name>
reindex <table-name>
analyze <table-name>

 

Comments

Powered by Zendesk