Pivotal Knowledge Base

Follow

default_statistics_target - Explained

Environment

Pivotal Greenplum Database (GPDB) all versions

Introduction

The below documents explains how parameter default_statistics_target works with an example.

Description

Definition:

As per the documentation, the definition of "default_statistics_target" is described as, " Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner's estimates ".

Default value/Range:

By default, Greenplum set the value to 25, it can accept values from 1 to 1000. The parameter can be set at the session level and doesn't require a restart of the database.

Usage:

From the description/definition mentioned above , it kind of confusing to understand what its used for, in short, and in basic term, this parameter controls the way the stats are collected, with value 1 being the least estimated/accurate statistics and the value 1000 is the most accurate statistics, obviously with the expense of time/resources (CPU, memory etc ) / space . Normally the default value is sufficient to get an accurate plan, but if you have a complex data distribution / or a column is referenced in the query quite often, then setting a higher value might help in getting a better statistics on the table and hence a better plan for the optimizer to execute.

Example

To put into practice let take the below example.

  • The current value of the parameter
gpadmin=# show default_statistics_target ;
 default_statistics_target
---------------------------
 25
(1 row)
Time: 0.385 ms

Create a table

gpadmin=# create table test_stats ( a int );
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
Time: 20.562 ms
  • Insert data into it
gpadmin=# insert into test_stats values ( generate_series ( 1,100000));
INSERT 0 100000
Time: 505.107 ms
  • the pg_stats ( view derived from pg_statistics table ) with 25 looks like
gpadmin=# select * from pg_stats where tablename='test_stats';
 schemaname | tablename  | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |                                                                        histogram_bounds                                                                        | correlation
------------+------------+---------+-----------+-----------+------------+------------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | test_stats | a       |         0 |         4 |         -1 |                  |                   | {4,3969,8084,12315,16248,20131,24190,28145,32057,35955,39838,43994,47808,52015,56015,59827,63920,67913,72007,75926,79919,83783,87843,91818,95881,99976,100000} |
(1 row)
Time: 27.106 ms

keep an eye on the histogram_bound column change the value of the parameter.

  • default_statistics_target to 100 and reanalyze the table, the pg_stats value now has more buckets of histogram and leading to more accurate prediction.
gpadmin=# set default_statistics_target to 100;
SET
Time: 2.854 ms
gpadmin=# analyze test_stats ;
ANALYZE
Time: 408.362 ms
gpadmin=# select * from pg_stats where tablename='test_stats';
 schemaname | tablename  | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |                                                                                                                                                                                                                                                                                                  histogram_bounds                                                                                                                                                                                                                                                                                                   | correlation
------------+------------+---------+-----------+-----------+------------+------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | test_stats | a       |         0 |         4 |         -1 |                  |                   | {1,1001,2001,3001,4001,5001,6001,7001,8001,9001,10001,11001,12001,13001,14001,15001,16001,17001,18001,19001,20001,21001,22001,23001,24001,25001,26001,27001,28001,29001,30001,31001,32001,33001,34001,35001,36001,37001,38001,39001,40001,41001,42001,43001,44001,45001,46001,47001,48001,49001,50001,51001,52001,53001,54001,55001,56001,57001,58001,59001,60001,61001,62001,63001,64001,65001,66001,67001,68001,69001,70001,71001,72001,73001,74001,75001,76001,77001,78001,79001,80001,81001,82001,83001,84001,85001,86001,87001,88001,89001,90001,91001,92001,93001,94001,95001,96001,97001,98001,99001,100000} |
(1 row)
Time: 9.201 ms

similarly increasing the value of default_statistics_target yield better and accurate stats of the table and its data distribution. If you wish to collect stats on a specific column with specific/constant value and don't want to alter the stats collection irrespective of what value default_statistics_target is defined, you can define stats method using.

Alter table <table_name> alter column <column_name> set statistics < value from 1 to 1000 > ;

to revert and to analyze to the value defined by default_statistics_target , use

Alter table <table_name> alter column <column_name> set statistics -1 ;

Pros and Cons

Pros- Better statistics of the distributed data. The better plan as now the planner has more accurate statistics of the table. Faster and quicker response from the query, since now it has a good plan

Cons- Increasing the value of the parameter, leading to more time needed for analyze to complete, from my quick test of three values of the parameter with 3 columns yield the below time.

Serial # Value Time
1 25 300ms
2 100 2 sec
3 1000 10 sec

Now to do the extra work of collect statistics it will consume additional memory / CPU to achieve this. pg_statistics will consume more space, since now it has to store more information about the stats of the table in its histogram column.

Comments

Powered by Zendesk