Pivotal Knowledge Base

Follow

default_statistics_target - Explained

Goal

The below documents explains on how parameter default_statistics_target works with example.

Solution

Definition :

As per the documentation, the definition of "default_statistics_target" is described as, " Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner's estimates ".

Default value / Range :

By default , greenplum set the value to 25 , it can accept values from 1 to 1000. The parameter can be set at session level and doesn't require a restart of the database.

Usage :

From the description/definition mentioned above , it kind of confusing to understand what its used for , in short and in basic term, this parameter control the way the stats are collected , with value 1 being the least estimated/accurate statistics and the value 1000 being the most accurate statistics , obviously with the expense of time / resources ( CPU , memory etc ) / space . Normally the default value is sufficient to get a accurate plan , but if you have a complex data distribution / or a column is referenced in the query quite often , then setting a higher value might help in getting a better statistics on the table and hence a better plan for the optimizer to execute.

Example

To put into practice let take the below example.

  • Current value of the parameter
gpadmin=# show default_statistics_target ;
 default_statistics_target
---------------------------
 25
(1 row)
Time: 0.385 ms

Create a table

gpadmin=# create table test_stats ( a int );
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
Time: 20.562 ms
  • Insert data into it
gpadmin=# insert into test_stats values ( generate_series ( 1,100000));
INSERT 0 100000
Time: 505.107 ms
  • the pg_stats ( view derived from pg_statistics table ) with 25 looks like
gpadmin=# select * from pg_stats where tablename='test_stats';
 schemaname | tablename  | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |                                                                        histogram_bounds                                                                        | correlation
------------+------------+---------+-----------+-----------+------------+------------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | test_stats | a       |         0 |         4 |         -1 |                  |                   | {4,3969,8084,12315,16248,20131,24190,28145,32057,35955,39838,43994,47808,52015,56015,59827,63920,67913,72007,75926,79919,83783,87843,91818,95881,99976,100000} |
(1 row)
Time: 27.106 ms

keep an eye on the histogram_bound column change the value of the parameter.

  • default_statistics_target to 100 and reanalyze the table , the pg_stats value now has more buckets of histogram and leading to more accurate prediction.
gpadmin=# set default_statistics_target to 100;
SET
Time: 2.854 ms
gpadmin=# analyze test_stats ;
ANALYZE
Time: 408.362 ms
gpadmin=# select * from pg_stats where tablename='test_stats';
 schemaname | tablename  | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |                                                                                                                                                                                                                                                                                                  histogram_bounds                                                                                                                                                                                                                                                                                                   | correlation
------------+------------+---------+-----------+-----------+------------+------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | test_stats | a       |         0 |         4 |         -1 |                  |                   | {1,1001,2001,3001,4001,5001,6001,7001,8001,9001,10001,11001,12001,13001,14001,15001,16001,17001,18001,19001,20001,21001,22001,23001,24001,25001,26001,27001,28001,29001,30001,31001,32001,33001,34001,35001,36001,37001,38001,39001,40001,41001,42001,43001,44001,45001,46001,47001,48001,49001,50001,51001,52001,53001,54001,55001,56001,57001,58001,59001,60001,61001,62001,63001,64001,65001,66001,67001,68001,69001,70001,71001,72001,73001,74001,75001,76001,77001,78001,79001,80001,81001,82001,83001,84001,85001,86001,87001,88001,89001,90001,91001,92001,93001,94001,95001,96001,97001,98001,99001,100000} |
(1 row)
Time: 9.201 ms

similarly increasing the value of default_statistics_target yield better and accurate stats of the table and its data distribution. If you wish to collect stats on a specific column with specific / constant value , and don't want to alter the stats collection irrespective of what value default_statistics_target is defined , you can define stats method using.

Alter table <table_name> alter column <column_name> set statistics < value from 1 to 1000 > ;

to revert and to analyze to the value defined by default_statistics_target , use

Alter table <table_name> alter column <column_name> set statistics -1 ;

Pros / Cons

Pros Better statistics of the distributed data. Better plan as now the planner has more accurate statistics of the table. Faster and quicker response from the query , since now it has a good plan Cons Increasing the value of the parameter , lead to more time needed for analyze to complete , from my quick test of three values of the parameter with 3 columns yield the below time.

Serial # Value Time
1 25 300ms
2 100 2 sec
3 1000 10 sec

Now to do the extra work of collect statistics it will consume additional memory / CPU to achieve this. pg_statistics will consume more space , since now it has to store more information about the stats of the table in its histogram column.

Comments

Powered by Zendesk