Pivotal Knowledge Base

Follow

Overview of gpsd

Goal

In this document we will take a look at gpsd execution and internals.

Solution

gpsd is a GPDB utility used to collect object metadata and statistics from a source database and import them in to a target database. This essentially produces a replica of the source databases system catalog and statistics which are used in query planning.

gpsd is included in the GPDB distribution under the $GPHOME/bin directory.

Generally the source database is a customers system and the target database is an internal lab system.

The target database can be used to reproduce issues when the optimizer/planner:

  • Produces the wrong plan
  • Crashes while producing a plan
  • Encounters optimizer/planner related issues

Only metadata and statistics are collected. Data stored in the tables is ignored as this would consume too much space and customers generally do not want to share data as it can be confidential.

How To Execute:

The following command should be executed on the source system. This is where the statistics will be extracted from:

gpsd <dbname> > <gpsd_dump_file>
Example:
gpsd customer_prod1 > customer_prod1_20150808.sql

The sql file should be gzipped and transferred to the target database.
To load the data in to the target database:

psql <dbname> -f <gpsd_dump_file>
Example:
creatdb repro_db
psql repro_db -f customer_prod1_20150808.sql

Now the target database contains the same schema and statistics and should produce the same query plan or encounter the same crash as the source database when the specific query is executed.

gpsd also supports skipping the schema and only collecting statistics:

gpsd -s <dbname> > <gpsd_dump_file>
Example:
gpsd -s customer_prod1 > customer_prod1_20150808_updated.sql

This can be useful when you have already collected the schema from the source database and you only need to get updated statistics.

What Happens Internally:

The following steps are performed internally when gpsd is executed:

1. Dump global objects

if orca:
    pg_dumpall -h  -p  -U  -l  -g --no-gp-syntax
else:
    pg_dumpall -h  -p  -U  -l  -g --gp-syntax

2. Dump schema

pg_dump -h  -p  -U  -s -x --gp-syntax -O 

3. Dump tuple counts from pg_class

SELECT
    pgc.relname, pgn.nspname, pgc.relpages, pgc.reltuples
FROM
    pg_class pgc, pg_namespace pgn
WHERE
    pgc.relnamespace = pgn.oid
    AND pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema')

4. Dump the statistics from pg_statistic

SELECT
    pgc.relname, pgn.nspname, pga.attname, pgt.typname, pgs.*
FROM
    pg_class pgc, pg_statistic pgs, pg_namespace pgn, pg_attribute pga, pg_type pgt
WHERE
    pgc.relnamespace = pgn.oid and pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema')
    AND pgc.oid = pgs.starelid
    AND pga.attrelid = pgc.oid
    AND pga.attnum = pgs.staattnum
    AND pga.atttypid = pgt.oid
ORDER BY
    pgc.relname, pgs.staattnum

 All the output from the above 4 steps outputted to the screen which can be redirected in to an SQL file. 

 

Comments

Powered by Zendesk