Pivotal Knowledge Base

Follow

How to increase the performance of the gp_toolkit.gp_skew_coefficients

Environment

Product Version
Pivotal Greenplum All Versions

Problem

If the GPDB has more than 4000+ relations then gp_skew_coefficients for a single table would take for ever to return results, often this leads to users cancelling using ctrl+c or through pg_cancel_backend / pg_terminate_backend.

If the solution described below is also slow to extract the information, would recommend to have a look here at a alternative way to calculate skew.

Cause

Inside function gp_skew_coefficients function, there is a for loop searching for table from - gp_toolkit.__gp_user_data_tables_readable which takes a long time to look for a table.

An alternative way is to check/run count(*) on the table to check the distribution of a table.

SELECT gp_segment_id, count(*) FROM table_name GROUP BY gp_segment_id;

But again this would take a huge amount if time and a lot of manual task.

Solution

Altering the function to the one in the below example gives a better response time and provide result quickly.

-- Create a function

set SEARCH_PATH='gp_toolkit';

CREATE OR REPLACE FUNCTION gp_toolkit.sp_gp_skew_coefficients(schemanm varchar(200), tablename varchar(300)) RETURNS SETOF gp_skew_analysis_t
 	AS $$
 DECLARE
 	skcoid oid;
 	skcrec record;
BEGIN
 	SELECT autoid INTO skcoid 
 	FROM
		gp_toolkit.__gp_user_data_tables_readable 
 	WHERE autrelstorage not in ('x','v') 
	AND autnspname = schemanm 
	AND autrelname = tablename;
 	
 	SELECT * INTO skcrec
 	FROM
 		gp_toolkit.gp_skew_coefficient(skcoid); 
 	RETURN next skcrec;
END
 	$$
LANGUAGE plpgsql;

-- Now using the function for retrieving a specific table, use the below query.

Replace '<schema-name>','<table-name>' in the below query with schemaname.tablename that you want to know the skew details of.

select 
	skew.skewoid AS skcoid, 
	pgn.nspname AS skcnamespace, 
	pgc.relname AS skcrelname, 
	skew.skewval AS skccoeff 
from 
	gp_toolkit.sp_gp_skew_coefficients('<schema-name>','<table-name>') skew(skewoid, skewval) 
JOIN pg_class pgc ON skew.skewoid = pgc.oid 
JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid;

-- For all the list of table (At Unix Prompt)

awk 'BEGIN { FS="|"; printf "%-10s | %-20s | %-30s | %s\n", "skcoid", "skcnamespace" ,"skcrelname" , "skccoeff"
            printf "%-10s | %-20s | %-30s | %s\n", "--------","------------","----------", "--------"}' ; psql -Atc "select tablename,schemaname from pg_tables where schemaname not in ('information_schema','pg_catalog','gp_toolkit') and tablename not in ( select relname from pg_class where relstorage in ('x','v'))"  | while read line;

do
export relname=`echo $line | cut -d'|' -f1` export relnamespace=`echo $line | cut -d'|' -f2`
psql -Atc "select skew.skewoid AS skcoid, pgn.nspname AS skcnamespace, pgc.relname AS skcrelname, skew.skewval AS skccoeff from gp_toolkit.sp_gp_skew_coefficients('"${relnamespace}"','"${relname}"') skew(skewoid, skewval) JOIN pg_class pgc ON skew.skewoid = pgc.oid JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid
-- AND skew.skewval > x
; " | grep -v "row" done | sort -n -t '|' -k4 -r | while read line;
do
echo $line | awk -F '|' '{printf "%-10s | %-20s | %-30s | %s\n", $1,$2,$3,$4}'
done

Tip:

  • If you need top 10 highly skewed table , use "head -10" at the end of the query, like
[......]
[......]
done  | sort -n -t '|' -k4 -r | while read line;
do
echo $line | awk -F '|' '{printf "%-10s | %-20s | %-30s | %s\n", $1,$2,$3,$4}'
done | head -10
  • If you need all the relation that has a skew coefficient value greater than x , you can uncomment the line ( -- AND skew.skewval > x ) in the above script and place the value of x you would be interested in, for. eg.s
[......]
JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid
AND skew.skewval > 100
[......]

Comments

Powered by Zendesk