How to check data skew across segments on v4.x
In Greenplum v4.x the Administrative Schema gp_toolkit can be used to check for data skew. The gp_toolkit schema contains a number of views that you can access using SQL commands. For convenience, you may want to add the gp_toolkit schema to your schema search path. For example:
ALTER ROLE myrole SET search_path TO myschema,gp_toolkit;
The following views can help diagnose if a table has uneven data distribution(Data Skew): gp_skew_coefficients This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The lower the value, the better.
SELECT * FROM gp_toolkit.gp_skew_coefficients;
The view "gp_skew_idle_fractions" shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew.
SELECT * FROM gp_toolkit.gp_skew_idle_fractions
Once you have the table names with the high skccoeff values from the gp_skew_coefficients query, you can then check the data distribution across the Segment servers. Please refer to article