Pivotal Greenplum Database (GPDB)
This article describes the different option to check data skew on tables.
The different method to check for data skew on a table/relation.
This is the most common and straightforward way to check for skew.
SELECT gp_segment_id, COUNT(*)
GROUP BY gp_segment_id
ORDER BY 1;
This will count the rows per segment and give you a good indication of the skew. This is very useful if you want to know the data distribution of a single table or a list of tables in the database.
The downside to this is, it physically counts the rows for this one table at a time. So you need to write command or build a script to check for skew for all tables in the database.
If you need to check the skew for all tables in the database, the view "gp_skew_coefficients" & "gp_skew_idle_fractions" under the gp_toolkit schema is a better option.
This view can be really time-consuming if there are a large number of relation and the data size of those relations is huge ( as it has to scan a large number of blocks ).
The alternative method of quickly checking the data skew is by checking the table/relation OS file size on all the segment, this method is described in details in the article and is much quicker when you want to check the data skew of the entire relationship in a large database environment.
Once you identify the variation of data size (OS size) you can use the method 1 to further understand how the data (records) is distributed across various segments.
The downside of this method is if the table is heavily bloated then the invalid tuple size is also taken into consideration since the DML doesn't release the invalid OS pages hence the OS file size remain unaltered.