What are the different option to check on data skew on tables ?
Below are the different method to check for data skew on a table / relation.
This is the most common and straight forward way to check for skew.
SELECT gp_segment_id, COUNT(*)
GROUP BY gp_segment_id
ORDER BY 1;
This will count the rows per segment and give you a good indication of the skew. This is very useful if you want to know the data distribution of a single table or a list of tables in the database.
The downside to this is, it physically counts the rows for this one table at at time. So you need to write command or build a script to check for skew for all tables in the database.
If you need to check the skew for all tables in the database , the view "gp_skew_coefficients" & "gp_skew_idle_fractions" under the gp_toolkit schema is a better option.
This view can be really time consuming if there are large number of relation and the data size of those relations is very huge ( as it has to scan large number of blocks ).
The alternative method of quickly checking the data skew is by checking the table/relation OS file size on all the segment, this method is described in details in the article and is much quicker when you want to check the data skew of the entire relation in a large database environment.
Once you identify the variation of data size (OS size) you can use the method 1 to further understand on how the data (records) is distributed across various segments.
Downside of this method is if the table is heavily bloated then the invalid tuple size is also taken into consideration, since the DML doesn't release the invalid OS pages hence the OS file size remain unaltered.