Pivotal Knowledge Base

GPCRONDUMP is Hung/Slow on some Segments or after Expansion

Environment

Product: Pivotal Greenplum
Version: All
OS: RHEL 6.x

Symptom

You encounter gpcrondump running slowly on some segments during a normal database backup (i.e. dumping of data is slower on one segment compared to the rest), or after an expansion of the cluster.

In the status files of some segments, you will find indications that the batch jobs take a couple of minutes to execute their queries before the dump process can actually begin:

[........]
[........]
20140911:01:40:36|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:40:36|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:41:37|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:41:37|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:41:17|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:41:19|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:41:19|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:51:19|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:52:10|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:52:10|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:52:10|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:52:12|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:52:12|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:52:12|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:52:14|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:52:04|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:52:04|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:53:05|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:53:05|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:53:05|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:56:07|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:56:07|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:56:07|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:56:08|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:56:08|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:56:08|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
20140911:01:56:00|gp_dump_agent-[INFO]:-Finished processing batch 1 of tables
20140911:01:57:30|gp_dump_agent-[INFO]:-Converting tablenames to oids
20140911:01:57:30|gp_dump_agent-[INFO]:-Processing tables in batches of 1000
[........]
[........]

Cause

The issue is caused by bad (stale) statistics on the catalog tables in the segments.

During normal database operation, query plans are generated on the master, so a regular ANALYZE run from the master (through a normal psql connection) only updates the statistics in the master's catalog.

gpcrondump, however, in order to exploit Greenplum's parallelism and speed up the backup, logs in to each segment directly (in utility mode via PGOPTIONS) and uses the segment's own catalog to generate the plan and fetch the data.

When the catalog table pg_attribute is large and its statistics are bad, fetching information from pg_attribute is slow, which delays the start of the dump on that segment.
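
To see whether this is what you are hitting, one rough check (a sketch only; the segment host sdw1, port 40000 and the <database-name> placeholder below are illustrative, not values from this article) is to connect to a suspect primary segment in utility mode and compare the live row count of pg_attribute with the row estimate the planner has stored in pg_class.reltuples. A large gap between the two numbers suggests the statistics are stale:

# Compare pg_attribute's actual row count with the planner's stored estimate
# on one primary segment (utility mode). Substitute your own host, port and
# database name.
PGOPTIONS='-c gp_session_role=utility' psql -d <database-name> -h sdw1 -p 40000 -Atc "select (select count(*) from pg_attribute) as actual_rows, (select reltuples from pg_class where relname='pg_attribute') as planner_estimate;"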

Resolution

Analyze the pg_attribute and pg_attribute_encoding tables on all the primary segments (you can use the script below) and re-run gpcrondump.

NOTE: Please ensure you replace <database-name> with the name of the database you are backing up.

psql -Atc "select hostname,port from gp_segment_configuration where role='p' and content<>-1" | while read line;
do
export hostname=`echo $line | cut -d'|' -f1`
export port=`echo $line | cut -d'|' -f2`
echo "|--- Analyzing" $hostname ": port" $port " ---|"
PGOPTIONS='-c gp_session_role=utility' psql -d <database-name> -p $port -h $hostname -xtc "analyze pg_attribute"
PGOPTIONS='-c gp_session_role=utility' psql -d <database-name> -p $port -h $hostname -xtc "analyze pg_attribute_encoding;"
done

IMPORTANT: Make sure you run the ANALYZE on a quiet system to avoid impacting database performance.
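
If you want to confirm that the ANALYZE actually ran on a segment before re-running the backup, one option (a sketch relying on Greenplum's pg_stat_last_operation metadata table, which records the last operation such as ANALYZE against each object; the host, port and database below are placeholders) is:

# List the operations most recently recorded against pg_attribute on a segment;
# an ANALYZE entry with a fresh timestamp shows the statistics were refreshed.
PGOPTIONS='-c gp_session_role=utility' psql -d <database-name> -h sdw1 -p 40000 -xtc "select staactionname, statime from pg_stat_last_operation where classid='pg_class'::regclass and objid='pg_attribute'::regclass order by statime desc;"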

If the start of the dump process on the segments is still slow after the ANALYZE, it is suggested to analyze the entire database catalog on all segments (sample script below) and re-run the backup.

# Generate one ANALYZE statement per pg_catalog table into /tmp/analyzetable.sql,
# then run that file against every primary segment in utility mode.
psql <database-name> -Atc "select \$\$analyze pg_catalog.\$\$||tablename||\$\$;\$\$ from pg_tables where schemaname='pg_catalog'" > /tmp/analyzetable.sql
psql -Atc "select hostname,port from gp_segment_configuration where role='p' and content<>-1" | while read line;
do
  hostname=`echo $line | cut -d'|' -f1`
  port=`echo $line | cut -d'|' -f2`
  echo "|--- Analyzing" $hostname ": port" $port " ---|"
  PGOPTIONS='-c gp_session_role=utility' psql -a -d <database-name> -p $port -h $hostname -xtf /tmp/analyzetable.sql
done
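
For reference, the generated /tmp/analyzetable.sql simply contains one ANALYZE statement per pg_catalog table; the exact list and order depend on your Greenplum version, but the contents look along these lines:

$ head -3 /tmp/analyzetable.sql
analyze pg_catalog.pg_aggregate;
analyze pg_catalog.pg_am;
analyze pg_catalog.pg_amop;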

Comments

  • Kushal Choubay

    In some cases, if the cluster has a large number of rows in the "pg_attribute_encoding" and "pg_attrdef" tables, the metadata phase will also take a long time on the segments. ANALYZE needs to be run on these tables on the segments as well to overcome the issue.
  • Shawn Yan

    Found that if the backup has already started, this workaround can be applied directly, without restarting the backup.
