Greenplum: How to Troubleshoot Long Running and Hanging Backups

Environment 

 Product: Pivotal Greenplum 4.3.x
 OS: RHEL 6.x

Overview

This article explains how a gpcrondump backup works and can help you troubleshoot a long-running backup job. It covers the stages of a backup in detail, along with the associated queries, which can be checked in the pg_stat_activity catalog view or in the pg_log log file.

The steps below show which processes get initialized during a backup and how errors are reported:

  1. gpcrondump calls gp_dump
  2. gp_dump spins up backup threads that connect to each segment instance to execute the query "SELECT * FROM gp_backup_launch(...)"
  3. Once the threads return, gp_dump creates the report status file and writes the informational and error messages for each segment into it (this report file can be checked as sketched after this list)
  4. If any segment reported a failure, gp_dump exits with status 1
  5. If segments did not report a failure but did report errors, gp_dump exits with status 2
  6. If no failures and no errors are reported, the exit status is 0
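
A quick way to check the outcome is to look at the report file gp_dump writes on the master. This is a minimal sketch, assuming the default db_dumps/<YYYYMMDD> dump location under the master data directory; adjust the paths if the backup uses a custom dump directory:

# List the most recent report files and scan them for failures or errors
# (paths are an assumption based on the default dump layout)
ls -lrt $MASTER_DATA_DIRECTORY/db_dumps/$(date +%Y%m%d)/*.rpt
grep -iE "error|fail" $MASTER_DATA_DIRECTORY/db_dumps/$(date +%Y%m%d)/gp_dump_*.rpt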

Symptom 

During long-running backups, very little is written to the log file, which makes it hard to tell whether the backup is still in progress or hanging. In other words, we do not have a clear status of where the backup is at any given time.

Note: Querying pg_stat_activity for the current SQL that gpcrondump is executing is easier than reading pg_log, as pg_log may have too many entries at the same time, making it difficult to track.
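
For example, the following is a minimal sketch of such a query (GPDB 4.3 column names, where the backend PID column in pg_stat_activity is procpid):

psql -c "SELECT procpid, sess_id, usename, waiting, query_start, current_query
         FROM pg_stat_activity
         WHERE current_query <> '<IDLE>'
         ORDER BY query_start;"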

Checklist

The first step is to check whether the message "Starting Dump process" stays on the screen for a long time without the subsequent message "Releasing pg_class lock," explained in detail below. If we are stuck between these two messages, it probably means that gpcrondump is not even able to connect to the master; in that case no status files are created and no dump agents are started. Refer to this article to troubleshoot further, and see the quick connectivity check sketched below.
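
A minimal sketch of such a connectivity check, run as gpadmin on the master host:

psql -d postgres -c "SELECT version();"   # does a trivial query against the master succeed?
gpstate -s                                # is the cluster up and are all segments healthy?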

LOCK TABLE pg_catalog.pg_class IN EXCLUSIVE MODE

  1. The backup takes an exclusive lock on pg_class to back up the system catalog
  2. Tail pg_log or query pg_stat_activity
  3. The catalog backup should be a quick operation, and the exclusive lock on pg_class should be released quickly
  4. If this stays for a long time, then we need to check if there are any blockers by following the steps outlined in this article: Database Locks (a query to spot blockers on pg_class is sketched below)
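
The following is a minimal sketch of a query to see which sessions hold or wait for locks on pg_class (catalog and column names are per GPDB 4.3 and may need adjusting):

psql -c "SELECT a.usename, l.pid, a.sess_id, l.mode, l.granted, a.current_query
         FROM pg_locks l
         JOIN pg_class c ON c.oid = l.relation
         LEFT JOIN pg_stat_activity a ON a.procpid = l.pid
         WHERE c.relname = 'pg_class'
         ORDER BY l.granted DESC;"

A session that holds the lock (granted = t) while sitting in <IDLE> in transaction, as in the sample output below, is the likely blocker.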

Sample Output

usename | pid | sess_id | waiting | locktype | datname | relname | mode | current_query | query_start 
---------+--------+---------+---------+----------+--------------+----------+---------------+---------------------------------------------------+-------------------
gpadmin | 394305 | 118698 | f | relation | testdb165708 | pg_class | ExclusiveLock | <IDLE> in transaction | 23-APR-2016 13:36
gpadmin | 395907 | 118858 | t | relation | testdb165708 | pg_class | ExclusiveLock | LOCK TABLE pg_catalog.pg_class IN EXCLUSIVE MODE; | 23-APR-2016 13:36

The corresponding OS process (from ps -ef) shows the backend waiting on the LOCK TABLE:

gpadmin 395907 195566 0 13:36 ? 00:00:00 postgres: port 5432, gpadmin testdb165708 127.0.0.1(63156) con118858 127.0.0.1(63156) cmd5 LOCK TABLE waiting

To understand how the locking works in Greenplum, please refer to the Greenplum section on locks from the Admin guide: http://gpdb.docs.pivotal.io/4340/admin_guide/dml.html

LOCK user tables in access share mode

  1. The backup dumps metadata into gp_dump_1_1_<timestamp>.gz and index information into gp_dump_1_1_<timestamp>_post_data.gz
  2. Tail pg_log or query pg_stat_activity; this should be a very fast operation, and the object name should change constantly in pg_log or pg_stat_activity
  3. If this stays for a long time, then we need to check if there are any blockers. Refer to this article: Database Locks
  4. After the steps mentioned above, the backup command is sent to all the segments to start the COPY of the data
  5. The backup checks the catalog to build the list of tables to back up; these catalog queries should change fast (a sketch for monitoring the metadata dump follows this list)
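
One simple way to confirm that the metadata dump is progressing is to watch the dump file grow on the master. A minimal sketch, assuming the default db_dumps/<YYYYMMDD> dump location (adjust the path if a custom dump directory is used):

# Re-list the master metadata dump file every 5 seconds and watch its size grow
watch -n 5 "ls -l $MASTER_DATA_DIRECTORY/db_dumps/$(date +%Y%m%d)/gp_dump_1_1_*.gz"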

Sample Output from gp_dump_1_1_<timestamp>.gz

-- Greenplum Database database dump
SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
SET default_with_oids = false;
-- Name: AOSTORAGEOPTS; Type: AOSTORAGEOPTS; Schema: -; Owner: 
SET gp_default_storage_options = 'appendonly=false,blocksize=32768,compresstype=none,checksum=true,orientation=row';
-- Name: bdw_elt; Type: SCHEMA; Schema: -; Owner: gpadmin
CREATE SCHEMA bdw_elt;
ALTER SCHEMA bdw_elt OWNER TO gpadmin;
...
...
REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM gpadmin;
GRANT ALL ON SCHEMA public TO gpadmin;
GRANT ALL ON SCHEMA public TO PUBLIC;
-- Greenplum Database database dump complete

Sample Output from gp_dump_1_1_<timestamp>_post_data.gz

-- Greenplum Database database dump
SET statement_timeout = 0;
SET client_encoding = 'SQL_ASCII';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
SET search_path = udw_ubdq_dm, pg_catalog;
DROP INDEX udw_ubdq_dm.xpkdq_dw_dcsn_cntrl;
DROP INDEX udw_ubdq_dm.dq_etl_cntrl_idx1;
...
...
-- Name: xpkdq_dw_dcsn_cntrl; Type: INDEX; Schema: udw_ubdq_dm; Owner: gpadmin; Tablespace:

CREATE UNIQUE INDEX xpkdq_dw_dcsn_cntrl ON dq_dw_dcsn_cntrl USING btree (src_run_date_id);
-- Greenplum Database database dump complete

COPY test (col1, col2, col3) TO stdout IGNORE EXTERNAL PARTITIONS

  1. Copy table data into gp_dump_0_<dbid>_<timestamp>.gz
  2. Tail pg_log or query pg_stat_activity to see whether the table being copied changes relatively fast
  3. If the COPY stays on one particular table for a long time, the most common causes are listed below:
  • Check if there are any blockers: Database Locks
  • Find out on which segments the backup is still running, using gpssh -f <hostfile> and then "ps -ef | grep dump_agent | wc -l"
  • Check if pg_attribute on all segments is analyzed; if it is still slow, check the statistics on the entire catalog on the segment running the slow backup: gpcrondump is Hung on Some Segments
  • Check whether any table has too many partitions; refer to How To Access Partition Information
  • Vacuum tables to see if there is bloat. Refer to the article: Identify Tables that need Vacuum
  • Check for any localhost entries in pg_hba.conf that use md5 or password authentication and need to be changed to trust. Refer to the article: pg hba.conf
  • Check if any tables have data skew: Check data skew (a few of these checks are sketched after this list)
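
The sketches below cover three of the checks above. The queries assume GPDB 4.3 catalog views; public.test, the partition threshold of 100, and the file hostfile_segments (a list of segment hostnames) are placeholders:

# 1. Tables with a large number of partitions
psql -c "SELECT schemaname, tablename, count(*) AS partitions
         FROM pg_partitions
         GROUP BY 1, 2
         HAVING count(*) > 100
         ORDER BY partitions DESC;"

# 2. Row-count skew for the table the COPY is stuck on
psql -c "SELECT gp_segment_id, count(*) AS cnt
         FROM public.test
         GROUP BY 1
         ORDER BY 2 DESC;"

# 3. Number of dump agents still running on each segment host
#    ([d]ump_agent keeps grep from matching itself)
gpssh -f hostfile_segments "ps -ef | grep [d]ump_agent | wc -l"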

Check OS Level

Process Tracing

  1. Make sure the process is not hung at the OS level.
  2. Read the article about running strace on the backup process to troubleshoot: Troubleshoot Hung Statement.
  3. The strace output should show activity that is not repeating.
  4. If the output is repeating, the process is probably stuck. For the above example of an exclusive lock on pg_class, we would see something like the output below, and the strace would be stuck or in a loop:
[gpadmin@gpdb-sandbox ~]$ strace -p 395907
Process 395907 attached
semop(24314069, {{1, -1, 0}}, 1

Resource Utilization

  1. Make sure system resources such as memory, CPU, and I/O are not saturated.
  2. sar -P ALL 1 2 (CPU); sar -r 1 3 (memory and swap); sar -b 1 3 (disk I/O).
  3. None of the resources should be over-utilized.
  4. Sustained 100% CPU or memory utilization, or excessive disk I/O, indicates a bottleneck.

Sample output

[root@gpdb-sandbox ~]# sar -P ALL 1 1
Linux 2.6.32-573.el6.x86_64 (gpdb-sandbox.localdomain) 04/23/2016 _x86_64_ (2 CPU)
01:44:21 PM CPU %user %nice %system %iowait %steal %idle
01:44:22 PM all 0.00 0.00 2.01 0.00 0.00 97.99
01:44:22 PM 0 0.00 0.00 4.00 0.00 0.00 96.00
01:44:22 PM 1 0.00 0.00 0.99 0.00 0.00 99.01

[root@gpdb-sandbox ~]# sar -r 1 1
Linux 2.6.32-573.el6.x86_64 (gpdb-sandbox.localdomain) 04/23/2016 _x86_64_ (2 CPU)
01:45:28 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
01:45:29 PM 228852 7817136 97.16 158548 7147320 2072928 20.44
Average: 228852 7817136 97.16 158548 7147320 2072928 20.44

[root@gpdb-sandbox ~]# sar -b 1 1
Linux 2.6.32-573.el6.x86_64 (gpdb-sandbox.localdomain) 04/23/2016 _x86_64_ (2 CPU)
01:46:00 PM tps rtps wtps bread/s bwrtn/s
01:46:01 PM 0.00 0.00 0.00 0.00 0.00
Average: 0.00 0.00 0.00 0.00 0.00

Check Network Level

To troubleshoot the network component, please refer to Troubleshooting the network for long running backups/jobs/queries.

Check Hardware Level

To troubleshoot the hardware component, please refer to Troubleshooting the hardware for long running backups/jobs/queries.

Open ticket with Data Domain Support

If all of the above have been verified and documented in the ticket, open a service request with EMC Support for the Data Domain product. This is the basic information that needs to be documented in the ticket:

  • Does the Data Domain have at least 20% free space?
  • Are there any alerts on the DDR?
  • Is there any packet loss reported on the DD side?
  • Is the number of links from all the segments to the DDs configured the same?
  • Are all the GPDB nodes configured the same way to the DD?
  • How many streams are in use on the DDR (are we using too many)?

Note: If random segments on a specific server fall behind during backup, rebooting that node can help fix the issue. To do this, stop the database, reboot the node, and then start the database again, as sketched below.
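
A minimal sketch of that sequence, run as gpadmin from the master (<problem-host> is a placeholder for the affected segment server):

gpstop -a                        # stop the database cleanly
ssh root@<problem-host> reboot   # reboot the affected segment host
gpstart -a                       # start the database once the host is back up
gpstate -e                       # confirm no segments are down or in change tracking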

Additional Links

For further information on the files that gpcrondump creates on the Master and the Segments, please refer to gpcrondump file layouts.

To test the Data Domain ddboost backup speed, use the ddpconnchk tool: How to check DDboost Speed.

Related Information

Refer to Troubleshooting Slow Running Restore if the restore was also causing performance issues.

The Python script "gpbackuptime_collector" can be used to extract the time taken by each step in the gpcrondump process (this is not real time; it is run after the backup completes).

For the information about the tool: Greenplum Backup Time Collector.

For more information see How to read the output from the gpbackuptime collector.
