Pivotal Knowledge Base

Follow

How to Check xlog apply progress during crash recovery (Startup Pass 3)

Environment

Product Version
Pivotal Greenplum (GPDB) 4.3.x
OS RHEL 6.x
Others  

Symptom

  • Greenplum Database cannot be connected during the auto recovery phase of applying the xlog after a database crash.
[gpadmin@avw7hdm2p1 ~]$ psql
psql: FATAL: the database system is starting up
  • DBA is unable to estimate the time needed for this phase.
  • Master gets stuck due to applying redo logs for a long time.
  • The "startup pass 3 process" process is running and CPU time keeps increasing.
[gpadmin@avw7hdm2p1 gpseg-1]$ ps -ef|grep -i post |grep process
gpadmin 9841 9840 0 Feb05 ? 00:00:24 postgres: port 5432, master logger process
gpadmin 365107 9840 33 05:38 ? 02:18:09 postgres: port 5432, startup pass 3 process

Resolution

Here is a way to help us get estimated time for the recovery process.

  1. From the master log, we can get the redo location. In the following example, 322/F26EF040 is the current redo location which is being applied.
REDO PASS 3 @ 322/F26EF040; LSN 322/F26F6AE0: prev 322/F26EEF88;
  1. Connect to any working GPDB instance and run the following query to get the xlog file name. 
liang=# select pg_xlogfile_name('322/F26EF040');
pg_xlogfile_name
--------------------------
00000001000003220000003C

We can know that 322/F26EF040 is in the xlog "00000001000003220000003C".

  1. Check the $MASTER_DATA_DIRECTORY/pg_xlog folder which has all the xlog files. The xlogs are applied sequentially according to the last modified time so we are able to know how many xlogs have been applied and how many left. 

NOTE: Until all the xlogs are applied to the database the database will not accept the connection.

 

Alternative method

First check the PID of the recovery process:

[gpadmin@avw7hdm2p1 gpseg-1]$ ps -ef | grep postgres
gpadmin 744935 1 0 03:56 ? 00:00:03 /usr/local/hawq-1.2.1.3/bin/postgres -D /data/hawq/master_new/gpseg-1 -p 5432 -b 1 -z 176 --silent-mode=true -i -M master -C -1 -x 307 -c gp_role=utility
gpadmin 744936 744935 0 03:56 ? 00:00:01 postgres: port 5432, logger process
gpadmin 745108 744935 83 03:57 ? 02:00:12 postgres: port 5432, startup pass 3 process
gpadmin 766611 765875 0 06:20 pts/699 00:00:00 grep postgres

Then run lsof on that PID and grep for pg_xlog:

[gpadmin@avw7hdm2p1 gpseg-1]$ /usr/sbin/lsof -p 745108 | grep pg_xlog
postgres 745108 gpadmin 4r REG 8,64 67108864 85899379122 /data/hawq/master_new/gpseg-1/pg_xlog/000000010000036E00000025
[gpadmin@avw7hdm2p1 gpseg-1]$

 Now list all the xlog files and check which one is currently using, and how many are left.

[...]
-rw------- 1 gpadmin gpadmin 67108864 Feb 25 20:30 000000010000036E00000030
-rw------- 1 gpadmin gpadmin 67108864 Feb 25 20:48 000000010000036E00000031
-rw------- 1 gpadmin gpadmin 67108864 Feb 25 22:02 000000010000036E00000032
-rw------- 1 gpadmin gpadmin 67108864 Feb 25 23:16 000000010000036E00000033
-rw------- 1 gpadmin gpadmin 67108864 Feb 25 23:16 000000010000036E00000034
-rw------- 1 gpadmin gpadmin 67108864 Feb 25 23:52 000000010000036E00000035
-rw------- 1 gpadmin gpadmin 67108864 Feb 26 01:48 000000010000036E00000036
-rw------- 1 gpadmin gpadmin 67108864 Feb 26 01:48 000000010000036E00000037

Comments

Powered by Zendesk