Pivotal Knowledge Base

Follow

Initialize a Pivotal HDB Standby Master & Test Failover

Environment

  • PHD 1.x
  • PHD 2.x
  • Two Pivotal HDB masters nodes

Purpose

The pivotal documentation explains why having a standby master is important. In this artilcle will show you how easy it is to initialize a standby master and how to perform a failover/failback test.

Perquisites

  • Pivotal HDB and pxf rpms must be installed on standby master node
  • gpadmin user must exist on standby master node
  • gpadmin user must have /usr/local/hawq/greenplum_path.sh sourced in ~/.bash_profile on both master and standby

Steps

1. Verify the gp_segment_configuration before adding a standby master

From the output we see the current HDB master is hdm1.hadoop.local. We know DBID 1 is the primary master because the master instances always have a content id of "-1" and the role is set to "p" ( primary). In this output there is not reference to a standby master so we now have to initialize one.

[gpadmin@hdm1 ~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from pg_catalog.gp_segment_configuration ;
dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+-------------------+-------------------+------------------+------------
1 | -1 | p | p | s | u | 5432 | hdm1.hadoop.local | hdm1.hadoop.local | |
2 | 0 | p | p | s | u | 40000 | hdw1.hadoop.local | hdw1.hadoop.local | |
4 | 2 | p | p | s | u | 40000 | hdw2.hadoop.local | hdw2.hadoop.local | |
6 | 4 | p | p | s | u | 40000 | hdw3.hadoop.local | hdw3.hadoop.local | |
3 | 1 | p | p | s | u | 40001 | hdw1.hadoop.local | hdw1.hadoop.local | |
5 | 3 | p | p | s | u | 40001 | hdw2.hadoop.local | hdw2.hadoop.local | |
7 | 5 | p | p | s | u | 40001 | hdw3.hadoop.local | hdw3.hadoop.local | |
(7 rows)

2. Start initializing / adding a HDB standby master

[gpadmin@hdm1 ~]$ gpinitstandby -s hdm2.hadoop.local
The filespace locations on the master must be mapped to
locations on the standby. These locations must be empty on the
standby master host. The default provided is the location of
the filespace on the master. In most cases the defaults can be
used. The exception is the pg_system filespace which must be in
the same location on both the master and standby master.

20140704:00:07:11:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20140704:00:07:12:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Checking for filespace directory /data1/master/gpseg-1 on hdm2.hadoop.local
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Checking for filespace directory /data1/master/dfs/gpseg-1 on hdm2.hadoop.local
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:------------------------------------------------------
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum standby master initialization parameters
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:------------------------------------------------------
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum master hostname = hdm1.hadoop.local
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum master data directory = /data1/master/gpseg-1
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum master port = 5432
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum standby master hostname = hdm2.hadoop.local
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum standby master port = 5432
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum standby master data directory = /data1/master/gpseg-1
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum update system catalog = On
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Greenplum stop database mode = smart
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:------------------------------------------------------
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:- Filespace locations
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:------------------------------------------------------
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-pg_system -> /data1/master/gpseg-1
20140704:00:07:14:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-dfs_system -> /data1/master/dfs/gpseg-1
Do you want to continue with standby master initialization? Yy|Nn (default=N):
> y
20140704:00:07:34:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Stopping database...
20140704:00:07:45:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Updating pg_hba.conf file...
20140704:00:07:46:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Starting master in utility mode...
20140704:00:07:52:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Adding standby master to catalog...
20140704:00:07:52:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Database catalog updated successfully.
20140704:00:07:52:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Stopping master...
20140704:00:07:56:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Forcing changed blocks to disk for filespace /data1/master/dfs/gpseg-1...
20140704:00:07:57:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Filespace directory does not exist on hdm2.hadoop.local
20140704:00:07:57:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Creating hdm2.hadoop.local:/data1/master/dfs/gpseg-1
20140704:00:07:58:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Copying filespace directory to hdm2.hadoop.local
20140704:00:07:58:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Forcing changed blocks to disk for filespace /data1/master/gpseg-1...
20140704:00:07:59:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Filespace directory does not exist on hdm2.hadoop.local
20140704:00:07:59:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Creating hdm2.hadoop.local:/data1/master/gpseg-1
20140704:00:08:00:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Copying filespace directory to hdm2.hadoop.local
20140704:00:08:06:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Updating filespace flat files
20140704:00:08:06:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Updating filespace flat files
20140704:00:08:06:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Try to create temporary directories
20140704:00:08:07:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Removing pg_hba.conf backup...
20140704:00:08:08:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Starting database in production mode...
20140704:00:08:39:705040 gpinitstandby:hdm1:gpadmin-[INFO]:-Successfully created standby master on hdm2.hadoop.local

3. Verify if standby master has been added successfully

dbid 1 on host hdm1.hadoop.local is the current active master instance and dbid 8 on host hdm2.hadoop.local is the current standby instance. dbid 8 role is set to "m" (mirror).

[gpadmin@hdm1 ~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from pg_catalog.gp_segment_configuration ;
dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+-------------------+-------------------+------------------+------------
1 | -1 | p | p | s | u | 5432 | hdm1.hadoop.local | hdm1.hadoop.local | |
2 | 0 | p | p | s | u | 40000 | hdw1.hadoop.local | hdw1.hadoop.local | |
4 | 2 | p | p | s | u | 40000 | hdw2.hadoop.local | hdw2.hadoop.local | |
6 | 4 | p | p | s | u | 40000 | hdw3.hadoop.local | hdw3.hadoop.local | |
3 | 1 | p | p | s | u | 40001 | hdw1.hadoop.local | hdw1.hadoop.local | |
5 | 3 | p | p | s | u | 40001 | hdw2.hadoop.local | hdw2.hadoop.local | |
7 | 5 | p | p | s | u | 40001 | hdw3.hadoop.local | hdw3.hadoop.local | |
8 | -1 | m | m | s | u | 5432 | hdm2.hadoop.local | hdm2.hadoop.local | |
(8 rows)

gpadmin=# select * from pg_catalog.gp_master_mirroring ;
summary_state | detail_state | log_time | error_message
---------------+--------------+------------------------+---------------
Synchronized | | 2014-07-04 00:08:38-07 |
(1 row)

[gpadmin@hdm1 ~]$ gpstate -f
20140704:00:10:26:705333 gpstate:hdm1:gpadmin-[INFO]:-Starting gpstate with args: -f
20140704:00:10:27:705333 gpstate:hdm1:gpadmin-[INFO]:-local HAWQ Version: 'postgres (HAWQ) 4.2.0 build 1'
20140704:00:10:28:705333 gpstate:hdm1:gpadmin-[INFO]:-master HAWQ Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build 8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr 23 2014 16:12:32'
20140704:00:10:28:705333 gpstate:hdm1:gpadmin-[INFO]:-Obtaining Segment details from master...
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:-Standby master details
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:-----------------------
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:- Standby address = hdm2.hadoop.local
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:- Standby data directory = /data1/master/gpseg-1
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:- Standby port = 5432
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:- Standby PID = 106500
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:- Standby status = Standby host passive
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--------------------------------------------------------------
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--gp_master_mirroring table
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--------------------------------------------------------------
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--Summary state: Synchronized
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--Detail state:
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--Log time: 2014-07-04 00:08:38-07
20140704:00:10:31:705333 gpstate:hdm1:gpadmin-[INFO]:--------------------------------------------------------------

4. Test failover and make current active master failover to standby

Stop the master instance to simulate a failure at master side.

[gpadmin@hdm1 ~]$ gpstop -m
20140704:00:11:53:705417 gpstop:hdm1:gpadmin-[INFO]:-Starting gpstop with args: -m
20140704:00:11:53:705417 gpstop:hdm1:gpadmin-[INFO]:-Gathering information and validating the environment...
20140704:00:11:54:705417 gpstop:hdm1:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20140704:00:11:54:705417 gpstop:hdm1:gpadmin-[INFO]:-Obtaining Segment details from master...
20140704:00:11:55:705417 gpstop:hdm1:gpadmin-[INFO]:-Greenplum Version: 'postgres (HAWQ) 4.2.0 build 1'
20140704:00:11:55:705417 gpstop:hdm1:gpadmin-[INFO]:-There are 0 connections to the database
20140704:00:11:55:705417 gpstop:hdm1:gpadmin-[INFO]:-Commencing Master instance shutdown with mode='smart'
20140704:00:11:55:705417 gpstop:hdm1:gpadmin-[INFO]:-Master host=hdm1.hadoop.local
20140704:00:11:55:705417 gpstop:hdm1:gpadmin-[INFO]:-Commencing Master instance shutdown with mode=smart
20140704:00:11:55:705417 gpstop:hdm1:gpadmin-[INFO]:-Master segment instance directory=/data1/master/gpseg-1

Activate the standby master (Optional, using the flag -c <hostname> while activating, you could initialize a new standby HDB master).

[gpadmin@hdm2 ~]$ source /usr/local/hawq/greenplum_path.sh
[gpadmin@hdm2 ~]$ export MASTER_DATA_DIRECTORY=/data1/master/gpseg-1
[gpadmin@hdm2 ~]$ gpactivatestandby -d /data1/master/gpseg-1
220140704:00:13:40:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Examining log file /data1/master/gpseg-1/pg_log/gpdb-2014-07-04_000823.csv for warnings and errors...
20140704:00:13:41:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-list index out of range
20140704:00:13:41:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:------------------------------------------------------
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Master data directory = /data1/master/gpseg-1
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-gpsyncmaster running = yes
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Last log entry time = 2014-07-04 00:11:56.124739 PDT
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:- 0:01:46.504265 ago
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Create new standby master = no
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Force standby activation = no
20140704:00:13:43:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:------------------------------------------------------
Do you want to continue with standby master activation? Yy|Nn (default=N):
> y
20140704:00:13:45:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Stopping gpsync process...
20140704:00:13:48:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Successfully shutdown sync process
20140704:00:13:55:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Starting standby master database in utility mode...
20140704:00:14:00:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Reading current configuration...
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Updating catalog...
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Database catalog updated successful
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-/data1/master/gpseg-1/gp_dbid - write_gp_dbid
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-found existing file
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-removed existing file
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-opening new file
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-wrote dbid: 1
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-setting read only
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-verifying file
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Updating filespace flat files
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Updating filespace flat files
20140704:00:14:08:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Stopping database...
20140704:00:14:21:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Starting database in production mode...
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:------------------------------------------------------
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-The activation of the standby master has completed successfully.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-hdm2.hadoop.local is now the new primary master.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-You will need to update your user access mechanism to reflect
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-the change of master hostname.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Do not re-start the failed master while the fail-over master is
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-operational, this could result in database corruption!
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-MASTER_DATA_DIRECTORY is now /data1/master/gpseg-1 if
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-this has changed as a result of the standby master activation, remember
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-to change this in any startup scripts etc, that may be configured
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-to set this value.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-New standby master not initialized!
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-MASTER_PORT is now 5432, if this has changed, you
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-may need to make additional configuration changes to allow access
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-to the Greenplum instance.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Refer to the Administrator Guide for instructions on how to re-activate
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-the master to its previous state once it becomes available.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-Query planner statistics must be updated on all databases
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-following standby master activation.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:-When convenient, run ANALYZE against all user databases.
20140704:00:14:39:106592 gpactivatestandby:hdm2.hadoop.local:gpadmin-[INFO]:------------------------------------------------------

Verify if the standby is now acting as the active master

From the output we can see dbid 1 on host hdm2.hadoop.local is the current master instance because the content id is "-1" and role is set to "p"

[gpadmin@hdm2 ~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from pg_catalog.gp_segment_configuration ;
dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+-------------------+-------------------+------------------+------------
2 | 0 | p | p | s | u | 40000 | hdw1.hadoop.local | hdw1.hadoop.local | |
4 | 2 | p | p | s | u | 40000 | hdw2.hadoop.local | hdw2.hadoop.local | |
6 | 4 | p | p | s | u | 40000 | hdw3.hadoop.local | hdw3.hadoop.local | |
3 | 1 | p | p | s | u | 40001 | hdw1.hadoop.local | hdw1.hadoop.local | |
5 | 3 | p | p | s | u | 40001 | hdw2.hadoop.local | hdw2.hadoop.local | |
7 | 5 | p | p | s | u | 40001 | hdw3.hadoop.local | hdw3.hadoop.local | |
1 | -1 | p | p | s | u | 5432 | hdm2.hadoop.local | hdm2.hadoop.local | |
(7 rows)

5. Test failback

Similar to failover, just ensure that you remove or rename directories in $MASTER_DATA_DIRECTORY (/data1/master/gpseg-1 & /data1/master/dfs/gpseg-1 in this example) so that gpactivatestandby does not fail due to existing directories.

6. Add back a standby master

Take action in step 5 on new standby master first, then run "gpinitstandby -s <hostname>" again to add back a standby master.

 

 NOTE: if there is already a standby master but in a state of "out-of-synchronizaiton", "gpinitstandby -n" could be used to resynchronize standby master with master

Comments

  • Avatar
    Josiah Berkebile

    This article seems inconsistent with HAWQ version 2.0.0.0 build 22126. The configuration tables have a different structure (different columns and number of columns), and some utilities like gpstate and gpstop aren't available in my environment.

Powered by Zendesk