Pivotal Knowledge Base

How to replace a Pivotal HDB host via Ambari in case of hardware issues on the host

Environment

Product                                    Version
Pivotal HDP (Hortonworks Data Platform)    2.4
Ambari                                     2.2
Pivotal HDB                                2.0

Purpose

This article describes how to replace a Pivotal HDB segment host when a hardware or software issue leads to the complete loss of the host and all of its data and configurations.

Note: If only the Pivotal HDB segment directory is lost, do not follow this procedure. Instead, follow the article "How to recover Pivotal HDB local file system segment files".

Procedure

Important Notes:

  • Replace only one segment at a time to reduce the risk of data loss.
  • If HDFS data on the host was also lost, HDFS will need to be rebalanced after replacement to avoid performance issues.
  • After HDFS has been rebalanced, clear the HDFS metadata cache by running the following in a SQL session: select gp_metadata_cache_clear();
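The last two notes can be sketched as a small shell helper. This is only an illustration under stated assumptions: the balancer threshold of 10 percent, the database name template1, and the helper names run and post_replace_hdfs_tasks are not from this article, and in many clusters the two commands are run by different OS users (an HDFS superuser and gpadmin). By default the helper only prints the commands it would run.

```shell
# Sketch: rebalance HDFS, then clear the HAWQ HDFS metadata cache.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
BALANCER_THRESHOLD="${BALANCER_THRESHOLD:-10}"   # assumption: percent threshold for hdfs balancer
HAWQ_DB="${HAWQ_DB:-template1}"                  # assumption: database used for the psql session

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'DRY-RUN:'
        printf ' %s' "$@"
        printf '\n'
    else
        "$@"
    fi
}

# Hypothetical helper: rebalance first, clear the cache only if that succeeded.
post_replace_hdfs_tasks() {
    run hdfs balancer -threshold "$BALANCER_THRESHOLD" || return 1
    run psql -d "$HAWQ_DB" -c 'select gp_metadata_cache_clear();'
}
```

Calling post_replace_hdfs_tasks with DRY_RUN=1 lists the two commands in order; setting DRY_RUN=0 would execute them, which should only be done once the replacement host is back in service.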

1. Log on to Ambari. 

2. If the host is still accessible in Ambari, go to Hosts, click the affected host, and select Host Actions > Stop All Components.

3. Connect to the host via Secure Shell (SSH) and stop ambari-agent:

[root@hawq20dn3 ~]# ambari-agent stop
Verifying Python version compatibility...
Using python /usr/bin/python
Found ambari-agent PID: 2647
Stopping ambari-agent
Removing PID file at /var/run/ambari-agent/ambari-agent.pid
ambari-agent successfully stopped

4. Delete the host via Ambari: go to Hosts, click the affected host, and select Host Actions > Delete Host. In some cases the Ambari REST API may need to be used instead.

  Note: Ambari may hang for up to ten minutes after you click OK in the confirmation pop-up.
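For the case where the API has to be used, a minimal sketch with curl follows. The Ambari URL, cluster name, and user shown are placeholders, and the helper names are hypothetical. The request uses the Ambari REST hosts resource with the X-Requested-By header; Ambari may reject the call if components on the host have not been stopped or removed first.

```shell
# Sketch: delete a host from an Ambari-managed cluster through the REST API.
AMBARI_URL="${AMBARI_URL:-http://ambari.example.com:8080}"   # placeholder server URL
AMBARI_CLUSTER="${AMBARI_CLUSTER:-mycluster}"                # placeholder cluster name
AMBARI_USER="${AMBARI_USER:-admin}"                          # placeholder Ambari user

# Hypothetical helper: build the host resource URL.
# $1 = host name exactly as registered in Ambari
ambari_host_url() {
    printf '%s/api/v1/clusters/%s/hosts/%s\n' "$AMBARI_URL" "$AMBARI_CLUSTER" "$1"
}

# Hypothetical helper: issue the DELETE request.
# curl prompts for the password because only the user name is passed to -u.
ambari_delete_host() {
    curl -sS -u "$AMBARI_USER" -H 'X-Requested-By: ambari' \
        -X DELETE "$(ambari_host_url "$1")"
}
```

For example, ambari_delete_host hawq20dn3.lab would target the failed host used in this article.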

 

5. At this point, the hawq state command may report a segment failure:

[gpadmin@hawq20dn2 ~]$ hawq state
ssh: connect to host hawq20dn3.lab port 22: No route to host
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:--HAWQ instance status summary
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master instance = Active
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master standby = hawq20dn1.lab
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Segment Status
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segments count from catalog = 4
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment valid (at master) = 3
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment failures (at master) = 1
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 1
20160715:10:15:09:442319 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 3
[gpadmin@hawq20dn2 ~]$
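When the check needs to be scripted, the failure count can be pulled out of the summary. The sketch below assumes the output format shown above; the function name hawq_state_value is hypothetical and not part of HAWQ.

```shell
# Sketch: extract one value from "hawq state" output read on stdin.
# $1 = label text as it appears before " = " in the summary
hawq_state_value() {
    awk -v label="$1" '
        # Take the first line that contains "<label> = " and print what follows.
        !found && index($0, label " = ") {
            value = $0
            sub(/.*= /, "", value)        # drop everything up to the last "= "
            sub(/[ \t\r]+$/, "", value)   # trim trailing whitespace
            print value
            found = 1
        }'
}
```

Usage: hawq state 2>&1 | hawq_state_value 'Total segment failures (at master)'. A result other than 0 indicates that the failed segment still has to be removed.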

6. If a segment failure is reported as shown above, follow the Pivotal HDB documentation to remove the failed HAWQ segment from the HAWQ configuration.

7. Restart HAWQ via Ambari. The hawq state output should no longer report any failures:

[gpadmin@hawq20dn2 ~]$ hawq state
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:--HAWQ instance status summary
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master instance = Active
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master standby = hawq20dn1.lab
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Segment Status
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segments count from catalog = 3
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment valid (at master) = 3
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment failures (at master) = 0
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 0
20160715:10:25:10:444704 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 3
[gpadmin@hawq20dn2 ~]$ 

8. Add the replacement host by following the documented steps to expand the Pivotal HDB cluster via Ambari.

9. Restart HAWQ via Ambari for the new node to be recognised correctly.

10. Confirm that "hawq state" and Ambari show the same number of segments:

[gpadmin@hawq20dn2 ~]$ hawq state
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:--HAWQ instance status summary
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master instance = Active
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master standby = hawq20dn1.lab
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment instance count from config file = 4
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Segment Status
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segments count from catalog = 4
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment valid (at master) = 4
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment failures (at master) = 0
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 0
20160715:11:00:05:451972 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 4
[gpadmin@hawq20dn2 ~]$
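The hawq state side of this confirmation can be sketched as a script. It assumes the output format shown above and uses a hypothetical function name; the segment count displayed in Ambari still has to be compared manually.

```shell
# Sketch: succeed only if "hawq state" output (on stdin) is internally consistent:
# config-file count == catalog count == valid count, and no failures.
hawq_state_consistent() {
    awk '
        / = / {
            value = $0
            sub(/.*= /, "", value)        # keep the text after the last "= "
            sub(/[ \t\r]+$/, "", value)   # trim trailing whitespace
        }
        /Total segment instance count from config file = / { config = value }
        /Total segments count from catalog = /             { catalog = value }
        /Total segment valid \(at master\) = /             { valid = value }
        /Total segment failures \(at master\) = /          { failures = value }
        END {
            ok = (config != "" && config == catalog && catalog == valid && failures == "0")
            printf "config=%s catalog=%s valid=%s failures=%s\n", config, catalog, valid, failures
            exit (ok ? 0 : 1)
        }'
}
```

Usage: hawq state 2>&1 | hawq_state_consistent && echo "segment counts match". The function prints the four values it found, so a mismatch is visible at a glance.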

 

 
