Troubleshooting Pivotal HDB offline segments

Environment 

Product                                     Version
Pivotal Hadoop Database (HDB)               2.x
Pivotal Hortonworks Data Platform (HDP)     2.3 / 2.4
Ambari                                      2.x

This article only applies to Pivotal HDB 2.x. For HAWQ 1.0, see Pivotal HAWQ: segment failures and recovery scenarios.

Overview

This article helps troubleshoot HAWQ segments that are reported as failed, offline, or dead.

Note: If maintenance needs to be performed that requires the removal or shutdown of a Pivotal HDB segment host, please see the relevant maintenance article for that procedure.

Symptom

One or more segments are showing as failed.

If all segments are down, queries may fail with the following: 

gpadmin=# select * from test;
ERROR: failed to acquire resource from resource manager, 4 of 4 segments are unavailable,
exceeds 25.0% defined in GUC hawq_rm_rejectrequest_nseg_limit.
The allocation request is rejected. (pquery.c:804)
gpadmin=#
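The "25.0%" threshold in this message is controlled by the hawq_rm_rejectrequest_nseg_limit server configuration parameter. As a quick check, its current value can be displayed from the HAWQ master; a minimal example (the gpadmin database name is taken from the session above and is illustrative):

# Show the rejection threshold GUC referenced in the error message.
psql -d gpadmin -c "SHOW hawq_rm_rejectrequest_nseg_limit;"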

Troubleshooting steps:

1. Log on to Ambari and review the state of the segments under HAWQ > Summary:

2. Review the gp_segment_configuration table to determine the state of segments. Based on the output, use the Pivotal HDB documents to help troubleshoot the segments.

gpadmin=# select * from gp_segment_configuration;
 registration_order | role | status | port  |    hostname    |    address    |               description
--------------------+------+--------+-------+----------------+---------------+------------------------------------------
                 -1 | s    | u      |  5432 | hdm2.hdp.local | 172.28.21.69  |
                  0 | m    | u      |  5432 | hdm1           | hdm1          |
                  3 | p    | u      | 40000 | hdw2.hdp.local | 172.28.21.116 |
                  1 | p    | u      | 40000 | hdw3.hdp.local | 172.28.21.117 |
                  2 | p    | d      | 40000 | hdw1.hdp.local | 172.28.21.114 | heartbeat timeout;no global node report
(5 rows)
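To list only the segments that the master currently considers down, the catalog can be filtered on the status column ('u' = up, 'd' = down). A minimal example, run from the HAWQ master (the gpadmin database name is illustrative):

# Show only segments the master has marked as down ('d'), including the failure description.
psql -d gpadmin -c "SELECT registration_order, hostname, status, description FROM gp_segment_configuration WHERE status = 'd';"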

3. Log on to the HAWQ master and run "hawq state" to get an overview of the HAWQ system:

[gpadmin@hdm1 ~]$ hawq state
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:--HAWQ instance status summary
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Master instance = Active
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Master standby = hdm2.hdp.local
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Segment Status
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Total segments count from catalog = 3
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment valid (at master) = 3
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment failures (at master) = 0
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 1
20160720:05:10:01:083387 hawq_state:hdm1:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 2

4. Based on the above output, there can be a number of scenarios: 

 

1. Postmaster process is not running and there are failures at master:

[gpadmin@hdm1 ~]$ hawq state
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:--HAWQ instance status summary
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Master instance = Active
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Master standby = hdm2.hdp.local
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Segment Status
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Total segments count from catalog = 3
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment valid (at master) = 2
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment failures (at master) = 1
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 1
20160720:05:13:15:084071 hawq_state:hdm1:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 2
[gpadmin@hdm1 ~]$

Cause:

The postmaster process has failed on the segment, so the issue is likely segment-related rather than YARN-related.

Resolution:

  1. Make sure the host is online and reachable via Secure Shell (SSH).
  2. Review the segment logs and system messages files to understand why the segment process stopped.
  3. If the segment failed because of local file system corruption, review How to recover Pivotal HDB local file system segment files.
  4. If the segment failed for any other reason, correct the root cause of the issue.
  5. Start the segment via Ambari or with "hawq start segment" (a sketch of these steps follows this list).
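A minimal sketch of these steps for the failed segment from the earlier example (hdw1.hdp.local); the segment log path below is an assumption and should be adjusted to the actual data directory of the installation:

# 1. Confirm the segment host is online and reachable over SSH.
ssh hdw1.hdp.local hostname

# 2. On the segment host, review the most recent segment log and the system messages file
#    (the segment data directory path is illustrative).
tail -n 100 /data/hawq/segment/pg_log/hawq-*.csv
sudo tail -n 100 /var/log/messages

# 5. After correcting the root cause, start the segment locally on that host.
hawq start segment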

2. Postmaster processes are running but segments show failures at master:

[gpadmin@hdm1 ~]$ hawq state
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:--HAWQ instance status summary
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Master instance = Active
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Master standby = hdm2.hdp.local
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Segment Status
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:------------------------------------------------------
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Total segments count from catalog = 3
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment valid (at master) = 0
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Total segment failures (at master) = 3
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 0
20160720:05:38:42:089436 hawq_state:hdm1:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 3
[gpadmin@hdm1 ~]$

Cause:

Ambari shows all segments as up because the postmaster processes are still running, but "hawq state" reports the segments as failed because resources could not be assigned to them via YARN. This is likely caused by an issue with resource management.

Resolution:

This is likely an issue requesting resources from the YARN Resource Manager. Review the master log for messages similar to the following:

2016-07-20 05:45:28.409447, p61779, th140332199721088, ERROR Failed to setup RPC connection to "hdm1.hdp.local:8032" caused by:
TcpSocket.cpp: 283: YarnNetworkConnectException: Connect to "hdm1.hdp.local:8032" failed: (errno: 111)

The following checks may help resolve the issue or diagnose it further:

  • Is the YARN Resource Manager started? 
  • Are all the Node Managers started?
  • Are there enough resources available on the cluster?
  • Are there applications queuing up in the YARN Resource Manager?
  • Is the HAWQ application running? This can be checked from the YARN application list (see the sketch below).
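Several of these checks can be made with the standard YARN command-line client; for example, from any host with the YARN client installed:

# List YARN applications; the HAWQ application should appear in the RUNNING state.
yarn application -list

# List Node Managers and their state; all expected nodes should be reported as RUNNING.
yarn node -list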

Once the resource management issue has been resolved, segments should come back online automatically.

3. Negative number in "Total segment failures (at master)":

[gpadmin@hawq20dn2 ~]$ hawq state
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:--HAWQ instance status summary
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master instance = Active
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master standby = hawq20dn1.lab
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Segment Status
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segments count from catalog = 4
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment valid (at master) = 4
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment failures (at master) = -1
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 0
20160719:17:36:54:480647 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 3
[gpadmin@hawq20dn2 ~]$

Cause: 

During a HAWQ segment addition or removal, some configuration files were not updated correctly.

Resolution:

  1. Review /usr/local/hawq/etc/slaves on the master and standby master and confirm that only active segment hosts are listed (see the sketch after this list).
  2. If the issue persists, restart HAWQ via Ambari. Ambari will correct any incorrect configurations during the restart.
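A minimal check of the slaves file on both hosts, assuming the default installation path shown above (the standby host name is illustrative; substitute the actual standby master):

# Segment hosts known to the master.
cat /usr/local/hawq/etc/slaves

# Compare with the copy on the standby master.
ssh hdm2.hdp.local cat /usr/local/hawq/etc/slaves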
