Pivotal Knowledge Base

Ambari Alert - xx HAWQ Segments are not Registered with HAWQ Master

Environment

Product       Version
Pivotal HDB   2.x
OS            RHEL 6.x
Others

Symptom

On an HDB 2.0 cluster with 9 segment hosts, the "HAWQ Segment Registration" alert appeared on Ambari web (as shown in the screenshot below).

Error Message:


ERROR 2016-10-28 10:36:23,892 alert_segment_registration_status.py:80 -  [Alert HAWQ] Segments Unregistered: ['baybigupcpn04.bayad.co.th', 'BAYBIGUPCPN04.bayad.co.th', 'baybigupcpn02.bayad.co.th', 'BAYBIGUPCPN06.bayad.co.th', 'baybigupcpn06.bayad.co.th', 'BAYBIGUPCPN03.bayad.co.th', 'baybigupcpn05.bayad.co.th', 'BAYBIGUPCPN05.bayad.co.th', 'BAYBIGUPCPN07.bayad.co.th', 'baybigupcpn09.bayad.co.th', 'BAYBIGUPCPN02.bayad.co.th', 'baybigupcpn03.bayad.co.th', 'baybigupcpn08.bayad.co.th', 'baybigupcpn01.bayad.co.th', 'baybigupcpn07.bayad.co.th', 'BAYBIGUPCPN08.bayad.co.th', 'BAYBIGUPCPN09.bayad.co.th', 'BAYBIGUPCPN01.bayad.co.th'] are unregistered/down.

Cause

The hostnames of the HAWQ segments stored in the HAWQ system table did not match those stored in the Ambari database.

RCA

The Ambari agent on the HAWQ master host periodically checks the segment registration status by comparing the hostnames retrieved from the following two sources.

1. A query against the gp_segment_configuration system table of the HAWQ cluster:

postgres=# SELECT hostname FROM gp_segment_configuration where role = 'p' and status = 'u';
hostname
---------------------------
BAYBIGUPCPN01.bayad.co.th
BAYBIGUPCPN08.bayad.co.th
BAYBIGUPCPN06.bayad.co.th
BAYBIGUPCPN02.bayad.co.th
BAYBIGUPCPN03.bayad.co.th
BAYBIGUPCPN09.bayad.co.th
BAYBIGUPCPN05.bayad.co.th
BAYBIGUPCPN04.bayad.co.th
BAYBIGUPCPN07.bayad.co.th
(9 rows)

2. The segment list stored in the Ambari database, which is written to the file <HAWQ_INSTALLATION_DIR>/etc/slaves:

[root@BAYBIGUPADM01 ~]# cat /usr/local/hawq/etc/slaves
baybigupcpn01.bayad.co.th
baybigupcpn02.bayad.co.th
baybigupcpn03.bayad.co.th
baybigupcpn04.bayad.co.th
baybigupcpn05.bayad.co.th
baybigupcpn06.bayad.co.th
baybigupcpn07.bayad.co.th
baybigupcpn08.bayad.co.th
baybigupcpn09.bayad.co.th

In this case, the system table returned the hostnames in upper case, while the slaves file held them in lower case. Because the comparison is case-sensitive, none of the hostnames matched: all 9 names from each source were treated as unregistered, and Ambari web therefore showed "18 HAWQ Segments are not registered ...".
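The count of 18 can be reproduced with a short Python sketch. It assumes the alert uses a plain set symmetric difference, as the original line in the diff under Resolution suggests; the variable names are borrowed from that diff, and the hostname lists are rebuilt here from the output shown above.

```python
# Hostnames as returned by the two sources in this article (9 each).
# The lists are rebuilt here for illustration only.
hawq_segment_list = ["BAYBIGUPCPN%02d.bayad.co.th" % i for i in range(1, 10)]
ambari_segment_list = ["baybigupcpn%02d.bayad.co.th" % i for i in range(1, 10)]

# Case-sensitive symmetric difference: names present in only one of the sets.
segment_diff = set(hawq_segment_list) ^ set(ambari_segment_list)

# No name matches exactly, so all 9 + 9 names end up in the difference.
print(len(segment_diff))  # prints 18
```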

Resolution

Because the hostnames in this example differ only in letter case, the alert arguably should not be raised. An internal JIRA, AMBR-519, has been submitted so that engineering can investigate further and fix the behavior if it is a real defect.

Until a permanent fix is available, a temporary workaround is to update either the slaves file or the system table so that the HAWQ segment hostnames are exactly identical in the two data sources.

However, this workaround does not survive a restart of the HAWQ cluster, which rolls the change back. A more durable solution is to rename all segment hosts so that their hostnames match those in the slaves file. Each host must be restarted for the change to take effect.
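Before renaming hosts or editing either data source, it may help to confirm that the mismatch really is a case-only difference and not a segment that is genuinely missing. The following is a minimal sketch, not part of Ambari or HAWQ: `read_slaves` and `classify_hostnames` are hypothetical helper names, and the HAWQ hostnames are assumed to have been collected separately, for example from the query shown in the RCA section.

```python
def read_slaves(path):
    """Return the non-empty, stripped lines of a slaves file (hypothetical helper)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def classify_hostnames(hawq_hosts, slaves_hosts):
    """Split mismatches into case-only differences and real differences.

    Returns (case_only, real_diff):
      case_only - names that fail an exact match but match when lowercased
      real_diff - lowercased names present in only one of the two sources
    """
    exact_diff = set(hawq_hosts) ^ set(slaves_hosts)
    hawq_lower = {h.lower() for h in hawq_hosts}
    slaves_lower = {h.lower() for h in slaves_hosts}
    real_diff = hawq_lower ^ slaves_lower
    case_only = sorted(h for h in exact_diff if h.lower() not in real_diff)
    return case_only, sorted(real_diff)


if __name__ == "__main__":
    # Illustrative input only; in practice pass read_slaves("<HAWQ_INSTALLATION_DIR>/etc/slaves").
    hawq = ["BAYBIGUPCPN01.bayad.co.th", "BAYBIGUPCPN02.bayad.co.th"]
    slaves = ["baybigupcpn01.bayad.co.th", "baybigupcpn02.bayad.co.th"]
    case_only, real_diff = classify_hostnames(hawq, slaves)
    print("case-only mismatches:", case_only)
    print("real mismatches:", real_diff)
```

If `real_diff` is empty and `case_only` is not, the alert is caused purely by letter case, which is the situation described in this article.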

Another workaround is to temporarily modify one of the Ambari alert scripts, as shown below.

1. On the Ambari server host, go to the directory /var/lib/ambari-server/resources/common-services/HAWQ/2.0.0/package/alerts and make a backup copy of the file alert_segment_registration_status.py (for example, alert_segment_registration_status.py.orig).

2. Edit alert_segment_registration_status.py as shown in the following diff output:

[root@admin alerts]# diff alert_segment_registration_status.py.orig alert_segment_registration_status.py
71a72,75
>     #converted to lowercase first prior to comparison
>     ambari_segment_list_low = [x.lower() for x in ambari_segment_list]
>     hawq_segment_list_low = [y.lower() for y in hawq_segment_list]
> 
73c77,78
<     segment_diff = (set(hawq_segment_list) ^ set(ambari_segment_list))
---
>     #segment_diff = (set(hawq_segment_list) ^ set(ambari_segment_list))
>     segment_diff = (set(hawq_segment_list_low) ^ set(ambari_segment_list_low))

3. Restart the Ambari server by running ambari-server restart.
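The effect of the patched comparison can be checked outside Ambari with a small standalone sketch. The function name `unregistered_segments` is hypothetical and only wraps the three lines added by the diff above; it is not the actual structure of the alert script.

```python
def unregistered_segments(hawq_segment_list, ambari_segment_list):
    """Case-insensitive version of the segment comparison (illustrative wrapper)."""
    # Convert both lists to lowercase before comparing, as in the diff above.
    ambari_segment_list_low = [x.lower() for x in ambari_segment_list]
    hawq_segment_list_low = [y.lower() for y in hawq_segment_list]
    return set(hawq_segment_list_low) ^ set(ambari_segment_list_low)


# Hostnames from this article: same hosts, different letter case.
hawq_hosts = ["BAYBIGUPCPN%02d.bayad.co.th" % i for i in range(1, 10)]
ambari_hosts = ["baybigupcpn%02d.bayad.co.th" % i for i in range(1, 10)]

# With the lowercase conversion, the difference is empty and no alert is warranted.
print(unregistered_segments(hawq_hosts, ambari_hosts))  # prints set()

# A segment that is really missing from one source is still reported.
print(unregistered_segments(hawq_hosts[:8], ambari_hosts))
```

Note that a segment that is actually down or missing still appears in the difference, so the lowercase conversion only suppresses the false positives caused by letter case.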
