Pivotal Knowledge Base


"hawqstandby_sync_status/hawqsegments_registration_status" alerts triggered/cleared frequently on Ambari


Pivotal HDB 2.x (with secure, i.e. Kerberized, HDFS)


The hawqstandby_sync_status and hawqsegments_registration_status alerts are frequently triggered on the Ambari web console and are then cleared shortly afterwards.

Error Message:

Alert messages are frequently generated in ambari-alerts.log on the Ambari Server:

2017-08-01 20:50:20,886 [UNKNOWN] [HARD] [HAWQ] [hawqstandby_sync_status] (HAWQ Standby Master Sync Status) Sync status cannot be determined.
2017-08-01 21:03:21,888 [WARNING] [HARD] [HAWQ] [hawqsegments_registration_status] (HAWQ Segment Registration Status) 1 HAWQ Segment is not registered with HAWQ Master. Try restarting HAWQ service if a segment has been added/removed. Check the log file in /var/log/ambari-agent/ambari-alerts.log for more details on unregistered hosts.

The alerts are then cleared very quickly:

2017-08-01 20:51:22,888 [OK] [HARD] [HAWQ] [hawqstandby_sync_status] (HAWQ Standby Master Sync Status) HAWQSTANDBY is in sync with HAWQMASTER.
2017-08-01 21:04:21,888 [OK] [HARD] [HAWQ] [hawqsegments_registration_status] (HAWQ Segment Registration Status) All HAWQ Segments are registered.


Ambari connects to the HAWQ master node over ssh and runs the following commands to check the master/standby synchronization status and the segment registration state:

source /usr/local/hawq/greenplum_path.sh && psql -p 5432 -t --no-align -d postgres -c "SELECT summary_state, error_message FROM gp_master_mirroring;"
source /usr/local/hawq/greenplum_path.sh && psql -p 5432 -t -d postgres -c " SELECT lower(hostname) FROM gp_segment_configuration where role = 'p' and status = 'u' ;"

In the normal case, the commands return results like the following:

INFO 2017-08-01 19:13:20,134 logger.py:75 - call returned (0, 'Synchronized|')
INFO 2017-08-01 20:28:21,085 logger.py:75 - call returned (0, ' qa008.my.com\n qa006.my.com\n qa007.my.com\n qa005.my.com\n qa004.my.com')

In this case, however, the results of both commands also contained a failure message from the kinit command. This can be seen in ambari-agent.log on the HAWQ master:

INFO 2017-08-01 22:05:20,131 logger.py:75 - call returned (0, 'kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_30863) while getting initial credentials\nSynchronized|')
INFO 2017-08-01 22:11:22,710 logger.py:75 - call returned (0, 'kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_30863) while getting initial credentials\nqa008.my.com\n qa006.my.com\n qa007.my.com\n qa005.my.com\n qa004.my.com')

Because of the extra failure message in the command output, Ambari could not determine the HAWQ master/standby synchronization status or the segment registration state, so the alerts were triggered.
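
The effect of the extra line can be illustrated with a minimal sketch. This is not Ambari's actual alert code: classify_sync_output is a hypothetical helper that assumes a check which trusts only the first line of the command output, and the NOT_SYNCHRONIZED label is invented for illustration.

```shell
# Hypothetical helper (not Ambari code): classify the output of the
# gp_master_mirroring query by looking at its first line only.
classify_sync_output() {
    first_line=$(printf '%s\n' "$1" | head -n 1)
    case "$first_line" in
        "Synchronized|"*) echo "OK" ;;                # clean "<state>|<error>" row
        *"|"*)            echo "NOT_SYNCHRONIZED" ;;  # some other summary_state
        *)                echo "UNKNOWN" ;;           # not a query row at all
    esac
}

clean='Synchronized|'
# Same result, preceded by the kinit failure message from ambari-agent.log.
polluted=$(printf '%s\n%s' \
    'kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_30863) while getting initial credentials' \
    'Synchronized|')

classify_sync_output "$clean"      # prints OK
classify_sync_output "$polluted"   # prints UNKNOWN
```

Under that assumption the query itself succeeded in both cases; only the leading kinit message makes the status unreadable.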

When Ambari runs the commands again at the next check, they return a clean result, which clears the alerts.

The error "kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_30863)" indicates that kinit was run, and failed, for the user with UID 30863, which is the gpadmin user.
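
As a rough sketch, the owner of a credentials cache can be confirmed from its file name, since the default file cache is named /tmp/krb5cc_<uid>. The helper name ccache_uid is hypothetical, and the lookup only returns a name on a host where that UID exists.

```shell
# Hypothetical helper: extract the numeric UID from a /tmp/krb5cc_<uid> path.
ccache_uid() {
    basename "$1" | sed -n 's/^krb5cc_\([0-9][0-9]*\).*$/\1/p'
}

uid=$(ccache_uid /tmp/krb5cc_30863)
echo "$uid"    # prints 30863

# Map the UID to an account name; prints nothing if the UID is unknown here.
if command -v getent >/dev/null 2>&1; then
    getent passwd "$uid" | cut -d: -f1
fi
```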

By design, Ambari does not run kinit as gpadmin, so it does not store a credentials cache in /tmp/krb5cc_30863 itself.
In this specific case, a user had added a kinit command to .bash_profile, as shown below:

#Kerberos Authentication
if [ -f /etc/security/keytabs/hawq.service.keytab ]; then
    /usr/bin/kinit -kt /etc/security/keytabs/hawq.service.keytab postgres/zooqa002.my.com@MY.COM
fi

This means that each time Ambari logs in to the HAWQ master host as gpadmin over ssh to check the synchronization/registration status, the kinit command above is executed. When kinit is run this frequently, it sometimes fails with the "Failed to store credentials" error, and that message is then included in the output of the status check commands.
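
To check whether a host is affected, the login files can be searched for kinit calls. The following is a minimal sketch: find_kinit_lines is a hypothetical helper, and the list of startup files it inspects is an assumption that may need extending for a given environment.

```shell
# Hypothetical helper: list kinit calls in the shell startup files
# under the home directory given as $1 (file name, line number, line).
find_kinit_lines() {
    for f in "$1/.bash_profile" "$1/.bashrc" "$1/.profile"; do
        [ -f "$f" ] && grep -Hn 'kinit' "$f"
    done
    return 0
}

# Run as gpadmin on the HAWQ master; no output means no kinit call was found.
find_kinit_lines "$HOME"
```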


Remove or comment out the user-added lines that run kinit in .bash_profile.
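
One way to comment out the block is sketched below. It assumes the block starts with the "#Kerberos Authentication" marker line and ends with an fi at the start of a line; if either assumption does not hold, edit the file by hand instead. The whole if block is commented out, not just the kinit line, because an if with an empty body is a shell syntax error. The function name comment_out_kinit_block is hypothetical.

```shell
# Hypothetical helper: comment out the user-added kinit block in the
# profile file given as $1, keeping a backup copy at "$1.bak".
comment_out_kinit_block() {
    cp -p "$1" "$1.bak" &&
    # Prefix every line from the marker through the closing "fi" with "#".
    sed '/^#Kerberos Authentication/,/^fi/ s/^/#/' "$1.bak" > "$1"
}

# Example (run as gpadmin on the HAWQ master):
#   comment_out_kinit_block "$HOME/.bash_profile"
```

After the change, a new login as gpadmin should no longer run kinit, so the status check commands return only the query results.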


