Pivotal Knowledge Base

HAWQ Services Working Correctly But HAWQ Start Hangs on Segment

Environment

 Product       Version
 Pivotal HDB   2.x
 Pivotal HDP   2.4 / 2.5

Symptom

1. HAWQ restart via Ambari hangs on one or more segments.

2. HAWQ comes up and is usable, but one segment still appears to be restarting when the cluster is started with "hawq start cluster" or from Ambari.

Error Message:

TEST [gpadmin@dn1 hawqAdminLogs]$ cat hawq_start_20161209.log
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-Prepare to do 'hawq start'
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-You can find log in:
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_start_20161209.log
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-GPHOME is set to:
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-/usr/local/hawq/.
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[DEBUG]:-Current user is 'gpadmin'
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[DEBUG]:-Parsing config file:
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[DEBUG]:-/usr/local/hawq/./etc/hawq-site.xml
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-Start hawq with args: ['start', 'segment']
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-Gathering information and validating the environment...
20161209:12:11:59:263498 hawq_start:dn1:gpadmin-[INFO]:-Start segment service
waiting for server to start...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................could not start server
20161209:12:22:04:263498 hawq_start:dn1:gpadmin-[ERROR]:-Segment start failed, exit
TEST [gpadmin@dn1 hawqAdminLogs]$ 

Cause

During the restart, pg_ctl repeatedly attempts to connect to the restarting database to check whether startup has completed. The command used to check the connection is similar to this:

PGOPTIONS='-c gp_session_role=utility' psql  -p 40000 -d template1 -c 'select 1;'

If the above fails to connect because of a problem with pg_hba.conf or another configuration issue, the restart times out and the start is reported as failed.
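The polling behaviour described above can be sketched in shell as follows. This is only an illustration of the retry-until-timeout pattern, not the actual pg_ctl implementation: the function name wait_for_check, the attempt count, the sleep interval, and the "server started" message are all assumptions made for the example.

```shell
# Illustrative sketch only: NOT the actual pg_ctl implementation.
# wait_for_check is a hypothetical helper that retries a command until it
# succeeds or the attempt budget runs out.
#
# usage: wait_for_check MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_for_check() {
    max_attempts=$1
    sleep_seconds=$2
    shift 2

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the connection check; discard its output, keep its exit status.
        if "$@" >/dev/null 2>&1; then
            echo "server started"
            return 0
        fi
        # One dot per failed attempt, like the progress dots in the log.
        printf '.'
        sleep "$sleep_seconds"
        attempt=$((attempt + 1))
    done

    # Every attempt failed: report the start as failed.
    echo "could not start server"
    return 1
}
```

With the check from this article, a call might look like: wait_for_check 600 1 env PGOPTIONS='-c gp_session_role=utility' psql -p 40000 -d template1 -c 'select 1;'. The values 600 and 1 are assumed, chosen to approximate the roughly ten minutes of waiting visible in the log above. The point of the sketch is that a segment can be running while every check still fails, which produces the "could not start server" result.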

Resolution

1. On the affected host, determine the port used by the segment. In the case below it is 40000:

[root@mn1 /]# ps -eaf | grep silent
gpadmin 102614 1 7 18:55 ? 00:00:00 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/hawq/segment -i -M segment -p 40000 --silent-mode=true
root 102625 80707 0 18:56 pts/0 00:00:00 grep silent
[root@mn1 /]#

2. Try to connect to the segment using the following command:

PGOPTIONS='-c gp_session_role=utility' psql  -p 40000 -d template1 -c 'select 1;'

3. The command will likely fail with an error similar to this: 

TEST [gpadmin@vrtsthdp005 hawqAdminLogs]$ PGOPTIONS='-c gp_session_role=utility' psql  -p 40000 -d template1 -c 'select 1;'
psql: FATAL:  no pg_hba.conf entry for host "10.34.5.37", user "gpadmin", database "template1", SSL off

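4. In the above case, the cause was an incorrect PGHOST setting. psql uses PGHOST as the host to connect to, so the check was made over the network to the host named there instead of through the default local connection, and pg_hba.conf had no matching entry: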

TEST [gpadmin@dn2 ~]$ grep -r PGHOST ~/
/home/gpadmin/.hawq-profile.sh:export PGHOST=dn1.iggroup.local

5. Correct the value of PGHOST, then retry the restart to confirm that the segment now starts.
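Steps 1 and 4 above can be partly scripted. The following is a minimal sketch, assuming a POSIX shell with awk and grep available. The helper names segment_port and pghost_exports are hypothetical and are not part of HAWQ.

```shell
# Hypothetical helpers for illustration; not part of HAWQ.

# segment_port: print the value that follows "-p" in a postgres command
# line, such as the line shown by "ps -eaf" in step 1.
# Prints nothing if no "-p" option is present.
segment_port() {
    printf '%s\n' "$1" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "-p") { print $(i + 1); exit } }'
}

# pghost_exports: list every line in the given files that sets PGHOST,
# with or without a leading "export".
# The extra /dev/null argument makes grep print the file name even when
# only one file is given.
pghost_exports() {
    grep -E '^[[:space:]]*(export[[:space:]]+)?PGHOST=' "$@" /dev/null
}
```

For example, segment_port "$(ps -eaf | grep '[s]ilent-mode')" would print the segment port on the affected host, and pghost_exports ~/.hawq-profile.sh would show where PGHOST is set. Which files to scan is an assumption here; the grep -r in step 4 covers the whole home directory.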
