Pivotal Knowledge Base

HDB Cluster Fails to Start with Error “Unable to connect to server”

Environment

 Product      Version
 Pivotal HDB  1.3.x, 2.x
 OS           RHEL 6.x

Symptom

When attempting to start an HDB cluster, startup fails with the error message “Unable to connect to server.”

Error Message:

20170210:04:25:27:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Starting gpstart with args:
20170210:04:25:27:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Gathering information and validating the environment...
20170210:04:25:28:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (HAWQ) 4.2.0 build 1'
20170210:04:25:28:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Greenplum Catalog Version: '201412220'
20170210:04:25:28:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Starting Master instance in admin mode
20170210:04:25:30:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20170210:04:25:30:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Obtaining Segment details from master...
20170210:04:25:31:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Setting new master era
20170210:04:25:31:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Master Started...
......
20170210:04:25:48:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait...
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Process results...
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[WARNING]:-No segment started for content: 38.
......
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-DBID:40 FAILED host:'sdw7.gphd.local' datadir:'/data/hawq/segments/3/gpseg38' with reason:'Unable to connect to server'
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-----------------------------------------------------
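On a cluster with many segments, the failed ones can be hard to spot in the log. The following is a minimal sketch, not part of the product, that pulls them out of gpstart output. It assumes the "DBID:... FAILED host:'...' datadir:'...' with reason:'...'" layout shown above; the function name list_failed_segments is hypothetical.

```shell
# list_failed_segments: read gpstart log text on stdin and print one line per
# failed segment in the form: DBID host datadir reason
# Assumes the FAILED line layout shown in the log excerpt above.
list_failed_segments() {
    sed -n "s/.*DBID:\([0-9][0-9]*\) FAILED host:'\([^']*\)' datadir:'\([^']*\)' with reason:'\([^']*\)'.*/\1 \2 \3 \4/p"
}
```

For example, `list_failed_segments < gpstart.log`, where gpstart.log stands for wherever the gpstart output was saved.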

Cause

Port 40002, which is reserved for segment gpseg38, was already in use by another process.
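The article does not say why another process ended up on port 40002. One plausible explanation, offered here as an assumption, is that the port lies inside the kernel's ephemeral (client-side) port range, so it can be assigned to any outbound connection. The netstat output in the RCA section is consistent with this, since 40002 appears there as the local end of a connection to a remote port. The sketch below checks whether a port falls in that range; the function name in_ephemeral_range is hypothetical, and the range should be verified on the affected host.

```shell
# in_ephemeral_range PORT [LOW HIGH]
# Exits 0 if PORT lies inside the ephemeral port range, non-zero otherwise.
# When LOW and HIGH are omitted, the range is read from the Linux proc file.
in_ephemeral_range() {
    port=$1
    if [ $# -ge 3 ]; then
        low=$2
        high=$3
    else
        read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    fi
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}
```

Usage on a segment host: `in_ephemeral_range 40002 && echo "40002 may be assigned to client connections"`.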

RCA

Running gpstart in verbose mode revealed more details about the error:

20170210:04:40:22:090313 gpsegstart.py_sdw7:gpadmin:sdw7:gpadmin-[INFO]:-Reviewing /data/hawq/segments/3/gpseg38
20170210:04:40:22:090313 gpsegstart.py_sdw7:gpadmin:sdw7:gpadmin-[WARNING]:-Error getting data stdout:"" stderr:"failed to connect: Connection refused (errno: 111) Retrying no 1 failed to connect: Connection refused (errno: 111) Retrying no 2 failed to connect: Connection refused (errno: 111) Retrying no 3 failed to connect: Connection refused (errno: 111) Retrying no 4 failed to connect: Connection refused (errno: 111) Retrying no 5 failed to connect: Connection refused (errno: 111) Retrying no 6 failed to connect: Connection refused (errno: 111) Retrying no 7 failed to connect: Connection refused (errno: 111) Retrying no 8 failed to connect: Connection refused (errno: 111) Retrying no 9 failed to connect: Connection refused (errno: 111) Retrying no 10 failed to connect: Connection refused (errno: 111) Retrying no 11 failed to connect: Connection refused (errno: 111) Retrying no 12 failed to connect: Connection refused (errno: 111) Retrying no 13 failed to connect: Connection refused (errno: 111) Retrying no 14 failed to connect: Connection refused (errno: 111) Retrying no 15 failed to connect: Connection refused (errno: 111) Retrying no 16 failed to connect: Connection refused (errno: 111) Retrying no 17 failed to connect: Connection refused (errno: 111) Retrying no 18 failed to connect: Connection refused (errno: 111) Retrying no 19 failed to connect: Connection refused (errno: 111) "
20170210:04:40:22:090313 gpsegstart.py_sdw7:gpadmin:sdw7:gpadmin-[INFO]:-Marking failed /data/hawq/segments/3/gpseg38, Unable to connect to server, 9

A further check found that port 40002 was in use by a Java process, the HBase RegionServer:

[root@sdw7 ~]# netstat -anp|grep 40002
tcp 0 0 10.178.143.146:40002 10.178.143.137:2181 ESTABLISHED 86153/java
[root@sdw7 ~]# ps -ef|grep 86153
hbase 86153 86140 0 04:23 ? 00:00:15 /usr/jdk64/jdk1.7.0_67/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -Dhdp.version=3.0.1.0-1 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hbase/hs_err_pid%p.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc.log-201702100423 -Xmn512m -XX:CMSInitiatingOccupancyFraction=70 -Xms4096m -Xmx4096m -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-rmisdcaw7.gphd.local.log -Dhbase.home.dir=/usr/phd/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/phd/3.0.1.0-1/hadoop/lib/native/Linux-amd64-64:/usr/phd/3.0.1.0-1/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
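A plain grep for 40002 also matches remote ports, PIDs, or addresses that happen to contain those digits. As a hedged alternative, the sketch below filters netstat -anp output on the local address column only and prints the owning PID and program. The function name port_owner is hypothetical, and the column positions assume the netstat layout shown above.

```shell
# port_owner PORT: read `netstat -anp` output on stdin and print the
# "PID/program" field of every TCP/UDP socket whose LOCAL address ends in :PORT.
port_owner() {
    awk -v port="$1" '
        $1 ~ /^(tcp|udp)/ {
            n = split($4, a, ":")          # column 4 is the local address
            if (a[n] == port) print $NF    # last column is PID/Program name
        }'
}
```

Usage: `netstat -anp | port_owner 40002`, run as root so that the PID column is populated for processes owned by other users.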

Resolution

Follow these steps to resolve this issue:

  1. Stop the HBase RegionServer on host sdw7 from the Ambari Web UI.
  2. Start the HDB cluster.
  3. Start the HBase RegionServer on host sdw7 again.

 

 
