Pivotal Knowledge Base


GPSTART Segment Error: "Failed to Connect: Connection Refused"

Environment

Product: Pivotal Greenplum
Version: 4.2.x
OS: All Supported OS

Overview

gpstart fails to connect to some segments and returns the error below:

[INFO]:-DBID:62  FAILED  host:' datadir:'/data1/primary/gpseg60' with reason:'Start failed; check segment logfile.  "failed to connect: Connection refused (errno: 111)  failed to connect: Connection refused (errno: 111)  Retrying no 1  failure: timeout  Retrying no 2  failure: OtherTransitionInProgress failure: OtherTransitionInProgress"'exit

And:

20131121:01:55:38:001215 gpstart:mdw:gpadmin-[WARNING]:-FATAL:  DTM initialization: failure during startup recovery, retry failed, check segment status (cdbtm.c:1534)

Description

  • Log in to the failed segments and check whether their postmaster and utility processes exist (for example, with ps -ef | grep <segment port>, as in the ps output further below).
  • If so, run the shell script below on the master. It writes a script "test.sh" that pings each primary segment in utility mode to find out which segments currently cannot accept connections.
# List every primary segment (dbid, hostname, port) from the catalog, then
# emit one utility-mode connection test per segment into test.sh.
PGOPTIONS='-c gp_session_role=utility' psql -d template1 -Atc "copy (select dbid, hostname, port from gp_segment_configuration where role = 'p' and content != -1) to stdout delimiter ' '" | while read dbid host port; do
  echo "echo DBID: $dbid"
  echo "PGOPTIONS='-c gp_session_role=utility' psql -h $host -p $port -d template1 -c 'select 1;'"
done > test.sh
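For illustration, the generated "test.sh" contains one pair of lines per primary segment, similar to the following (the DBIDs, hostnames, and ports are examples and will differ on your cluster):

echo DBID: 2
PGOPTIONS='-c gp_session_role=utility' psql -h sdw1 -p 40000 -d template1 -c 'select 1;'
echo DBID: 3
PGOPTIONS='-c gp_session_role=utility' psql -h sdw1 -p 40001 -d template1 -c 'select 1;'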
  • Execute "test.sh" on the master and redirect its output to "test.out":
chmod 755 test.sh
./test.sh > test.out 2>&1
  • Check which DBIDs cannot be connected to:
grep starting test.out
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
(...)
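Because "test.sh" echoes the DBID before each connection attempt, adding one line of leading context shows exactly which DBIDs are still starting up (the DBID values below are only examples):

grep -B 1 starting test.out
DBID: 26
psql: FATAL:  the database system is starting up
DBID: 27
psql: FATAL:  the database system is starting up
(...)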
  • Log in to the host of a DBID that cannot be connected to and check whether a "startup pass 2" process exists:
 gpadmin@sdw5:~> ps -ef|grep 40000
gpadmin 24956 1 0 03:12 ? 00:00:10 /usr/local/greenplum-db-4.2.5.2/bin/postgres -D /data1/primary/gpseg24 -p 40000 -b 26 -z 96 --silent-mode=true -i -M quiescent -C 24
gpadmin 25029 24956 0 03:12 ? 00:00:00 postgres: port 40000, logger process
gpadmin 25068 24956 0 03:12 ? 00:00:18 postgres: port 40000, filerep transition process
gpadmin 25069 24956 0 03:12 ? 00:00:05 postgres: port 40000, primary process
gpadmin 25070 25069 0 03:12 ? 00:00:14 postgres: port 40000, primary receiver ack process
gpadmin 25071 25069 0 03:12 ? 00:01:14 postgres: port 40000, primary sender process
gpadmin 25072 25069 0 03:12 ? 00:00:19 postgres: port 40000, primary consumer ack process
gpadmin 25073 25069 0 03:12 ? 00:00:06 postgres: port 40000, primary recovery process
gpadmin 25074 25069 0 03:12 ? 00:00:03 postgres: port 40000, primary verification process
gpadmin 25095 24956 2 03:14 ? 00:04:30 postgres: port 40000, startup pass 2 process
gpadmin 29266 29231 0 06:31 pts/2 00:00:00 grep 40000
  • If so, strace that process (for example, strace -p 25095, using the PID from the ps output above) to confirm that it is actually making progress:
Process 25095 attached - interrupt to quit
semop(1508639228, 0x7fffddb442d0, 1)    = 0
open("global/5090", O_RDWR)             = 9
semop(1508639228, 0x7fffddb41590, 1)    = 0
semop(1508639228, 0x7fffddb41590, 1)    = 0
semop(1508540921, 0x7fffddb41590, 1)    = 0
semop(1508639228, 0x7fffddb45500, 1)    = 0
semop(1508639228, 0x7fffddb45500, 1)    = 0
semop(1508540921, 0x7fffddb45500, 1)    = 0
lseek(9, 848723968, SEEK_SET)           = 848723968
write(9, "\274\0\0\0PkUQ\1\0\0\0\364\3\0\4\0\200\4\200\200\377\374\0\0\377\374\0\200\376\374\0"..., 32768) = 32768
lseek(9, 653361152, SEEK_SET)           = 653361152
read(9, "\274\0\0\0\260a\363P\1\0\0\0\364\3\0\4\0\200\4\200\200\377\374\0\0\377\374\0\200\376\374\0"..., 32768) = 32768
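In the strace output above, file descriptor 9 is the relation file currently being recovered. On Linux, the descriptor can be resolved to a path through /proc if you want to confirm which file the process is working on (PID taken from this example):

ls -l /proc/25095/fd/9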

In the above example, the segment is rolling back a large transaction: a "DROP DATABASE" statement had been killed before GPDB was shut down. Because that database contained a very large number of files, recovery took several hours to complete. In such a case, wait for recovery to finish, at which point the primary segment can again be connected to in utility mode. Keep re-running "test.sh" to verify that the number of segments reporting "starting up" is decreasing.

gpadmin@mdw:~/> ./test.sh 2>&1 | grep starting
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up

gpadmin@mdw:~/> ./test.sh 2>&1 | grep starting
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
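To track the count directly rather than eyeballing the lines, the same check can simply be counted:

./test.sh 2>&1 | grep -c starting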
  • Finally, all segments should finish recovery and no "starting up" segments should remain. Once that is the case, Greenplum can be restarted:
gpstop -af
gpstart
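After the restart, overall cluster and segment status can be verified with the standard gpstate utility (run here without options for a brief status summary):

gpstate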

NOTE: When primary segments are in recovery, do NOT restart Greenplum immediately after you see the gpstart errors; the primary segments need to complete their recovery regardless.
