Pivotal Knowledge Base

gpstart fails with none of its segments started

Environment

Product                    Version
Pivotal Greenplum (GPDB)   4.2.x, 4.3.x
OS                         RHEL 6.x

Symptom

Running gpstart to start the GPDB cluster fails after several minutes, with none of the segments started.

The error message is shown below:

20160416:03:41:53:013788 gpstart:sdw0537:gpadmin-[INFO]:-Starting gpstart with args: -a -R
20160416:03:41:53:013788 gpstart:sdw0537:gpadmin-[INFO]:-Gathering information and validating the environment...
20160416:03:41:53:013788 gpstart:sdw0537:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.5.4 build 1'
20160416:03:41:53:013788 gpstart:sdw0537:gpadmin-[INFO]:-Greenplum Catalog Version: '201310150'
20160416:03:41:53:013788 gpstart:sdw0537:gpadmin-[INFO]:-Starting Master instance in admin mode
20160416:03:41:55:013788 gpstart:sdw0537:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20160416:03:41:55:013788 gpstart:sdw0537:gpadmin-[INFO]:-Obtaining Segment details from master...
20160416:03:41:57:013788 gpstart:sdw0537:gpadmin-[INFO]:-Setting new master era
20160416:03:41:57:013788 gpstart:sdw0537:gpadmin-[INFO]:-Master Started...
20160416:03:41:57:013788 gpstart:sdw0537:gpadmin-[INFO]:-Shutting down master
20160416:03:41:59:013788 gpstart:sdw0537:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
20160416:03:47:04:013788 gpstart:sdw0537:gpadmin-[INFO]:-Process results...
20160416:03:47:04:013788 gpstart:sdw0537:gpadmin-[ERROR]:-No segment started for content: 0.
20160416:03:47:04:013788 gpstart:sdw0537:gpadmin-[INFO]:-dumping success segments: []
20160416:03:47:04:013788 gpstart:sdw0537:gpadmin-[INFO]:-----------------------------------------------------
20160416:03:47:04:013788 gpstart:sdw0537:gpadmin-[INFO]:-DBID:234 FAILED host:'sdw0529.gpdb.local' datadir:'/data1/mirror/gpseg104' with reason:'cmd had rc=255 completed=True halted=False
......
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-DBID:129 FAILED host:'sdw0531.gpdb.local' datadir:'/data4/primary/gpseg127' with reason:'cmd had rc=255 completed=True halted=False
stdout='20160416:03:41:59:012200 gpsegstart.py_sdw0531:gpadmin:sdw0531:gpadmin-[INFO]:-Starting gpsegstart.py with args: -C en_US.utf8:en_US.utf8:en_US.utf8 -M quiescent -V postgres (Greenplum Database) 4.3.5.4 build 1 -n 128 --era 2f50b50abc5e788c_160416034157 -t 600 -p KGRwMApTJ2Ric0J5UG9ydCcKcDEKKGRwMgpJNDAwMDAKKGRwMwpTJ3RhcmdldE1vZGUnCnA0ClMncHJpbWFyeScKcDUKc1MnZGJpZCcKcDYKSTEyMgpzUydob3N0TmFtZScKcDcKUydzbDczdnNhaGRwMDUzMScKcDgKc1MncGVlclBvcnQnCnA5Ckk1MTAwMApzUydwZWVyUE1Qb3J0JwpwMTAKSTUwMDAwCnNTJ3BlZXJOYW1lJwpwMTEKUydzbDczdnNhaGRwMDUwMScKcDEyCnNTJ2Z1bGxSZXN5bmNGbGFnJwpwMTMKSTAwCnNTJ21vZGUnCnAxNApTJ3MnCnAxNQpzUydob3N0UG9ydCcKcDE2Ckk0MTAwMApzc0k0MDAwMQooZHAxNwpnNApnNQpzZzYKSTEyMwpzZzcKUydzbDczdnNhaGRwMDUzMScKcDE4CnNnOQpJNTEwMDEKc2cxMApJNTAwMDEKc2cxMQpTJ3NsNzN2c2FoZHAwNTAxJwpwMTkKc2cxMwpJMDAKc2cxNApnMTUKc2cxNgpJNDEwMDEKc3NJNDAwMDIKKGRwMjAKZzQKZzUKc2c2CkkxMjQKc2c3ClMnc2w3M3ZzYWhkcDA1MzEnCnAyMQpzZzkKSTUxMDAyCnNnMTAKSTUwMDAyCnNnMTEKUydzbDczdnNhaGRwMDUwMScKcDIyCnNnMTMKSTAwCnNnMTQKZzE1CnNnMTYKSTQxMDAyCnNzSTQwMDAzCihkcDIzCmc0Cmc1CnNnNgpJMTI1CnNnNwpTJ3NsNzN2c2FoZHAwNTMxJwpwMjQKc2c5Ckk1MTAwMwpzZzEwCkk1MDAwMwpzZzExClMnc2w3M3ZzYWhkcDA1MDEnCnAyNQpzZzEzCkkwMApzZzE0CmcxNQpzZzE2Ckk0MTAwMwpzc0k0MDAwNAooZHAyNgpnNApnNQpzZzYKSTEyNgpzZzcKUydzbDczdnNhaGRwMDUzMScKcDI3CnNnOQpJNTEwMDQKc2cxMApJNTAwMDQKc2cxMQpTJ3NsNzN2c2FoZHAwNTAxJwpwMjgKc2cxMwpJMDAKc2cxNApnMTUKc2cxNgpJNDEwMDQKc3NJNDAwMDUKKGRwMjkKZzQKZzUKc2c2CkkxMjcKc2c3ClMnc2w3M3ZzYWhkcDA1MzEnCnAzMApzZzkKSTUxMDA1CnNnMTAKSTUwMDA1CnNnMTEKUydzbDczdnNhaGRwMDUwMScKcDMxCnNnMTMKSTAwCnNnMTQKZzE1CnNnMTYKSTQxMDA1CnNzSTQwMDA2CihkcDMyCmc0Cmc1CnNnNgpJMTI4CnNnNwpTJ3NsNzN2c2FoZHAwNTMxJwpwMzMKc2c5Ckk1MTAwNgpzZzEwCkk1MDAwNgpzZzExClMnc2w3M3ZzYWhkcDA1MDEnCnAzNApzZzEzCkkwMApzZzE0CmcxNQpzZzE2Ckk0MTAwNgpzc0k0MDAwNwooZHAzNQpnNApnNQpzZzYKSTEyOQpzZzcKUydzbDczdnNhaGRwMDUzMScKcDM2CnNnOQpJNTEwMDcKc2cxMApJNTAwMDcKc2cxMQpTJ3NsNzN2c2FoZHAwNTAxJwpwMzcKc2cxMwpJMDAKc2cxNApnMTUKc2cxNgpJNDEwMDcKc3NJNTAwMDAKKGRwMzgKZzQKUydtaXJyb3InCnAzOQpzZzYKSTI0MgpzZzcKUydzbDczdnNhaGRwMDUz
MScKcDQwCnNnOQpJNDEwMDAKc2cxMApJNDAwMDAKc2cxMQpTJ3NsNzN2c2FoZHAwNTI5JwpwNDEKc2cxMwpJMDAKc2cxNApnMTUKc2cxNgpJNTEwMDAKc3NJNTAwMDEKKGRwNDIKZzQKZzM5CnNnNgpJMjQzCnNnNwpTJ3NsNzN2c2FoZHAwNTMxJwpwNDMKc2c5Ckk0MTAwMQpzZzEwCkk0MDAwMQpzZzExClMnc2w3M3ZzYWhkcDA1MjknCnA0NApzZzEzCkkwMApzZzE0CmcxNQpzZzE2Ckk1MTAwMQpzc0k1MDAwMgooZHA0NQpnNApnMzkKc2c2CkkyNDQKc2c3ClMnc2w3M3ZzYWhkcDA1MzEnCnA0NgpzZzkKSTQxMDAyCnNnMTAKSTQwMDAyCnNnMTEKUydzbDczdnNhaGRwMDUyOScKcDQ3CnNnMTMKSTAwCnNnMTQKZzE1CnNnMTYKSTUxMDAyCnNzSTUwMDAzCihkcDQ4Cmc0CmczOQpzZzYKSTI0NQpzZzcKUydzbDczdnNhaGRwMDUzMScKcDQ5CnNnOQpJNDEwMDMKc2cxMApJNDAwMDMKc2cxMQpTJ3NsNzN2c2FoZHAwNTI5JwpwNTAKc2cxMwpJMDAKc2cxNApnMTUKc2cxNgpJNTEwMDMKc3NJNTAwMDQKKGRwNTEKZzQKZzM5CnNnNgpJMjQ2CnNnNwpTJ3NsNzN2c2FoZHAwNTMxJwpwNTIKc2c5Ckk0MTAwNApzZzEwCkk0MDAwNApzZzExClMnc2w3M3ZzYWhkcDA1MjknCnA1MwpzZzEzCkkwMApzZzE0CmcxNQpzZzE2Ckk1MTAwNApzc0k1MDAwNQooZHA1NApnNApnMzkKc2c2CkkyNDcKc2c3ClMnc2w3M3ZzYWhkcDA1MzEnCnA1NQpzZzkKSTQxMDA1CnNnMTAKSTQwMDA1CnNnMTEKUydzbDczdnNhaGRwMDUyOScKcDU2CnNnMTMKSTAwCnNnMTQKZzE1CnNnMTYKSTUxMDA1CnNzSTUwMDA2CihkcDU3Cmc0CmczOQpzZzYKSTI0OApzZzcKUydzbDczdnNhaGRwMDUzMScKcDU4CnNnOQpJNDEwMDYKc2cxMApJNDAwMDYKc2cxMQpTJ3NsNzN2c2FoZHAwNTI5JwpwNTkKc2cxMwpJMDAKc2cxNApnMTUKc2cxNgpJNTEwMDYKc3NJNTAwMDcKKGRwNjAKZzQKZzM5CnNnNgpJMjQ5CnNnNwpTJ3NsNzN2c2FoZHAwNTMxJwpwNjEKc2c5Ckk0MTAwNwpzZzEwCkk0MDAwNwpzZzExClMnc2w3M3ZzYWhkcDA1MjknCnA2MgpzZzEzCkkwMApzZzE0CmcxNQpzZzE2Ckk1MTAwNwpzc3Mu -D 242|112|m|m|s|u|sdw0531.gpdb.local|sdw0531|50000|51000|/data1/mirror/gpseg112|| -D 243|113|m|m|s|u|sdw0531.gpdb.local|sdw0531|50001|51001|/data1/mirror/gpseg113|| -D 244|114|m|m|s|u|sdw0531.gpdb.local|sdw0531|50002|51002|/data2/mirror/gpseg114|| -D 245|115|m|m|s|u|sdw0531.gpdb.local|sdw0531|50003|51003|/data2/mirror/gpseg115|| -D 246|116|m|m|s|u|sdw0531.gpdb.local|sdw0531|50004|51004|/data3/mirror/gpseg116|| -D 247|117|m|m|s|u|sdw0531.gpdb.local|sdw0531|50005|51005|/data3/mirror/gpseg117|| -D 248|118|m|m|s|u|sdw0531.gpdb.local|sdw0531|50006|51006|/data4/mirror/gpseg118|| -D 
249|119|m|m|s|u|sdw0531.gpdb.local|sdw0531|50007|51007|/data4/mirror/gpseg119|| -D 122|120|p|p|s|u|sdw0531.gpdb.local|sdw0531|40000|41000|/data1/primary/gpseg120|| -D 123|121|p|p|s|u|sdw0531.gpdb.local|sdw0531|40001|41001|/data1/primary/gpseg121|| -D 124|122|p|p|s|u|sdw0531.gpdb.local|sdw0531|40002|41002|/data2/primary/gpseg122|| -D 125|123|p|p|s|u|sdw0531.gpdb.local|sdw0531|40003|41003|/data2/primary/gpseg123|| -D 126|124|p|p|s|u|sdw0531.gpdb.local|sdw0531|40004|41004|/data3/primary/gpseg124|| -D 127|125|p|p|s|u|sdw0531.gpdb.local|sdw0531|40005|41005|/data3/primary/gpseg125|| -D 128|126|p|p|s|u|sdw0531.gpdb.local|sdw0531|40006|41006|/data4/primary/gpseg126|| -D 129|127|p|p|s|u|sdw0531.gpdb.local|sdw0531|40007|41007|/data4/primary/gpseg127||
20160416:03:42:00:012200 gpsegstart.py_sdw0531:gpadmin:sdw0531:gpadmin-[INFO]:-Validating directories...
20160416:03:42:00:012200 gpsegstart.py_sdw0531:gpadmin:sdw0531:gpadmin-[INFO]:-Validating directory: /data1/mirror/gpseg113
......
20160416:03:42:04:012200 gpsegstart.py_sdw0531:gpadmin:sdw0531:gpadmin-[INFO]:-Postmaster /data1/primary/gpseg121 is running (pid 12333)
20160416:03:42:04:012200 gpsegstart.py_sdw0531:gpadmin:sdw0531:gpadmin-[INFO]:-Transitioning segments, mirroringMode is quiescent...
'
stderr='Connection to sdw0531 closed by remote host.
''
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-----------------------------------------------------
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-----------------------------------------------------
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:- Successful segment starts = 0
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[WARNING]:-Failed segment starts = 256 <<<<<<<<
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-----------------------------------------------------
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-Successfully started 0 of 256 segment instances <<<<<<<<
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-----------------------------------------------------
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[WARNING]:-Segment instance startup failures reported
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[WARNING]:-Failed start 256 of 256 segment instances <<<<<<<<
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[WARNING]:-Review /home/gpadmin/gpAdminLogs/gpstart_20160416.log
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-----------------------------------------------------
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-Commencing parallel segment instance shutdown, please wait...
.. 20160416:03:47:21:013788 gpstart:sdw0537:gpadmin-[ERROR]:-gpstart error: Do not have enough valid segments to start the array.

Cause

This can happen when the SSH session to a segment host times out before the segments on that host have finished starting.
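
One way to check whether a timeout is plausible is to measure how long segment startup ran before gpstart reported the failure. The Python sketch below compares the timestamps of two lines copied from the sample log above. The helper name gp_log_time is hypothetical, and the code assumes every log line begins with a YYYYMMDD:HH:MM:SS timestamp, as in that sample.

```python
from datetime import datetime


def gp_log_time(line):
    """Parse the leading 'YYYYMMDD:HH:MM:SS' timestamp of a gpstart log line."""
    # The timestamp occupies the first 17 characters, e.g. '20160416:03:41:59'.
    return datetime.strptime(line[:17], "%Y%m%d:%H:%M:%S")


# Two lines taken from the sample log in the Symptom section.
start_line = ("20160416:03:41:59:013788 gpstart:sdw0537:gpadmin-[INFO]:-"
              "Commencing parallel primary and mirror segment instance startup, please wait...")
end_line = "20160416:03:47:04:013788 gpstart:sdw0537:gpadmin-[INFO]:-Process results..."

elapsed_seconds = int((gp_log_time(end_line) - gp_log_time(start_line)).total_seconds())
print(elapsed_seconds)
```

For the sample log this gives 305 seconds, just over five minutes. If that figure is close to the ClientAliveInterval configured on the segment hosts, an SSH session timeout is a likely suspect.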

RCA

When gpstart runs, it connects to each segment host over SSH and runs the gpsegstart.py script to start all segments on that host. Once gpsegstart.py has finished on every host, gpstart reports the result.

In some cases, segments take longer than usual to start (for example, when they perform crash recovery). If the SSH server's client-alive check (ClientAliveInterval) is enabled, the SSH session running gpsegstart.py can be disconnected by the timeout before the segments have finished starting. The message "Connection to xxx closed by remote host" then appears in the log, as in the example above.
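
To confirm that a given gpstart log matches this pattern, the log can be scanned for the disconnect message and the startup summary. The following Python sketch does this. The function name summarize_gpstart_log is hypothetical, and both regular expressions are based only on the wording of the sample log above, so other GPDB versions may need different patterns.

```python
import re

# Patterns derived from the sample log in this article (an assumption, not a documented format).
SUMMARY_RE = re.compile(r"Successfully started (\d+) of (\d+) segment instances")
DISCONNECT_RE = re.compile(r"Connection to (\S+) closed by remote host")


def summarize_gpstart_log(log_text):
    """Return (started, total, hosts) where hosts lists SSH sessions that were closed."""
    started = total = None
    match = SUMMARY_RE.search(log_text)
    if match:
        started, total = int(match.group(1)), int(match.group(2))
    # Deduplicate and sort host names so the result is stable.
    hosts = sorted(set(DISCONNECT_RE.findall(log_text)))
    return started, total, hosts


# Excerpt of the sample log shown in the Symptom section.
sample = """stderr='Connection to sdw0531 closed by remote host.
20160416:03:47:17:013788 gpstart:sdw0537:gpadmin-[INFO]:-Successfully started 0 of 256 segment instances <<<<<<<<
"""

print(summarize_gpstart_log(sample))
```

A result with zero started segments and one or more disconnected hosts points to the SSH timeout described here rather than to a problem inside the segments themselves.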

The timeout is set with the ClientAliveInterval parameter in the SSH daemon configuration file, usually /etc/ssh/sshd_config. Setting ClientAliveInterval to 0 disables the client-alive check, so the session is never disconnected by this timeout.
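
Before changing anything, it can help to see which value is currently in effect on a segment host. The Python sketch below reads the setting from the text of an sshd_config file. The function name client_alive_interval is hypothetical. As simplifications, the code only looks at global settings before the first Match block, uses the first occurrence of the keyword, and expects a plain number of seconds (a value such as 5m would raise ValueError).

```python
def client_alive_interval(sshd_config_text):
    """Return the global ClientAliveInterval in seconds; 0 means the check is disabled."""
    for raw in sshd_config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Keywords are case-insensitive; allow 'Keyword value' or 'Keyword=value'.
        words = line.replace("=", " ", 1).split()
        keyword = words[0].lower()
        if keyword == "match":
            break  # simplification: ignore per-connection Match blocks
        if keyword == "clientaliveinterval" and len(words) >= 2:
            return int(words[1])
    return 0  # keyword not set: sshd's default is 0 (disabled)


# Illustrative config text, not taken from a real host.
example = """# example only
Port 22
ClientAliveInterval 300
ClientAliveCountMax 0
"""

print(client_alive_interval(example))
```

On a real host the text would come from reading /etc/ssh/sshd_config, which normally requires root privileges.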

Resolution

In most cases, GPDB segments complete startup in about 5 to 10 minutes. If startup fails because of the SSH session timeout, either disable the client-alive check by setting ClientAliveInterval to 0, or increase it to a value long enough for segment startup. To change the ClientAliveInterval parameter:

  1. On each segment host, change to the /etc/ssh directory.
  2. Open sshd_config in a text editor such as vi.
  3. Change the ClientAliveInterval setting as needed and save the file.
  4. As root, run "service sshd restart" to apply the change.
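
With many segment hosts, editing each file by hand is error-prone, so step 3 could be scripted. The following Python sketch shows only the text transformation. The function name set_client_alive_interval is hypothetical. It replaces the first active ClientAliveInterval line, or adds one at the top of the file if none exists, on the assumption that sshd uses the first value it finds for a keyword. It does not write to /etc/ssh/sshd_config or restart sshd, so step 4 is still required.

```python
def set_client_alive_interval(sshd_config_text, seconds):
    """Return sshd_config text with ClientAliveInterval set to the given seconds."""
    new_line = "ClientAliveInterval %d" % seconds
    lines = sshd_config_text.splitlines()
    for i, raw in enumerate(lines):
        words = raw.replace("=", " ", 1).split()
        # Commented lines start with '#', so their first word never matches.
        if words and words[0].lower() == "clientaliveinterval":
            lines[i] = new_line
            break
    else:
        # No active setting found: add it at the top, ahead of any Match block.
        lines.insert(0, new_line)
    return "\n".join(lines) + "\n"


# Illustrative config text, not taken from a real host.
before = "Port 22\nClientAliveInterval 300\n"
print(set_client_alive_interval(before, 0))
```

Keeping a backup copy of the original sshd_config before applying any generated text is advisable, since a broken configuration can prevent sshd from restarting.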

 
