Running gprecoverseg -v fails with an error during recovery. The same failure can also appear in other Pivotal Greenplum tools, such as gpstate, gpstart, gpstop, or any other tool that spawns many ssh workers to the segments.
20170701:10:16:17:043725 mdw:gpadmin-[DEBUG]:-[worker10] finished cmd:
Get segment status cmdStr='ssh -o
'StrictHostKeyChecking no' bdtcstr26n5 ". /greenplum/greenplum-db/./greenplum_path.sh;
$GPHOME/bin/gp_primarymirror -h sdw1 -p 40007"' had result: cmd had rc=1 completed=True halted=False stdout='' stderr='mode: PrimarySegment segmentState: Ready dataState: InChangeTracking faultType: NotInitialized mode: PrimarySegment segmentState: Ready dataState: InChangeTracking faultType: NotInitialized '
20170701:10:16:17:043725 gprecoverseg:mdw:gpadmin-[DEBUG]:-Encountered error: Not ready to connect to database
20170701:10:16:17:043725 gprecoverseg:mdw:gpadmin-[INFO]:-Unable to connect to database.
"Not ready to connect" is a generic error indicating that something went wrong while obtaining segment information. Before that error, the log shows the status of each segment, and one of the results will have an ssh return code of 255 with an empty stderr string. This can happen randomly on different segments and looks like the example below:
20170913:03:16:48:029516 gprecoverseg:sdw1:gpadmin-[DEBUG]:-[worker11] finished cmd:
Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' sdw1 ". /greenplum/greenplum-db/./greenplum_path.sh;
$GPHOME/bin/gp_primarymirror -h sdw1 -p 40003"'
had result: cmd had rc=255 completed=True halted=False stdout='' stderr=''
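To locate these failures quickly, you can grep the logs for the rc=255 signature. The snippet below demonstrates the pattern against a sample log line so it can be run anywhere; on a real master host, point the grep at the gprecoverseg logs under ~/gpAdminLogs (the default Greenplum admin log directory) instead:

```shell
# Signature of a failed ssh connection in the gprecoverseg log:
# return code 255 combined with an empty stderr string.
# (sample line reproduced here for demonstration)
sample="had result: cmd had rc=255 completed=True halted=False stdout='' stderr=''"
echo "$sample" | grep -q "rc=255 .*stderr=''" && echo "ssh-level failure detected"
```

On a real cluster, something like `grep "rc=255" ~/gpAdminLogs/gprecoverseg_*.log` surfaces all affected workers at once.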
A line with return code rc=255 and an empty stderr output indicates that ssh itself failed. This usually happens in large Greenplum environments with a stressed network: when gprecoverseg has to recover many segments, it must open many ssh connections, and on some systems these connections fail because of a non-optimized ssh daemon configuration. Additional information can be found in /var/log/secure on the segment hosts.
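One telltale entry in /var/log/secure is sshd dropping new connections because too many were still unauthenticated, which a burst of gprecoverseg workers can trigger. The exact message wording varies across OpenSSH versions; the snippet below checks a sample line so it can be verified anywhere, and on a real segment host you would grep /var/log/secure directly:

```shell
# sshd drops unauthenticated connections beyond its MaxStartups limit and
# logs a message mentioning MaxStartups (wording varies by OpenSSH version).
# Sample line shown for demonstration; on a segment host run:
#   grep -i 'maxstartups' /var/log/secure
sample='sshd[1234]: drop connection #11 from [10.0.0.5]:40022 past MaxStartups'
echo "$sample" | grep -ci 'maxstartups'
```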
The following workaround has resolved this issue in some large environments (more than ~300 segments in the cluster) where the error was seen consistently:
- Take note of the current values of MaxStartups and ClientAliveInterval in /etc/ssh/sshd_config on all segment servers in the cluster
- Set the MaxStartups parameter to 100 in /etc/ssh/sshd_config on all segment servers in the cluster
- Comment out ClientAliveInterval (#ClientAliveInterval) in /etc/ssh/sshd_config on all segment servers in the cluster
- Restart the ssh daemon on all servers
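The steps above can be sketched as shell commands. The snippet edits a scratch copy of the file so it can be verified safely; on a real cluster you would apply the same sed edits to /etc/ssh/sshd_config on every segment server (for example via gpssh) and then restart sshd. The starting values shown are assumptions for illustration:

```shell
# Work on a scratch copy so the edit can be tested safely.
# On real hosts the target is /etc/ssh/sshd_config.
conf=$(mktemp)
printf '%s\n' 'MaxStartups 10:30:100' 'ClientAliveInterval 60' > "$conf"

# Record the current values before changing anything.
grep -E '^(MaxStartups|ClientAliveInterval)' "$conf"

# Raise MaxStartups to 100 and comment out ClientAliveInterval.
sed -i -e 's/^MaxStartups.*/MaxStartups 100/' \
       -e 's/^ClientAliveInterval/#ClientAliveInterval/' "$conf"

cat "$conf"
# On the real hosts, follow the edit by restarting the daemon,
# e.g. "sudo systemctl restart sshd" or "sudo service sshd restart",
# whichever your distribution uses.
```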