Pivotal Knowledge Base


gp_segment_connect_timeout - Explained

Environment

Product Version
Pivotal Greenplum All Versions

Goal

This article explains how the parameter "gp_segment_connect_timeout" works.

Check out the related article for more information on how FTS (the fault tolerance service) works.

Solution / Example

gp_segment_connect_timeout is the maximum time the Greenplum interconnect will spend trying to connect to a segment instance over the network before timing out. It controls the network connection timeout between the master and the primary segments, and between the primary-to-mirror segment replication processes.

To check the current value of the parameter (the default is 10 minutes), run:

[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout
Values on all segments are consistent
GUC          : gp_segment_connect_timeout
Master  value: 10min
Segment value: 10min
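The value in effect can also be read from any database session with standard SQL (a sketch; the `flightdata` database name is taken from the examples in this article):

```shell
# SHOW reports the value of the GUC in effect for the current session
psql -d flightdata -c "SHOW gp_segment_connect_timeout;"
```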

The current cluster configuration is:

flightdata=# select * from gp_segment_configuration order by 2;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    1 |      -1 | p    | p              | s    | u      |  4340 | mdw      | mdw     |                  |
   10 |      -1 | m    | m              | s    | u      |  4340 | smdw     | smdw    |                  |
    2 |       0 | p    | p              | s    | u      | 43400 | sdw3     | sdw3    |            49340 |
    6 |       0 | m    | m              | s    | u      | 59340 | sdw4     | sdw4    |            56340 |
    7 |       1 | m    | m              | s    | u      | 59341 | sdw4     | sdw4    |            56341 |
    3 |       1 | p    | p              | s    | u      | 43401 | sdw3     | sdw3    |            49341 |
    4 |       2 | p    | p              | s    | u      | 43400 | sdw4     | sdw4    |            49340 |
    8 |       2 | m    | m              | s    | u      | 59340 | sdw3     | sdw3    |            56340 |
    5 |       3 | p    | p              | s    | u      | 43401 | sdw4     | sdw4    |            49341 |
    9 |       3 | m    | m              | s    | u      | 59341 | sdw3     | sdw3    |            56341 |
(10 rows)
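A self-join on gp_segment_configuration makes the primary/mirror pairing per content ID easier to read (a sketch against the 4.3-era catalog shown above):

```shell
# Pair each primary with its mirror by content ID; content -1 (the master
# and standby master) is excluded.
psql -d flightdata -c "
SELECT p.content,
       p.hostname AS primary_host, p.port AS primary_port,
       m.hostname AS mirror_host,  m.port AS mirror_port
FROM   gp_segment_configuration p
JOIN   gp_segment_configuration m
  ON   m.content = p.content AND m.role = 'm'
WHERE  p.role = 'p' AND p.content >= 0
ORDER  BY p.content;"
```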

Let's pick the mirror of content 3 at sdw3:59341 (dbid 9) and put its mirror processes to sleep:

[gpadmin@sdw3 ~]$ ps -ef | grep 59341
gpadmin  31307     1  0 07:07 ?        00:00:00 /usr/local/greenplum-db-4.3.4.0/bin/postgres -D /data2/mirror/fai_43403 -p 59341 -b 9 -z 4 --silent-mode=true -i -M quiescent -C 3
gpadmin  31308 31307  0 07:07 ?        00:00:00 postgres: port 59341, logger process
gpadmin  31312 31307  0 07:07 ?        00:00:00 postgres: port 59341, mirror process
gpadmin  31313 31312  0 07:07 ?        00:00:02 postgres: port 59341, mirror receiver process
gpadmin  31314 31312  0 07:07 ?        00:00:02 postgres: port 59341, mirror consumer process
gpadmin  31315 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror consumer writer process
gpadmin  31316 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror consumer append only process
gpadmin  31317 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror sender ack process
gpadmin  31318 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror verification process
gpadmin  31765 31724  0 07:21 pts/3    00:00:00 grep 59341

[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31312
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31313
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31314
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31315
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31316
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31317
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31318
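Sending SIGSTOP to each PID by hand works, but the same effect can be scripted. The sketch below demonstrates the mechanism on a placeholder `sleep` process so it runs anywhere; on a real segment host you would substitute the mirror PIDs, e.g. `pgrep -f 'port 59341, mirror'`.

```shell
# Demonstrate that SIGSTOP freezes a process (ps state 'T') without killing
# it, and that SIGCONT resumes it. On a segment host you would loop over the
# mirror PIDs instead, e.g.:
#   for pid in $(pgrep -f 'port 59341, mirror'); do kill -s SIGSTOP "$pid"; done
sleep 60 &
pid=$!
kill -s SIGSTOP "$pid"
sleep 1                                   # give the kernel a moment to stop it
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "state after SIGSTOP: $state"        # 'T' indicates stopped
kill -s SIGCONT "$pid"
kill "$pid" 2>/dev/null
```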

The primary segment of each content checks its mirror's health, so it is the primary segment's log that needs to be checked to see why the mirror was reported unreachable.

-- After 10 minutes, the primary segment log starts to report that the mirror segment is not able to keep up with the primary

2015-02-18 04:34:20.206197 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=600' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
2015-02-18 04:36:55.113403 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(30264)' filerep state 'up and running'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3962' ",,0,,"cdbfilerepprimaryack.c",898,
2015-02-18 04:36:55.113494 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'heartBeat' operation 'heart beat' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(30264)' filerep state 'up and running'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3962'
position begin '0x2baa5cdfb040' position end '0x2baa5d5fb100' position insert '0x2baa5d0b8528' position consume '0x2baa5d0b8528' position wraparound '0x2baa5d5fb100' insert count '39006' consume count '39006' ",,0,,"cdbfilerepprimary.c",1432,
2015-02-18 04:37:21.830764 PST,,,p30260,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'shutdown' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in shutdown' process name(pid) 'filerep main process(30260)' filerep state 'not initialized'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3963' ",,0,,"cdbfilerepprimaryack.c",898,
2015-02-18 04:37:21.830834 PST,,,p30260,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'shutdown' operation 'shutdown' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in shutdown' process name(pid) 'filerep main process(30260)' filerep state 'not initialized'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3963'
position begin '0x2baa5cdfb040' position end '0x2baa5d5fb100' position insert '0x2baa5d0ba240' position consume '0x2baa5d0b8528' position wraparound '0x2baa5d5fb100' insert count '39007' consume count '39007' ",,0,,"cdbfilerepprimary.c",1311,
2015-02-18 04:37:21.885026 PST,,,p28586,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","filerep main process (PID 30260) exited with exit code 0",,,,,,,0,,"postmaster.c",5876,
2015-02-18 04:37:21.891763 PST,,,p31001,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) 'sdw4(49341)' mirror address(port) 'sdw3(56341)'",,,,,"mirroring role 'primary role' mirroring state 'change tracking' segment state 'not initialized' process name(pid) 'filerep main process(31001)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3466,
2015-02-18 04:37:21.893572 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: ChangeTracking_RetrieveIsTransitionToInsync() found insync_transition_completed:'true' full resync:'false'",,,,,,,0,,"cdbresynchronizechangetracking.c",2545,
2015-02-18 04:37:21.894052 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","last checkpoint location for generating initial changetracking log 1/38000490",,,,,,,0,,"xlog.c",11560,
2015-02-18 04:37:21.894123 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","scanned through 1 initial xlog records since last checkpoint for writing into the resynchronize change log",,,,,,,0,,"cdbresynchronizechangetracking.c",206,

-- During its polling round, the master ftsprobe process detected that the primary of content 3 was healthy but could not reach its mirror, so the primary requested ftsprobe to mark the mirror segment down. Snippet from the master log during that time:

2015-02-18 04:33:31.855817 PST,,,p21713,th2056423744,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=5, content=3) reported fault FaultMirror segmentstatus 11 to the prober.",,,,,,,0,,,,
2015-02-18 04:33:31.855864 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=5) reported mirroring fault with mirror (dbid=9), mirror considered to be down.",,,,,,,0,,"ftsfilerep.c",358,
2015-02-18 04:33:31.855883 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=5, content=3) from ('u','p') to ('u','p')",,,,,,,0,,"fts.c",1157,
2015-02-18 04:33:31.855891 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=9, content=3) from ('u','m') to ('d','m')",,,,,,,0,,"fts.c",1157,
2015-02-18 04:33:31.855901 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,x21888,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",976,
2015-02-18 04:33:31.856719 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=5) on sdw4:43401 transitioning to change-tracking mode, mirror marked as down.",,,,,,,0,,"ftsfilerep.c",498,
2015-02-18 04:33:32.032252 PST,,,p21706,th1770300528,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
Success:",,,,,,,,"SysLoggerMain","syslogger.c",552,

-- The master then updates the mirror's status in its configuration:

flightdata=# select * from gp_segment_configuration where status='d';
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    9 |       3 | m    | m              | s    | d      | 59341 | sdw3     | sdw3    |            56341 |
(1 row)
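Once the test is complete, the frozen mirror processes must be resumed and the mirror recovered. A sketch of the recovery steps (the `pgrep` pattern matches the process titles shown earlier; `gprecoverseg` and `gpstate` are run from the master as gpadmin):

```shell
# On sdw3: resume the mirror processes that were frozen with SIGSTOP
for pid in $(pgrep -f 'port 59341, mirror'); do
    kill -s SIGCONT "$pid"
done

# On the master: incremental recovery of the downed mirror (-a skips the
# confirmation prompt), then confirm mirror status
gprecoverseg -a
gpstate -m
```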

Conclusion

If a mirror is down, check the primary segment's log for the reason. If the mirror went down due to gp_segment_connect_timeout, it may have been unreachable for reasons such as the following (not an exhaustive list):

-- Network slowness or congestion

-- High load on the server

-- Server unreachable

You may increase the timeout; however, it is important to understand why the timeout was hit in the first place. It may be that the system is overloaded and needs more time to keep the mirrors in sync. Please check the admin guide for more details on how to change this parameter.
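If increasing the timeout is justified, the change can be sketched as follows (the 20min value is only an example; choose a value appropriate for your environment):

```shell
# Set the GUC cluster-wide, then reload the configuration without a restart
gpconfig -c gp_segment_connect_timeout -v 20min
gpstop -u
gpconfig -s gp_segment_connect_timeout   # verify the new value
```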
