Pivotal Knowledge Base

Follow

gp_fts_probe_timeout / gp_fts_probe_retries - Explained

Goal

This document will take a look on how the parameter "gp_fts_probe_timeout" &"gp_fts_probe_retries" work.

Checkout the document / article for more information on how FTS works.  

Solution / Example

"gp_fts_probe_timeout" is the allowed timeout for the fault detection process (ftsprobe) to establish a connection to a segment before declaring it down.

"gp_fts_probe_retries" is the number of attempt to probe the segments.

Taking a practical example of the same based on the above description

-- Lets say our default timeout parameter value is

[gpadmin@mdw ~]$ gpconfig -s gp_fts_probe_timeout
Values on all segments are consistent
GUC          : gp_fts_probe_timeout
Master  value: 20s
Segment value: 20s

-- The cluster information with all the primary segments is

flightdata=# select * from gp_segment_configuration where role='p';
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    1 |      -1 | p    | p              | s    | u      |  4340 | mdw      | mdw     |                  |
    2 |       0 | p    | p              | s    | u      | 43400 | sdw3     | sdw3    |            49340 |
    3 |       1 | p    | p              | s    | u      | 43401 | sdw3     | sdw3    |            49341 |
    4 |       2 | p    | p              | s    | u      | 43400 | sdw4     | sdw4    |            49340 |
    5 |       3 | p    | p              | s    | u      | 43401 | sdw4     | sdw4    |            49341 |
(5 rows)

Since the ftsprobe of the master communicate to only its primary segments and not to the mirror (mirror is contacted by the primary segment for its health) , the gp_fts_probe_timeout is the time allowed the primary to response back to master call

-- So the fts probe process information for my cluster that is running on port 4340 is

[gpadmin@mdw ~]$ ps -ef | grep -i fts | grep 4340
gpadmin  16695 16687  0 Feb12 ?        00:00:00 postgres: port  4340, ftsprobe process

-- Let interrupt one of the primary segment process communication by asking the primary process postmaster to sleep

[gpadmin@mdw ~]$ ssh sdw3
Last login: Tue Feb 17 10:45:54 2015 from mdw
[gpadmin@sdw3 ~]$ ps -ef | grep 43400 | grep silent
gpadmin  10769     1  0 Feb12 ?        00:00:05 /usr/local/greenplum-db-4.3.4.0/bin/postgres -D /data1/primary/fai_43400 -p 43400 -b 2 -z 4 --silent-mode=true -i -M quiescent -C 0
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 10769
[gpadmin@sdw3 ~]$

-- On master when the fts probe process kicks again based on the parameter "gp_fts_probe_interval" and turning on the strace for the ftsprobe process to check what is going on , you will find the ftsprobe process goes into FUTEX WAIT mode during this particular ping

[gpadmin@mdw ~]$ strace -p 16695
Process 16695 attached - interrupt to quit
select(0, NULL, NULL, NULL, {36, 403000}
) = 0 (Timeout)
getppid()                               = 16687
lseek(12, 0, SEEK_END)                  = 32768
lseek(13, 0, SEEK_END)                  = 32768
clone(child_stack=0x2b2f68969230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f689699d0, tls=0x2b2f68969940, child_tidptr=0x2b2f689699d0) = 18012
clone(child_stack=0x2b2f689aa230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f689aa9d0, tls=0x2b2f689aa940, child_tidptr=0x2b2f689aa9d0) = 18013
clone(child_stack=0x2b2f689eb230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f689eb9d0, tls=0x2b2f689eb940, child_tidptr=0x2b2f689eb9d0) = 18014
clone(child_stack=0x2b2f68a2c230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68a2c9d0, tls=0x2b2f68a2c940, child_tidptr=0x2b2f68a2c9d0) = 18015
clone(child_stack=0x2b2f68a6d230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68a6d9d0, tls=0x2b2f68a6d940, child_tidptr=0x2b2f68a6d9d0) = 18016
clone(child_stack=0x2b2f68aae230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68aae9d0, tls=0x2b2f68aae940, child_tidptr=0x2b2f68aae9d0) = 18017
clone(child_stack=0x2b2f68aef230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68aef9d0, tls=0x2b2f68aef940, child_tidptr=0x2b2f68aef9d0) = 18018
clone(child_stack=0x2b2f68b30230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68b309d0, tls=0x2b2f68b30940, child_tidptr=0x2b2f68b309d0) = 18019
clone(child_stack=0x2b2f68b71230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68b719d0, tls=0x2b2f68b71940, child_tidptr=0x2b2f68b719d0) = 18020
clone(child_stack=0x2b2f68bb2230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68bb29d0, tls=0x2b2f68bb2940, child_tidptr=0x2b2f68bb29d0) = 18021
clone(child_stack=0x2b2f68bf3230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68bf39d0, tls=0x2b2f68bf3940, child_tidptr=0x2b2f68bf39d0) = 18022
clone(child_stack=0x2b2f68c34230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68c349d0, tls=0x2b2f68c34940, child_tidptr=0x2b2f68c349d0) = 18023
clone(child_stack=0x2b2f68c75230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68c759d0, tls=0x2b2f68c75940, child_tidptr=0x2b2f68c759d0) = 18024
clone(child_stack=0x2b2f68cb6230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68cb69d0, tls=0x2b2f68cb6940, child_tidptr=0x2b2f68cb69d0) = 18025
clone(child_stack=0x2b2f68cf7230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68cf79d0, tls=0x2b2f68cf7940, child_tidptr=0x2b2f68cf79d0) = 18026
clone(child_stack=0x2b2f68d38230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68d389d0, tls=0x2b2f68d38940, child_tidptr=0x2b2f68d389d0) = 18027
futex(0x2b2f689699d0, FUTEX_WAIT, 18012, NULL) = 0

-- The master logs then starts to emit the below information on its log after 20 seconds that it cant communicate to the primary segments.

2015-02-18 03:06:44.095779 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 21000 ms.",,,,,,,0,,,,
2015-02-18 03:06:45.095721 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 1 to probe segment (content=0, dbid=2).",,,,,,,0,,,,
2015-02-18 03:07:05.096777 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20001 ms.",,,,,,,0,,,,
2015-02-18 03:07:06.096726 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 2 to probe segment (content=0, dbid=2).",,,,,,,0,,,,
2015-02-18 03:07:26.097782 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20001 ms.",,,,,,,0,,,,
2015-02-18 03:07:27.097728 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 3 to probe segment (content=0, dbid=2).",,,,,,,0,,,,
2015-02-18 03:07:47.098792 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20001 ms.",,,,,,,0,,,,
2015-02-18 03:07:48.098726 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 4 to probe segment (content=0, dbid=2).",,,,,,,0,,,,
2015-02-18 03:08:09.098705 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20999 ms.",,,,,,,0,,,,
2015-02-18 03:08:09.098740 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) after trying 5 time(s), maximum number of retries reached.",,,,,,,0,,,,
2015-02-18 03:08:09.103340 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=2, content=0) from ('u','p') to ('d','m')",,,,,,,0,,"fts.c",1157,
2015-02-18 03:08:09.103407 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=6, content=0) from ('u','m') to ('u','p')",,,,,,,0,,"fts.c",1157,
2015-02-18 03:08:09.103504 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,x21745,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",976,
2015-02-18 03:08:09.123765 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: mirror (dbid=6) on sdw4:59340 taking over as primary in change-tracking mode.",,,,,,,0,,"ftsfilerep.c",492,
2015-02-18 03:08:09.130391 PST,,,p16695,th1758693696,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=6, content=0) has not reached new state yet, expected state: ('p','c'), reported state: ('m','s').",,,,,,,0,,,,
2015-02-18 03:08:09.248675 PST,,,p16688,th1470440560,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
Success:",,,,,,,,"SysLoggerMain","syslogger.c",552,

-- The ftsprobe retries 5 times (as seen in the above logs) as per parameter "gp_fts_probe_retries" (cannot be modified) before ultimately marking them down.

[gpadmin@mdw ~]$ gpconfig -s gp_fts_probe_retries;
Values on all segments are consistent
GUC          : gp_fts_probe_retries
Master  value: 5
Segment value: 5

-- After the timeout ends and the segments are marked down, the ftsprobe then updates the postmaster on the changes

flightdata=# select * from gp_segment_configuration where status='d';
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    2 |       0 | m    | p              | s    | d      | 43400 | sdw3     | sdw3    |            49340 |

and the remaining trace information of the ftsprobe after the timeout is

select(3, NULL, [2], NULL, {1, 0})      = 1 (out [2], left {1, 0})
write(2, "\0\0\0\0\240\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\t\0\0\0\0"..., 208) = 208
select(3, NULL, [2], NULL, {1, 0})      = 1 (out [2], left {1, 0})
write(2, "\0\0\0\0\240\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\t\0\0\0\0"..., 208) = 208
select(3, NULL, [2], NULL, {1, 0})      = 1 (out [2], left {1, 0})
write(2, "\0\0\0\0y\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\t\0\0\0\0"..., 169) = 169
brk(0x949b000)                          = 0x949b000
brk(0x9494000)                          = 0x9494000
open("base/10899/5095", O_RDWR)         = 15
read(15, "\0\0\0\0\310J\2\4\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768
read(15, "\0\0\0\0\210\211x\4\1\0\0\0\324\3\360p\360\177\4\200\340\377 \0\320\377 \0\300\377 \0"..., 32768) = 32768
open("base/10899/5094", O_RDWR)         = 16
read(16, "\0\0\0\0H\216\214\4\1\0\1\0\354\3\200K\0\200\4\200\310\377p\0\220\377p\0X\377p\0"..., 32768) = 32768
open("global/5006", O_RDWR)             = 17
lseek(17, 0, SEEK_END)                  = 32768
open("global/6107", O_RDWR)             = 18
read(18, "\0\0\0\0\340&\216\4\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768
read(18, "\0\0\0\0H\222\372\20\1\0\0\0<\0P\177\360\177\4\200\340\377 \0\320\377 \0\260\377 \0"..., 32768) = 32768
brk(0x94b5000)                          = 0x94b5000
brk(0x94ae000)                          = 0x94ae000
open("base/10899/2678", O_RDWR)         = 19
read(19, "\0\0\0\0 \0\0\0\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768
read(19, "\0\0\0\0\220vr\4\1\0\0\0008\2`w\360\177\4\200\340\377 \0\320\377 \0\300\377 \0"..., 32768) = 32768
open("global/6106", O_RDWR)             = 20
read(20, "\0\0\0\0@&\216\4\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768
read(20, "\0\0\0\0\0\221\372\20\1\0\0\0<\0P\177\360\177\4\200P\377 \0\340\377 \0\300\377 \0"..., 32768) = 32768
open("pg_xlog/000000010000000000000005", O_RDWR) = 21
lseek(21, 40337408, SEEK_SET)           = 40337408
write(21, "^\320\1\0\1\0\0\0\0\0\0\0\0\200g\26\364\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
fsync(21)                               = 0
kill(16730, SIGUSR1)                    = 0
getppid()                               = 16687
read(3, 0x7fff10ca6300, 16)             = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=-1}], 2, -1) = -1 EINTR (Interrupted system call)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
write(4, "\0", 1)                       = 1
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
read(3, "\0", 16)                       = 1
semctl(390955089, 7, SETVAL, 0)         = 0
select(3, NULL, [2], NULL, {1, 0})      = 1 (out [2], left {1, 0})
write(2, "\0\0\0\0\254\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\0\0\0\0\0"..., 220) = 220
rt_sigaction(SIGINT, {0x1, [], SA_RESTORER, 0x3bd50302d0}, {0x784f10, [], SA_RESTORER|SA_RESTART, 0x3bd5c0eca0}, 8) = 0
rt_sigaction(SIGQUIT, {0x1, [], SA_RESTORER, 0x3bd50302d0}, {0x95c5a0, [], SA_RESTORER|SA_RESTART, 0x3bd5c0eca0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fff10ca54b8) = 18211
wait4(18211, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 18211
rt_sigaction(SIGINT, {0x784f10, [], SA_RESTORER|SA_RESTART, 0x3bd50302d0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {0x95c5a0, [], SA_RESTORER|SA_RESTART, 0x3bd50302d0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

-- The primary segment that was stopped from communication was continued and recovered using gprecoverseg.

[gpadmin@sdw3 ~]$ kill -s SIGCONT 10769
[gpadmin@mdw ~]$ gprecoverseg -a

Conclusion

When the master log reports segments being down due to timeout expiration , this could be due to non-communication of the primary postgres process to the ftsprobe pings which might be due to some of the reason below.

-- Network or server not reachable.

-- Server high load which delay the response ( Check Sar , IOSTAT etc )

-- The postmaster or the primary segments process is stuck or unresponsive due to sudden termination of the child process or anything related ( check the primary logs for more information like OOM, PANIC, FATAL etc )

-- or other factors.

Comments

Powered by Zendesk