Pivotal Knowledge Base

Kernel bug encountered when running Pivotal Greenplum on RHEL 7

Environment

Product                     Version
Pivotal Greenplum (GPDB)    4.3.x
Operating System            RHEL 7.2

Symptom

Greenplum installations on RHEL 7.2 are currently affected by a Linux kernel defect in the most recent GA kernel for that release (3.10.0-327). The defect causes activities within GPDB (most likely query operations) to fail occasionally: the affected operation locks up intermittently because of network timeouts. The lockup surfaces inside the GNU C library (glibc), although the root cause lies in the kernel.
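As a quick way to see whether a host runs the kernel build named above, the check can be sketched in Python. This is a minimal sketch, not an official detection tool: the helper name `is_affected_kernel` is hypothetical, and treating every release string that starts with 3.10.0-327 as affected is an assumption. This article does not say whether later errata builds in the same series contain a fix.

```python
import platform

# Kernel build named in this article. Treating this prefix as the only
# affected build is an assumption of the sketch, not a vendor statement.
AFFECTED_PREFIX = "3.10.0-327"


def is_affected_kernel(release: str) -> bool:
    """Return True if the kernel release string matches the build named above."""
    if not release.startswith(AFFECTED_PREFIX):
        return False
    # Guard against longer build numbers such as 3.10.0-3270.
    rest = release[len(AFFECTED_PREFIX):]
    return rest == "" or not rest[0].isdigit()


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    verdict = "matches" if is_affected_kernel(release) else "does not match"
    print(f"{release}: {verdict} {AFFECTED_PREFIX}")
```

Running the script on each segment host gives a first indication of which hosts to look at more closely.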

The problem surfaces when one of the segment processes gets stuck while processing the query. strace on the process will show its main thread blocked in a recvmsg call (the interconnect receive thread sits in poll), and pstack will show that the recvmsg was reached through a call to getaddrinfo.

strace of a process stuck on recvmsg:

[root@grnplmprd-05 ~]# strace -p 4711
Process 4711 attached
recvmsg(412,
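Where attaching strace is not practical, the same information can usually be read from /proc/<pid>/syscall, which on Linux reports the system call a task is currently blocked in. The following is a minimal sketch under stated assumptions: the helper name `parse_syscall_line` is hypothetical, the syscall numbers in the table are the x86_64 ones (7 for poll, 47 for recvmsg), and reading the file generally requires root or ownership of the process.

```python
import sys

# x86_64 syscall numbers for the two calls that appear in this article.
# Other architectures use different numbers, so this table assumes x86_64.
SYSCALL_NAMES = {7: "poll", 47: "recvmsg"}


def parse_syscall_line(line: str):
    """Parse the content of /proc/<pid>/syscall.

    Returns (syscall name or number, first argument), or None when the
    task is not blocked inside a system call.
    """
    fields = line.split()
    if not fields or fields[0] in ("running", "-1"):
        return None
    number = int(fields[0])
    # Arguments are printed in hexadecimal; the first one is the file
    # descriptor for calls such as recvmsg.
    first_arg = int(fields[1], 16) if len(fields) > 1 else None
    return SYSCALL_NAMES.get(number, number), first_arg


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(f"/proc/{sys.argv[1]}/syscall") as fh:
        print(parse_syscall_line(fh.read()))
```

For a process in the state shown above, the expectation is a recvmsg entry whose first argument is the same descriptor that strace reports.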

gdb will show the following trace for the process:

Thread 2 (Thread 0x7f1f03b26700 (LWP 4712)):
#0 0x00007f1eff58a69d in poll () from /lib64/libc.so.6
#1 0x0000000000bbf874 in rxThreadFunc (arg=<optimized out>) at ic_udpifc.c:6277
#2 0x00007f1f0008bdc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1eff594ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f1f03bd8740 (LWP 4711)):
#0 0x00007f1eff595c5d in recvmsg () from /lib64/libc.so.6
#1 0x00007f1eff5b84cd in make_request () from /lib64/libc.so.6
#2 0x00007f1eff5b89c4 in __check_pf () from /lib64/libc.so.6
#3 0x00007f1eff57ea89 in getaddrinfo () from /lib64/libc.so.6
#4 0x0000000000bb9c81 in setupUDPListeningSocket (listenerSocketFd=0x2599a34, listenerPort=0x7fffcffb270c, txFamily=<optimized out>) at ic_udpifc.c:1231
#5 0x0000000000bbe8dd in startOutgoingUDPConnections (pOutgoingCount=<optimized out>, sendSlice=<optimized out>, transportStates=<optimized out>) at ic_udpifc.c:2987
#6 SetupUDPIFCInterconnect_Internal (estate=<optimized out>) at ic_udpifc.c:3460
#7 SetupUDPIFCInterconnect (estate=<optimized out>) at ic_udpifc.c:3521
#8 0x00000000007548fa in ExecutorStart (queryDesc=<optimized out>, eflags=<optimized out>) at execMain.c:517
#9 0x000000000099de15 in ProcessQuery (portal=<optimized out>, stmt=0x251a300, params=<optimized out>, dest=<optimized out>, completionTag=<optimized out>) at pquery.c:282
#10 0x000000000099ed19 in PortalRunMulti (portal=0x252c520, isTopLevel=1 '\001', dest=<optimized out>, altdest=<optimized out>, completionTag=0x7fffcffb2c90 "") at pquery.c:1603
#11 0x00000000009a0515 in PortalRun (portal=<optimized out>, count=<optimized out>, isTopLevel=0 '\000', dest=<optimized out>, altdest=<optimized out>, completionTag=<optimized out>) at pquery.c:1125
#12 0x00000000009994f3 in exec_mpp_query (localSlice=<optimized out>, seqServerPort=<optimized out>, seqServerHost=<optimized out>, serializedSliceInfolen=<optimized out>, serializedSliceInfo=<optimized out>, serializedParamslen=<optimized out>, serializedParams=<optimized out>, serializedPlantreelen=<optimized out>, serializedPlantree=<optimized out>, serializedQuerytreelen=<optimized out>, serializedQuerytree=<optimized out>, query_string=<optimized out>) at postgres.c:1358
#13 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x2390788 "dsrprd", username=<optimized out>) at postgres.c:4905
#14 0x00000000008f7eae in BackendRun (port=<optimized out>) at postmaster.c:6963
#15 BackendStartup (port=<optimized out>) at postmaster.c:6658
#16 ServerLoop () at postmaster.c:2464
#17 0x00000000008fac30 in PostmasterMain (argc=15, argv=0x234b4a0) at postmaster.c:1540
#18 0x00000000007fccaf in main (argc=15, argv=0x234b430) at main.c:206
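When many segment processes need to be checked, the trace above can be matched mechanically. The sketch below looks for the four innermost frames of Thread 1 (recvmsg, make_request, __check_pf, getaddrinfo) in saved gdb or pstack output. The function names `frame_functions` and `has_hang_signature` are hypothetical, and treating these four consecutive frames as the signature of this issue is an assumption drawn only from the single trace in this article.

```python
import re

# Innermost-first frames from Thread 1 in the trace above.
HANG_SIGNATURE = ["recvmsg", "make_request", "__check_pf", "getaddrinfo"]

# Matches frame lines with an address, e.g.
#   "#3 0x00007f1eff57ea89 in getaddrinfo () from /lib64/libc.so.6"
# and frames without one, e.g.
#   "#16 ServerLoop () at postmaster.c:2464"
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def frame_functions(backtrace: str):
    """Return the function names of all frame lines, in order of appearance."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def has_hang_signature(backtrace: str) -> bool:
    """Return True if the signature frames appear consecutively in the trace."""
    names = frame_functions(backtrace)
    n = len(HANG_SIGNATURE)
    return any(names[i:i + n] == HANG_SIGNATURE
               for i in range(len(names) - n + 1))
```

A match only indicates that the process was inside getaddrinfo waiting on recvmsg when the trace was taken; repeating the capture a few seconds apart helps separate a real hang from a call that was merely in progress.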

Cause

The cause of the problem has been narrowed down to a kernel bug.

Resolution

The underlying issue cannot be fixed in GPDB; only the kernel provider can fix it. The fix is planned for release in RHEL 7.3. Until then, users should contact Red Hat to obtain a fix.

Additional Information 

For further information, please refer to the following resources: 

GPDB documentation

GPDB Open source discussion 

Red Hat Bugzilla Discussion

Kernel Bug Discussion