Pivotal Knowledge Base

Frequent segment failure due to Out-of-Memory issue in Pivotal Greenplum

Environment

Product: Pivotal Greenplum (GPDB)
Version: 4.3.x
OS: RHEL 6.x

Symptom

Segments in the GPDB cluster went down frequently and at random, even after being recovered with gprecoverseg. In addition, some queries that had previously run normally began failing with an "Out of memory" error.

Error Message:

1. Queries failed with an "Out of memory" error, indicating that no memory was available at the OS level:

2016-09-06 11:40:41.836684 PDT,"gpadmin","ewuat",p7541,th1859113072,"10.96.2.181","51250",2016-09-06 11:35:39 PDT,261298086,con29,cmd487,seg-1,,dx458,x261298086,sx1,"ERROR","53400","Out of memory (seg42 slice4 sdw3:40002 pid=11087)","VM protect failed to allocate 262144 bytes from system, VM Protect 8187 MB available",,,,,"SELECT
2016-09-06 11:52:14.249843 PDT,"asm_bo_user","ewuat",p9689,th1859113072,"10.96.21.195","61099",2016-09-06 11:49:01 PDT,261298270,con99,cmd2,seg-1,,dx553,x261298270,sx1,"ERROR","53400","Out of memory (seg44 slice2 sdw3:40004 pid=13833)","VM protect failed to allocate 262144 bytes from system, VM Protect 8173 MB available",,,,,"SELECT

2. Segments went down because they could not allocate memory from the OS to fork a new process for an incoming connection:

2016-09-07 08:46:21.561702 PDT,,,p30270,th1231917168,,,,0,,,seg-1,,,,,"LOG","00000","could not fork new process for connection: Cannot allocate memory",,,,,,,0,,"postmaster.c",6703,
2016-09-07 08:46:22.561996 PDT,,,p30270,th1231917168,,,,0,,,seg-1,,,,,"LOG","00000","could not fork new process for connection: Cannot allocate memory",,,,,,,0,,"postmaster.c",6703,
2016-09-07 08:46:23.561929 PDT,,,p30270,th1231917168,,,,0,,,seg-1,,,,,"LOG","00000","could not fork new process for connection: Cannot allocate memory",,,,,,,0,,"postmaster.c",6703,

Cause

Virtual memory on the segment hosts was almost exhausted: the committed virtual memory had nearly reached the kernel's overcommit limit.
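With vm.overcommit_memory=2 (the strict accounting mode), the kernel caps committed virtual memory at CommitLimit = SwapTotal + MemTotal × vm.overcommit_ratio / 100. As a rough sanity check, the CommitLimit reported later in this article can be reproduced from the memory figures in the top output below, assuming the default overcommit_ratio of 50:

```shell
# Sketch: reproduce CommitLimit from this host's memory figures, assuming
# vm.overcommit_memory=2 and the default vm.overcommit_ratio of 50.
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
mem_total_kb=49316564     # "Mem: 49316564k total" (top output below)
swap_total_kb=50339636    # "Swap: 50339636k total"
overcommit_ratio=50       # assumed default; verify with: cat /proc/sys/vm/overcommit_ratio
awk -v m="$mem_total_kb" -v s="$swap_total_kb" -v r="$overcommit_ratio" \
    'BEGIN { printf "CommitLimit ~= %d kB\n", s + m * r / 100 }'
# -> CommitLimit ~= 74997918 kB, within a few kB of the 74997916 kB reported below
```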

RCA

1. The top command showed very little free memory on the segment host:

Tasks: 763 total, 1 running, 762 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49316564k total, 49051460k used, 265104k free, 216440k buffers
Swap: 50339636k total, 2124012k used, 48215624k free, 43489524k cached

PID   USER    PR NI VIRT RES  SHR  S %CPU %MEM TIME+   COMMAND
25173 gpadmin 17 0 790m 404m 920 S 0.0 0.8 0:05.19 gpfdist
25144 gpadmin 17 0 790m 341m 920 S 0.0 0.7 0:05.37 gpfdist
25194 gpadmin 17 0 790m 314m 920 S 0.0 0.7 0:05.55 gpfdist
25105 gpadmin 15 0 790m 299m 1008 S 0.0 0.6 0:05.85 gpfdist
......

2. Many gpfdist processes were consuming large amounts of memory on every segment host (the output below was gathered across all segment hosts via gpssh; `=>` is the gpssh prompt):

=> ps -ef|grep gpfdist |grep -v grep |wc -l
[sdw12] 146
[sdw10] 146
[sdw11] 146
[ sdw4] 146
[ sdw5] 146
[ sdw6] 146
[ sdw7] 146
[ sdw1] 146
[ sdw2] 146
[ sdw3] 146
[ sdw8] 146
[ sdw9] 146
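The top output above shows each gpfdist process holding roughly 300-400 MB resident (and about 790 MB virtual); with ~146 such processes per host, that alone accounts for tens of GB of committed memory. A quick back-of-the-envelope sum over the four RES figures visible in the top output (the ps pipeline in the comment is what you would run on a live host):

```shell
# Sketch: sum the RES figures (in MB) of the four gpfdist processes shown in
# the top output above. On a live host the same idea is:
#   ps -C gpfdist -o rss= | awk '{ s += $1 } END { print s " kB resident" }'
printf '404\n341\n314\n299\n' |
  awk '{ total += $1; n++ } END { printf "%d MB resident in %d gpfdist processes\n", total, n }'
# -> 1358 MB resident in 4 gpfdist processes
```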

3. Statistics in /proc/meminfo showed that the committed virtual memory (Committed_AS) had nearly reached the limit (CommitLimit) on all segment hosts:

=> grep Commit /proc/meminfo
[sdw12] CommitLimit: 74997916 kB
[sdw12] Committed_AS: 65318500 kB
[sdw10] CommitLimit: 74997916 kB
[sdw10] Committed_AS: 63471420 kB
[sdw11] CommitLimit: 74973820 kB
[sdw11] Committed_AS: 63722700 kB
[ sdw4] CommitLimit: 74997916 kB
[ sdw4] Committed_AS: 73539588 kB
[ sdw5] CommitLimit: 74997916 kB
[ sdw5] Committed_AS: 71457756 kB
[ sdw6] CommitLimit: 74997916 kB
[ sdw6] Committed_AS: 69150720 kB
[ sdw7] CommitLimit: 74997916 kB
[ sdw7] Committed_AS: 71330908 kB
[ sdw1] CommitLimit: 74997916 kB
[ sdw1] Committed_AS: 64941628 kB
[ sdw2] CommitLimit: 74997916 kB
[ sdw2] Committed_AS: 72657840 kB
[ sdw3] CommitLimit: 74997916 kB
[ sdw3] Committed_AS: 73618052 kB
[ sdw8] CommitLimit: 74997916 kB
[ sdw8] Committed_AS: 66075064 kB
[ sdw9] CommitLimit: 74997916 kB
[ sdw9] Committed_AS: 65577276 kB
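Paired up, these figures translate to roughly 85-98% of the commit limit already in use on every host. A small awk filter can compute the percentage directly from such output; the two hosts below are copied from the figures above (the space padding inside the brackets, e.g. "[ sdw3]", is stripped here so the awk fields line up):

```shell
# Sketch: compute per-host commit usage from "grep Commit /proc/meminfo" output
# gathered across hosts. Input lines look like "[host] CommitLimit: NNN kB";
# each Committed_AS line is divided by the preceding CommitLimit for its host.
awk '$2 == "CommitLimit:"  { limit[$1] = $3 }
     $2 == "Committed_AS:" { printf "%s %.1f%% of CommitLimit committed\n", $1, 100 * $3 / limit[$1] }' <<'EOF'
[sdw12] CommitLimit: 74997916 kB
[sdw12] Committed_AS: 65318500 kB
[sdw3] CommitLimit: 74997916 kB
[sdw3] Committed_AS: 73618052 kB
EOF
# -> [sdw12] 87.1% of CommitLimit committed
# -> [sdw3] 98.2% of CommitLimit committed
```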

Resolution

All of the stale gpfdist processes were killed manually to return memory to the OS. Afterwards, Committed_AS dropped dramatically on every host:

=> ps -ef|grep gpfdist |grep -v grep |awk '{print $2}' |xargs kill
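A slightly more targeted variant (our suggestion, not from the original incident) is to match the process name exactly and restrict the signal to processes owned by gpadmin, then confirm nothing is left:

```shell
# Sketch (an assumed variant, not the command used in the original incident):
# list gpfdist processes owned by gpadmin, signal only those, then verify.
pgrep -u gpadmin -x gpfdist && pkill -u gpadmin -x gpfdist
sleep 2
pgrep -x gpfdist >/dev/null || echo "no gpfdist processes left"
```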

=> grep Commit /proc/meminfo
[sdw12] CommitLimit: 74997916 kB
[sdw12] Committed_AS: 7286752 kB
[sdw10] CommitLimit: 74997916 kB
[sdw10] Committed_AS: 7263204 kB
[sdw11] CommitLimit: 74973820 kB
[sdw11] Committed_AS: 7261756 kB
[ sdw4] CommitLimit: 74997916 kB
[ sdw4] Committed_AS: 7736700 kB
[ sdw5] CommitLimit: 74997916 kB
[ sdw5] Committed_AS: 7755148 kB
[ sdw6] CommitLimit: 74997916 kB
[ sdw6] Committed_AS: 7702664 kB
[ sdw7] CommitLimit: 74997916 kB
[ sdw7] Committed_AS: 7781224 kB
[ sdw1] CommitLimit: 74997916 kB
[ sdw1] Committed_AS: 7694444 kB
[ sdw2] CommitLimit: 74997916 kB
[ sdw2] Committed_AS: 7800816 kB
[ sdw3] CommitLimit: 74997916 kB
[ sdw3] Committed_AS: 7708916 kB
[ sdw8] CommitLimit: 74997916 kB
[ sdw8] Committed_AS: 7764884 kB
[ sdw9] CommitLimit: 74997916 kB
[ sdw9] Committed_AS: 7273256 kB
