Pivotal Knowledge Base

Greenplum Cluster Appears to be Frozen or Hung

Environment

Pivotal Greenplum (GPDB) < 4.3.9.x

RHEL 6.x

Overview

The Greenplum cluster appears to be frozen or hung, although sessions are still making slow progress.

Symptom

  • All Greenplum sessions are extremely slow, but not hung.
  • There is a high number of concurrent sessions (more than 100).
  • CPU utilization is very high.
  • PostgreSQL processes consume all available CPU, creating a severe performance bottleneck that brings the system to a near halt.
  • The system is not actually hung. Normal processing resumes after some time, once all sessions have completed their locking work.

Checklist

  • pstack output is similar to the examples below.
  • The stacks consistently contain LWLockAcquire or LWLockRelease.

These are the in-memory access control locks (LWLocks and spinlocks, seen as s_lock), not to be confused with the locks reported in pg_locks.

Thread 1 (Thread 0x7f0488d9c720 (LWP 26718)):
#0 0x0000003a14ee15e3 in select () from /lib64/libc.so.6
#1 0x0000000000cbce3e in pg_usleep ()
#2 0x000000000094975a in s_lock ()
#3 0x0000000000948b77 in LWLockAcquire ()

Thread 1 (Thread 0x7f0488d9c720 (LWP 26575)):
#0 0x0000000000949726 in s_lock ()
#1 0x00000000009486b2 in LWLockRelease ()

Thread 1 (Thread 0x7f0488d9c720 (LWP 26720)):
#0 0x0000003a14eeb197 in semop () from /lib64/libc.so.6
#1 0x000000000089bd4f in PGSemaphoreLock ()
#2 0x00000000009489f0 in LWLockAcquire ()
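
The checklist above can be applied to many captures at once. The following is a minimal Python sketch, not an official tool: it assumes each pstack capture is available as a text string, and the helper names (frames, classify, summarize) are hypothetical. It labels each capture by the first LWLockAcquire or LWLockRelease frame it contains.

```python
import re
from collections import Counter

# Function names taken from the example stacks in this article.
LOCK_FRAMES = ("LWLockAcquire", "LWLockRelease")

# Matches one pstack frame, e.g. "#3 0x0000000000948b77 in LWLockAcquire ()".
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\w+)\s*\(")


def frames(pstack_text):
    """Return the function names found in one capture, innermost frame first."""
    return FRAME_RE.findall(pstack_text)


def classify(pstack_text):
    """Label a capture by its first lightweight-lock frame, or 'other'."""
    for name in frames(pstack_text):
        if name in LOCK_FRAMES:
            return name
    return "other"


def summarize(captures):
    """Count how many captures fall under each label."""
    return Counter(classify(text) for text in captures)
```

If most captures are labelled LWLockAcquire or LWLockRelease rather than other, the cluster matches this checklist item.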

  • strace output from Greenplum processes shows many waits on semaphores (semop):
semop(1233944620, {{1, 1, 0}}, 1) = 0 <0.000010>
brk(0x8404000) = 0x8404000 <0.000009>
semop(1233944620, {{0, -1, 0}}, 1) = 0 <0.124067>
semop(1234108465, {{14, 1, 0}}, 1) = 0 <0.000011>
semop(1233944620, {{9, 1, 0}}, 1) = 0 <0.000011>
semop(1234042927, {{3, 1, 0}}, 1) = 0 <0.000013>
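
To judge how much time a process spends in semop, the per-call timings in the strace output can be totalled. The sketch below is illustrative only and assumes the output was captured with per-call timing enabled (the trailing <seconds> field, as in the sample above); the function name syscall_totals is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "semop(1233944620, {{0, -1, 0}}, 1) = 0 <0.124067>".
# An optional "[pid N]" prefix is tolerated; lines without a timing are skipped.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$")


def syscall_totals(strace_lines):
    """Return {syscall: (call_count, total_seconds)} for the given lines."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in strace_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a completed, timed system call
        name, seconds = match.group(1), float(match.group(2))
        totals[name][0] += 1
        totals[name][1] += seconds
    return {name: (count, total) for name, (count, total) in totals.items()}
```

In the sample above, almost all of the elapsed time belongs to the single semop call whose operation value is -1, which is the call that waits on the semaphore.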

Cause

A known bug that can be triggered by high concurrency (more than 30 sessions). It has been fixed in Greenplum 4.3.9.1 and later.

RCA

  • Run pstack on all processes once every 30 seconds for 2 minutes.
  • Most stacks will be in the functions described under Checklist.
  • strace will constantly show semop() calls.
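
The sampling step can be scripted. The following Python sketch is one possible approach under stated assumptions: it is meant to run on each host, pstack and pgrep are on the PATH, the Greenplum backends appear under the process name postgres, and "every 30 seconds for 2 minutes" is read as four rounds at 0, 30, 60 and 90 seconds. The function names are hypothetical.

```python
import subprocess
import time


def sample_offsets(duration_s=120, interval_s=30):
    """Start times of each sampling round, in seconds from the first round."""
    return list(range(0, duration_s, interval_s))


def postgres_pids():
    """PIDs of processes named exactly 'postgres' (assumed backend name)."""
    result = subprocess.run(["pgrep", "-x", "postgres"],
                            capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def capture(pid):
    """Return the pstack output for one process as text."""
    result = subprocess.run(["pstack", str(pid)],
                            capture_output=True, text=True)
    return result.stdout


def collect(duration_s=120, interval_s=30,
            list_pids=postgres_pids, dump=capture, sleep=time.sleep):
    """Run one capture round per offset; return a list of {pid: stack_text}."""
    offsets = sample_offsets(duration_s, interval_s)
    rounds = []
    for index in range(len(offsets)):
        rounds.append({pid: dump(pid) for pid in list_pids()})
        if index < len(offsets) - 1:
            sleep(interval_s)  # wait before the next round
    return rounds
```

The interval is measured from the end of one round to the start of the next, so the rounds drift slightly when many processes are dumped. The list_pids, dump and sleep parameters exist only so the loop can be exercised without a live cluster.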

Resolution

Throttle the workload to reduce concurrency. This may free up CPU so that the cluster starts responding again.

For a permanent fix, upgrade to Greenplum 4.3.9.1 or later.
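
Whether a cluster already has the fix can be checked from its version string. The sketch below is illustrative and assumes the string returned by SELECT version() contains the text "Greenplum Database" followed by a dotted version number; the sample strings and function names are hypothetical.

```python
import re

# First release that contains the fix, according to this article.
FIXED_IN = (4, 3, 9, 1)


def gpdb_version(version_string):
    """Extract the Greenplum version as a tuple of ints, or None if absent."""
    match = re.search(r"Greenplum Database (\d+(?:\.\d+)*)", version_string)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(version_string):
    """True when the reported version is 4.3.9.1 or later."""
    version = gpdb_version(version_string)
    return version is not None and version >= FIXED_IN


if __name__ == "__main__":
    sample = "PostgreSQL 8.2.15 (Greenplum Database 4.3.8.1 build 1)"
    print(gpdb_version(sample), has_fix(sample))
```

An unrecognized version string is treated as not having the fix, which errs on the side of prompting a manual check.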
