Pivotal Knowledge Base

Follow

How to debug Light Weight Locking issues in GPDB

Environment

Product Version
Pivotal Greenplum (GPDB) 4.3.x
OS All

Purpose

In more recent versions of GPDB (> 4.3.5.0) we've noticed an increase in the number of hung systems where various GPDB processes are waiting on a Light Weight Lock (LWLock). Debugging these systems can be difficult and requires some additional information beyond what is required in How to RCA a Hung Query.

LWLocks are not the same as database locks. LWLocks is the locking implementation used by Greenplum (and Postgres) and are used to lock memory structures when a process needs to make modifications. The locking only happens between Greenplum processes associated with a master or specific segment database. That means that there will never be a lock from a segment waiting on a lock from another segment. So when troubleshooting you should only be concerned with the specific master or segment instance where the lock is held.

Cause

This debugging is going to be typically needed when you run pstack against a number of processes and they all appear to be in the following functions:

  • pg_usleep()
  • s_lock()
  • LWLockAcquire() or LWLockRelease()

Here is an example:

Thread 1 (Thread 0x7f3280351720 (LWP 10536)):
#0  0x0000003a1520f5db in raise () from /data/logs/17943/core/lib64/libpthread.so.0
#1  0x0000000000ac664a in StandardHandlerForSigillSigsegvSigbus_OnMainThread (processName=<value optimized out>, postgres_signal_arg=11) at elog.c:4488
#2  <signal handler called>
#3  0x0000003a14ee15e3 in select () from /data/logs/17943/core/lib64/libc.so.6
#4  0x0000000000cbce3e in pg_usleep (microsec=<value optimized out>) at pgsleep.c:43
#5  0x000000000094975a in s_lock (lock=0x7f325bd3a500 "\001\001", file=0x0, line=<value optimized out>) at s_lock.c:114
#6  0x0000000000948b77 in LWLockAcquire (lockid=MirroredLock, mode=<value optimized out>) at lwlock.c:470
#7  0x00000000004d0fc7 in heap_release_fetch (relation=<value optimized out>, snapshot=<value optimized out>, tuple=0x7f32594b9d38, userbuf=0x7f32594b9d50, keep_buf=<value optimized out>, stats_relation=<value optimized out>) at heapam.c:1763
#8  0x00000000004e3506 in index_getnext (scan=0x7f32594b9cf0, direction=<value optimized out>) at indexam.c:478
#9  0x00000000004e2b21 in systable_getnext (sysscan=0x7f32594b9bc8) at genam.c:246
#10 0x00000000005c09c1 in caql_getnext (pCtx=0x7fff74125790) at catquery.c:19007
#11 0x0000000000aa2a82 in RelationBuildTupleDesc (relation=0x7f3258648400) at relcache.c:928
#12 0x0000000000aa7f5e in RelationBuildDesc (targetRelId=<value optimized out>, insertIt=<value optimized out>) at relcache.c:1312
#13 0x0000000000aa8de5 in RelationIdGetRelation (relationId=17080817) at relcache.c:1965
#14 0x00000000004d49f6 in relation_open (relationId=17080817, lockmode=0) at heapam.c:880
#15 heap_open (relationId=17080817, lockmode=0) at heapam.c:1283
#16 0x000000000087f707 in expand_inherited_rtentry (root=<value optimized out>) at prepunion.c:913
#17 expand_inherited_tables (root=<value optimized out>) at prepunion.c:793
#18 0x0000000000861828 in subquery_planner (parse=0x7f327a3b6d38, cursorOptions=<value optimized out>, boundParams=<value optimized out>) at planner.c:651
#19 standard_planner (parse=0x7f327a3b6d38, cursorOptions=<value optimized out>, boundParams=<value optimized out>) at planner.c:405
#20 0x0000000000862a3e in planner (parse=0x19e2360, cursorOptions=0, boundParams=0x0) at planner.c:305
#21 0x000000000095cad9 in pg_plan_query (query_string=0x19d46ce "select round(sum(long)::numeric/count(*)*100,1) as long, round(sum(failed)::numeric/count(*)*100,1) as failed from (select case when request_duration > 1000 then 1 else 0 end as long, case when reques"..., stmt_name=<value optimized out>, paramTypes=0x0, numParams=0) at postgres.c:809
#22 pg_plan_queries (query_string=0x19d46ce "select round(sum(long)::numeric/count(*)*100,1) as long, round(sum(failed)::numeric/count(*)*100,1) as failed from (select case when request_duration > 1000 then 1 else 0 end as long, case when reques"..., stmt_name=<value optimized out>, paramTypes=0x0, numParams=0) at postgres.c:882
#23 exec_parse_message (query_string=0x19d46ce "select round(sum(long)::numeric/count(*)*100,1) as long, round(sum(failed)::numeric/count(*)*100,1) as failed from (select case when request_duration > 1000 then 1 else 0 end as long, case when reques"..., stmt_name=<value optimized out>, paramTypes=0x0, numParams=0) at postgres.c:2042
#24 0x000000000095dbca in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x18d72f0 "dwh", username=<value optimized out>) at postgres.c:4961
#25 0x00000000008ba85e in BackendRun () at postmaster.c:6987
#26 BackendStartup () at postmaster.c:6680
#27 ServerLoop () at postmaster.c:2481
#28 0x00000000008bd5f0 in PostmasterMain (argc=18, argv=0x188fd28) at postmaster.c:1558
#29 0x00000000007bd60f in main (argc=18, argv=0x188fca0) at main.c:206

Procedure

When you suspect a problem with LWLocks, please perform the following:

  1. Generate a core file of any process waiting on locks and then collect it using packcore:
    gcore <PID>
  2. In a empty directory, run this script to dump stacks and held locks for all postgres processes on the server.
    #!/bin/bash
    
    unset PYTHONHOME
    unset LD_LIBRARY_PATH
    
    GDB=/usr/bin/gdb
    
    ps -o state,uid,pid,ppid,c,pri,ni,rss,sz,wchan=WIDE-WCHAN-COLUMN,stime,tty,time,cmd -Cpostgres > dumps.ps
    
    for proc in $(ps -flyCpostgres --no-heading | awk '{print $3}')
    do
        $GDB --batch -ex "print held_lwlocks" -ex "thread apply all bt" /proc/$proc/exe $proc > dumps.$proc
        echo "" >> dumps.$proc
        cat /proc/$proc/stack >> dumps.$proc
    done
  3. If you have the knowledge to debug the above outputs then make an attempt to debug. If not then the process which is waiting on the lock should be terminated using usual methods.
  4. Collect all the above logs and contact an SME for assistance in debugging the logs.

Additional Information

For the core file, there is a GDB macro available at:

  • https://github.com/pivotal-mike/gdb_macros/tree/master/lwlocks

Download the lwlocks.py and then execute the following inside gdb:

(gdb) source /path/to/lwlocks.py
(gdb) dump-lwlock-array
NullLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
BufFreelistLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ShmemIndexLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
OidGenLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
XidGenLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ProcArrayLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
SInvalReadLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
SInvalWriteLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
FreeSpaceLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
WALInsertLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
WALWriteLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ControlFileLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
CheckpointLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
CheckpointStartLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
CLogControlLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
SubtransControlLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
MultiXactGenLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
MultiXactOffsetControlLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
MultiXactMemberControlLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
RelCacheInitLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
BgWriterCommLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
TwoPhaseStateLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
TablespaceCreateLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
BtreeVacuumLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
AddinShmemInitLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
AutovacuumLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
AutovacuumScheduleLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
SharedSnapshotLock: mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
DistributedLogControlLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
SeqServerControlLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
AOSegFileLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
PersistentObjLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
FileRepShmemLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
FileRepAckShmemLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
FileRepAckHashShmemLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ChangeTrackingTransitionLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ChangeTrackingWriteLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ChangeTrackingCompactLock:  mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
MirroredLock:   mutex = 1 '\001', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 37, exclusivePid = 0
ResQueueLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
FileRepAppendOnlyCommitCountLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
SyncRepLock:    mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
ErrorLogLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
FirstWorkfileMgrLock:   mutex = 0 '\000', releaseOK = 1 '\001', exclusive = 0 '\000', shared = 0, exclusivePid = 0
(gdb)

 

Comments

Powered by Zendesk