Pivotal Knowledge Base

What to collect to RCA a hung query?

Overview

Often it is necessary to terminate a query that is 'hung': a session that has either been running for an excessively long period of time or one that, though still active, is no longer making progress. Before terminating the processes associated with the query, use the following steps to collect the materials needed, in most cases, to provide an RCA.

Note: These steps are now automated in the analyze session utility.

What to collect

1. Basic DB Queries

  1. List of all running queries:
    select * from pg_stat_activity;
    
  2. List of locks on master & all segments:
    select gp_segment_id, * from pg_locks;
    select * from gp_toolkit.gp_locks_on_relation;
    select * from gp_toolkit.gp_locks_on_resqueue;
    
  3. Dump current status of resource queues:
    select * from gp_toolkit.gp_resq_activity;
    select * from gp_toolkit.gp_resqueue_status;
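
  In Greenplum, the session ID (SESS_ID) referenced in the following steps is reported in the sess_id column of pg_stat_activity. To preserve these results for the RCA, the queries can also be written to files with psql. A minimal sketch, assuming the postgres database and illustrative output file names:
    psql -d postgres -c "select * from pg_stat_activity;" -o pg_stat_activity.out
    psql -d postgres -c "select gp_segment_id, * from pg_locks;" -o pg_locks.out
    psql -d postgres -c "select * from gp_toolkit.gp_resqueue_status;" -o gp_resqueue_status.out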
    

2. Basic process information

On all hosts (master and segments) we need to collect basic process information. In the commands below, SESS_ID is the session ID of the hung session; collect the following output:

  1. List of all processes running on all segments:
    gpssh -f hostfile 'ps -elfy'
    
  2. List of all postgres processes running on all nodes:
    gpssh -f hostfile 'ps -flyCpostgres'
    
  3. List of all processes for the 'hung' process:
    gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID'
    
  4. Check whether any of the processes are stuck in IO wait; this may indicate an issue with either the OS or the underlying hardware:
    gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID | grep ^D'
    
  5. Finally, list any processes for the query in question that are in the running state:
    gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID | grep ^R'
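
  As a worked example, SESS_ID is substituted directly into the conSESS_ID pattern used above, and the output is saved for the RCA. A minimal sketch, assuming a hung session with ID 12345 and an illustrative output file name:
    SESS_ID=12345
    gpssh -f hostfile "ps -flyCpostgres | grep con${SESS_ID}" > ps_con${SESS_ID}.out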
    

IMPORTANT: If you find processes that are consistently in IO wait, halt at this step and verify whether there is an OS or hardware related issue (e.g. an XFS file system bug, a lost NFS mount, a faulty RAID controller). Processes stuck in IO wait are often the cause of the 'hung' query, and the underlying issue will need to be addressed before going any further.
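
One quick way to check for such problems is to scan the kernel log on every host for disk, filesystem, or storage errors. A minimal sketch using dmesg (the search patterns are illustrative, not exhaustive):
    gpssh -f hostfile 'dmesg | egrep -i "error|xfs|nfs|raid" | tail -20'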

3. What to Collect from Processes

Based on the output gathered above, we can now start to collect information on the processes themselves. There are some caveats that you must be aware of:

  • Do NOT attempt to collect information on any process that is in IO wait (see step #4 above). Doing so will hang.
  • At this point, once we have ruled out OS/HW issues, we either have a communication failure or a process that is actually running but is taking longer than expected to complete its work.
  • For software-only installs, the following utilities will need to be installed prior to proceeding (a quick availability check is sketched after this list):
    • gdb
    • pstack
    • strace
    • lsof
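
    A quick way to confirm the tools are present on every host before proceeding; a minimal sketch (package names and installation method vary by distribution, so adjust as needed):
      gpssh -f hostfile 'for cmd in gdb pstack strace lsof; do which $cmd >/dev/null 2>&1 || echo "$cmd missing"; done'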

To begin, we will collect the following information on each segment host, including the master.

  1. pstack output for all session processes:
    ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do pstack $pid > `hostname`-pstack.$pid; done
    
  2. lsof output for all session processes:
    ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do /usr/sbin/lsof -p $pid > `hostname`-lsof.$pid; done
    
  3. For any process that is running we will need to collect some additional information:
    1. A core file for each running process:
      ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do gcore -o `hostname`-gcore.$pid $pid; done
      
    2. Generate a command file to collect the strace output from each of the running processes:
      ps -flyCpostgres | grep conSESS_ID | grep ^R | awk '{print $3}' | xargs -n1 -I '{}' echo '(strace -p {} -o strace -ff )  & sleep 15 ; kill $!' > run_strace.sh
      
    3. Run the above command file and wait for it to complete:
      bash ./run_strace.sh
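
These loops run locally, so repeat them on the master and on each segment host identified earlier. Once they complete, the per-host output files can be bundled into a single archive per host for upload with the RCA. A minimal sketch, assuming the files were written to the current directory; the archive name and file list are illustrative, so include whichever outputs were actually produced:
    tar czf `hostname`-hung-query-RCA.tar.gz `hostname`-pstack.* `hostname`-lsof.* `hostname`-gcore.* strace.*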
      
