- Pivotal Greenplum (GPDB) 4.3.x
- OS RHEL
Analyze_session is a sub-tool packaged with gpmt. Given a GPDB session ID, this utility will scan all segments and look for busy processes associated with that session. Analyze session will then proceed to start tracing the identified processes by running pstack, lsof, gcore/packcore, and strace. All the collected data will be put into a single tarfile and should be given to remote support.
Requirements and Caveats
- lsof, strace, pstack, gcore, gdb must be installed on all hosts in order for this tool to successfully complete.
- gcore will perform a memory dump of the GPDB process and the size could be anywhere form 300MB to serval GigaBytes. Isolating which hosts to collect using "-hostfile" can be key in saving space.
USAGE: gpmt analyze_session [ -session SESSION_ID ] [ -master-dir DIRECTORY ] [ -segment-dir DIRECTORY ] DESCRIPTION: Given GPDB session id analyze_session will look for non-idle gpdb processes and collect diagnostic information. The information collected will be used remotely for root cause analysis. The most common use case for this tool is when a GPDB query process appears hung. OPTIONS: -session GPDB session id which is referenced in pg_stat_activity -master-dir Working directory for master process -segment-dir Working directory for segment processes -free-space default=10 free space threshold which will abort data collection if reached -a Answer yes to all prompts EXAMPLES: Collect Process information for a given GPDB session id gpmt analyze_session -session 12345 Disable prompts gpmt analyze_session -session 12345 -a Change master directory gpmt analyze_session -session 12345 -master-dir /data Change segment directory gpmt analyze_session -session 12345 -segment-dir /data1
Simulating a use case
- Start by kicking off a query that will run for a very long time.
create table Hung_analyzer_UnitTest_12345 as select generate_series(1,999999); select count(*) from Hung_analyzer_UnitTest_12345 a,Hung_analyzer_UnitTest_12345 b, Hung_analyzer_UnitTest_12345 c; drop table hung_analyzer_unittest_12345;
- Then get the session ID from the database.
gpadmin=# select sess_id, current_query from pg_stat_activity ; sess_id | current_query ---------+--------------------------------------------------------------------------------------------------------------------- 7 | select count(*) from Hung_analyzer_UnitTest_12345 a,Hung_analyzer_UnitTest_12345 b, Hung_analyzer_UnitTest_12345 c; 21 | select sess_id, current_query from pg_stat_activity ; (2 rows)
- Run analyze_session to collect information required for remote diagnoses.
[gpadmin@mdw demo]$ gpmt analyze_session -session 7
- The user will be prompted to confirm collection. This prompt offers two benefits:
- Lets the user see how many processes we are going to collect. If there are thousands of processes found, then you may find the collection will be too large. We should then pass in the "-hostfile" option to gpmt in an effort to limit the collection size.
- Secondly, the prompt gives the user a high-level list of only the servers that have the busy processes. This can be useful if you want to quickly check which hosts are busy and which hosts are not. If there are very few busy hosts, then the problem might be processes skew or there is some environmental issue with the small subset of the host.
The following will be collected: mdw: 3098 postgres: port 5432, gpadmin gpadmin 192.168.64.100(29054) con7 192.168.64.100(29054) cmd28 SELECT sdw1.gpmt.local: 2881 postgres: port 40000, gpadmin gpadmin 192.168.64.100(50187) con7 seg0 cmd29 slice3 MPPEXEC SELECT sdw1.gpmt.local: 2883 postgres: port 40001, gpadmin gpadmin 192.168.64.100(19324) con7 seg1 cmd29 slice3 MPPEXEC SELECT sdw1.gpmt.local: 6877 postgres: port 40000, gpadmin gpadmin 192.168.64.100(51067) con7 seg0 cmd29 slice1 MPPEXEC SELECT sdw1.gpmt.local: 6879 postgres: port 40001, gpadmin gpadmin 192.168.64.100(20204) con7 seg1 cmd29 slice1 MPPEXEC SELECT sdw2.gpmt.local: 2881 postgres: port 40000, gpadmin gpadmin 192.168.64.100(33582) con7 seg2 cmd29 slice3 MPPEXEC SELECT sdw2.gpmt.local: 2883 postgres: port 40001, gpadmin gpadmin 192.168.64.100(44039) con7 seg3 cmd29 slice3 MPPEXEC SELECT sdw2.gpmt.local: 6793 postgres: port 40000, gpadmin gpadmin 192.168.64.100(34462) con7 seg2 cmd29 slice1 MPPEXEC SELECT sdw2.gpmt.local: 6795 postgres: port 40001, gpadmin gpadmin 192.168.64.100(44919) con7 seg3 cmd29 slice1 MPPEXEC SELECT Proceed with collection? Yy|Nn (default=Y): y Proceed with collection? Yy|Nn (default=Y): y Starting to collect segment process information Estimated Duration is 180 seconds |==================> |[%36]
Isolate Specific segment hosts
gpmt has a global option "-hostfile" that can be used to isolate which segment servers analyse_session should search. This can help in situations where there is a specific set of servers currently suspected or in cases where we don't need data from all nodes and a collection on a single node is more than adequate.
Process in IO wait or defunct
If analyze_session determines the process is stuck in IO wait or is defunct, then the collection will immediately stop and processes information will be presented to user. Processes stuck in either of these states immediately suggest the root cause for the hung session.