Pivotal Knowledge Base

How to interpret the mem_watcher output (memreport)

Environment

Pivotal Greenplum (GPDB)

Purpose

I have installed mem_watcher using the steps mentioned in the related article.

Now I see "Out of memory" errors in my logs. How can I use mem_watcher to identify the reason for the high memory usage, and how do I interpret the output generated by the mem_watcher daemon?

Solution

-- The master log indicates segments going down / queries crashing due to out of memory, and it also indicates that the first occurrence of the issue (Out of memory) was reported on sdw5.

[gpadmin@mdw.gphd.local]$  egrep "Out of memory|VM Protect" gpdb-2015-04-05_000000.csv | head -5
2015-04-05 13:15:00.446356 AST,"yyyy","xxxxx",p289315,th1929869088,"10.24.27.6","54872",2015-04-05 12:59:53 AST,117929745,con223649,cmd5,seg-1,,dx539105,x117929745,sx1,"ERROR","53400","Out of memory  (seg39 sdw5.gphd.local:1026 pid=33919)","VM protect failed to allocate 9432 bytes from system, VM Protect 1738 MB available",,,,"SQL statement "" select xxxx from xxxxx "
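
-- To confirm which segment hosts are reporting the errors and how often, the host names can be pulled out of the error text. This is only an illustrative filter on the same master log file:

[gpadmin@mdw.gphd.local]$ egrep "Out of memory" gpdb-2015-04-05_000000.csv | grep -o "sdw[0-9]*\.gphd\.local" | sort | uniq -c | sort -rn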

-- The sdw5 seg39:1026 log for that date and time indicates the same issue.

2015-04-05 13:15:06.469127 AST,"yyyyy","xxxx",p318858,th1929869088,"172.24.8.203","32034",2015-04-05 13:14:49 AST,117934545,con224062,cmd4,seg-1,,dx542885,x117934545,sx1,"ERROR","53400","Out of memory  (seg39 slice21 sdw5.gphd.local:1032 pid=552958)","VM protect failed to allocate 65536 bytes from system, VM Protect 1703 MB available",,,,,
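
-- To check the segment's own log directly on sdw5, a filter like the one below can be used. The data directory path here is only a placeholder; substitute the actual primary data directory of seg39 on your cluster.

# placeholder pg_log path; adjust to the real location of seg39's data directory
[gpadmin@sdw5.gphd.local]$ egrep "Out of memory|VM Protect" /data1/primary/gpseg39/pg_log/gpdb-2015-04-05_*.csv | head -5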

-- After following the guide, the mem_watcher daemon was stopped and the gzip files below were copied to the destination directory.

[gpadmin@mdw.gphd.local]$ ls -ltr | head
total 945552
-rw------- 1 gpadmin gpadmin 67456477 Apr  5 17:45 mdw.ps.out.gz
-rw------- 1 gpadmin gpadmin 52737601 Apr  5 17:45 smdw.ps.out.gz
-rw------- 1 gpadmin gpadmin 98228335 Apr  5 17:45 sdw1.ps.out.gz
-rw------- 1 gpadmin gpadmin 96828623 Apr  5 17:45 sdw2.ps.out.gz
-rw------- 1 gpadmin gpadmin 96604075 Apr  5 17:45 sdw3.ps.out.gz
-rw------- 1 gpadmin gpadmin 98378869 Apr  5 17:45 sdw4.ps.out.gz
-rw------- 1 gpadmin gpadmin 98314429 Apr  5 17:45 sdw5.ps.out.gz
-rw------- 1 gpadmin gpadmin 97213238 Apr  5 17:45 sdw6.ps.out.gz
-rw------- 1 gpadmin gpadmin 97972956 Apr  5 17:45 sdw7.ps.out.gz

-- As sdw5 is the point of interest for us, run the memreport executable (attached to this article) on its file using the command below:

/path/to/memreport sdw5.ps.out.gz

-- memreport creates one file per one-minute polling interval, formatting the raw ps output into a much more readable report.

[.....]
-rw------- 1 gpadmin gpadmin     2417 Apr  5 17:52 20150405-173533
-rw------- 1 gpadmin gpadmin     2225 Apr  5 17:52 20150405-173633
-rw------- 1 gpadmin gpadmin     2417 Apr  5 17:52 20150405-173733
-rw------- 1 gpadmin gpadmin     2225 Apr  5 17:52 20150405-173834
-rw------- 1 gpadmin gpadmin     2225 Apr  5 17:52 20150405-173934
-rw------- 1 gpadmin gpadmin     2417 Apr  5 17:52 20150405-174034
-rw------- 1 gpadmin gpadmin     2225 Apr  5 17:52 20150405-174134
-rw------- 1 gpadmin gpadmin     2417 Apr  5 17:52 20150405-174234
-rw------- 1 gpadmin gpadmin     2225 Apr  5 17:52 20150405-174334
-rw------- 1 gpadmin gpadmin     3539 Apr  5 17:52 20150405-174434

-- The issue happened around "2015-04-05 13:15:06.469127 AST", as indicated by the error message in the sdw5 server log.
-- Open the report file whose timestamp is closest to that time (it can be a minute before or after the error was reported); the goal is to look at the snapshot taken closest to when the error occurred.

Here, for example, the closest snapshot is at 2015-04-05 13:14:40, so open that file:

[gpadmin@mdw.gphd.local]$ less 20150405-131440
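
-- If the exact minute is not obvious, listing the report files generated around the error time first helps to pick the right snapshot, for example:

[gpadmin@mdw.gphd.local]$ ls -1 20150405-131[2-6]*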

-- The first line is the time when the information was collected.

Polling Timestamp:  2015-04-05T13:14:40

-- Next is a summary of the number of processes and the total OS memory used.

NODE SUMMARY
============

Total Number of Processes:                        1079
Total Memory used (MB)                            63128.4062

-- Processes that were in I/O sleep (status = D in ps -efly output):

I/O sleep processes
-------------------

PID             TIME  WCHAN      COMMAND
544453      00:09:59  xfs_bu     postgres: port  1032, yyyy xxxx 172.28.8.250(38727) con223649 seg39 cmd5 slice4 MPPEXEC INSERT
544449      00:09:30  sync_p     postgres: port  1030, yyyy xxxx 172.28.8.250(56732) con223649 seg37 cmd5 slice4 MPPEXEC INSERT
544447      00:10:00  sync_p     postgres: port  1029, yyyy xxxx 172.28.8.250(53128) con223649 seg36 cmd5 slice4 MPPEXEC INSERT
544445      00:10:03  sync_p     postgres: port  1028, yyyy xxxx 172.28.8.250(48279) con223649 seg35 cmd5 slice4 MPPEXEC INSERT
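
-- Status D corresponds to uninterruptible sleep, usually waiting on I/O. The equivalent check on a live host, shown only as an illustration, is to filter the state column of ps -efly:

# processes currently in uninterruptible (D) sleep; the first column of ps -efly is the state
[gpadmin@sdw5.gphd.local]$ ps -efly | awk '$1 == "D"'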

-- Processes that consumed more than 50% CPU on that host (for example, sdw5 in our case):

Processes with CPU > 50%
------------------------------

PID       %CPU  COMMAND
544443    67.3  postgres: port  1027, yyyy xxxx 172.28.8.250(2617) con223649 seg34 cmd5 slice4 MPPEXEC INSERT
540691    98.8  postgres: port  1027, yyyy xxxx 172.28.8.250(56727) con223472 seg34 cmd11 slice2 MPPEXEC UPDATE
544453    67.6  postgres: port  1032, yyyy xxxx 172.28.8.250(38727) con223649 seg39 cmd5 slice4 MPPEXEC INSERT
544451    68.5  postgres: port  1031, yyyy xxxx 172.28.8.250(60721) con223649 seg38 cmd5 slice4 MPPEXEC INSERT
540701    98.6  postgres: port  1032, yyyy xxxx 172.28.8.250(28582) con223472 seg39 cmd11 slice2 MPPEXEC UPDATE
540689    98.8  postgres: port  1026, yyyy xxxx 172.28.8.250(27633) con223472 seg33 cmd11 slice2 MPPEXEC UPDATE
540695    98.3  postgres: port  1029, yyyy xxxx 172.28.8.250(42983) con223472 seg36 cmd11 slice2 MPPEXEC UPDATE
544449    64.3  postgres: port  1030, yyyy xxxx 172.28.8.250(56732) con223649 seg37 cmd5 slice4 MPPEXEC INSERT
544441    65.2  postgres: port  1026, yyyy xxxx 172.28.8.250(37778) con223649 seg33 cmd5 slice4 MPPEXEC INSERT
544447    67.6  postgres: port  1029, yyyy xxxx 172.28.8.250(53128) con223649 seg36 cmd5 slice4 MPPEXEC INSERT
544439    68.8  postgres: port  1025, yyyy xxxx 172.28.8.250(60347) con223649 seg32 cmd5 slice4 MPPEXEC INSERT
540699    97.8  postgres: port  1031, yyyy xxxx 172.28.8.250(50576) con223472 seg38 cmd11 slice2 MPPEXEC UPDATE
544445    67.9  postgres: port  1028, yyyy xxxx 172.28.8.250(48279) con223649 seg35 cmd5 slice4 MPPEXEC INSERT
540697    98.2  postgres: port  1030, yyyy xxxx 172.28.8.250(46587) con223472 seg37 cmd11 slice2 MPPEXEC UPDATE
540693    98.4  postgres: port  1028, yyyy xxxx 172.28.8.250(38134) con223472 seg35 cmd11 slice2 MPPEXEC UPDATE
540687    98.8  postgres: port  1025, yyyy xxxx 172.28.8.250(50202) con223472 seg32 cmd11 slice2 MPPEXEC UPDATE
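
-- memreport derives this list from the recorded ps snapshot; a comparable live view on the host itself (shown only as an illustration) is:

# current top CPU consumers, highest first
[gpadmin@sdw5.gphd.local]$ ps -eo pid,pcpu,args --sort=-pcpu | head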

-- Processes that consumed more than 5% of memory on that host (for example, sdw5 in our case):

Processes with %MEM > 5%
--------------------

PID       %%MEM COMMAND
540691    9.6   postgres: port  1027, yyyy xxxx 172.28.8.250(56727) con223472 seg34 cmd11 slice2 MPPEXEC UPDATE
540701    9.6   postgres: port  1032, yyyy xxxx 172.28.8.250(28582) con223472 seg39 cmd11 slice2 MPPEXEC UPDATE
540689    9.7   postgres: port  1026, yyyy xxxx 172.28.8.250(27633) con223472 seg33 cmd11 slice2 MPPEXEC UPDATE
540695    9.5   postgres: port  1029, yyyy xxxx 172.28.8.250(42983) con223472 seg36 cmd11 slice2 MPPEXEC UPDATE
540699    9.8   postgres: port  1031, yyyy xxxx 172.28.8.250(50576) con223472 seg38 cmd11 slice2 MPPEXEC UPDATE
540697    9.8   postgres: port  1030, yyyy xxxx 172.28.8.250(46587) con223472 seg37 cmd11 slice2 MPPEXEC UPDATE
540693    9.9   postgres: port  1028, yyyy xxxx 172.28.8.250(38134) con223472 seg35 cmd11 slice2 MPPEXEC UPDATE
540687    9.9   postgres: port  1025, yyyy xxxx 172.28.8.250(50202) con223472 seg32 cmd11 slice2 MPPEXEC UPDATE
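
-- Similarly, the largest live memory consumers on a host can be listed directly with ps (illustration only; the report works from the recorded snapshot):

# current top memory consumers by resident set size, highest first
[gpadmin@sdw5.gphd.local]$ ps -eo pid,pmem,rss,args --sort=-rss | head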

-- The memory consumption per segment (as you can see, most of the segments have close to 7 GB consumed by user queries):

SEGMENT SUMMARY
===============

Segment             Total (MB)   Session (MB)     Background (MB)       # Active         # Idle
32                     7869.23        7209.71              369.23              9              5
33                     7580.93        7096.49              344.68              9              5
34                     7552.13        7043.05              367.73              9              5
35                     7655.19        7172.38              343.32              9              5
36                     7638.15        6974.20              373.67              9              5
37                     7642.78        7188.83              371.66              9              5
38                     7648.50        7141.01              366.80             10              4
39                     7610.16        6975.55              346.16              9              5
41                      103.55           0.00               88.93              0              0
46                      363.47           0.00               75.39              0              0
47                       87.19           0.00               81.55              0              0
49                       69.05           0.00               65.46              0              0
54                      380.67           0.00               90.52              0              0
55                      363.20           0.00               75.14              0              0
62                      141.37           0.00               65.88              0              0
63                       85.13           0.00               72.58              0              0
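
-- The per-segment figures above are derived from the recorded ps output; for session processes, the segNN token in the command line identifies the owning segment. A rough, hypothetical sketch of that kind of grouping on a live host (RSS of session backends only; memreport's own accounting, including the Background column, may differ) is:

# sum resident memory (ps reports RSS in KB) per segment, keyed on the segNN token in the command line
[gpadmin@sdw5.gphd.local]$ ps -eo rss=,args= | awk '
    /postgres: port/ {
        seg = ""
        for (i = 2; i <= NF; i++) if ($i ~ /^seg[0-9]+$/) seg = $i
        if (seg != "") mem[seg] += $1
    }
    END { for (s in mem) printf "%-6s %10.2f MB\n", s, mem[s] / 1024 }'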

-- The top consumers broken down by segment. In the output below, "con223472", which executed command 11 (cmd11), used close to 6 GB of memory on most segments; summed over all the primary segments in the cluster, the total consumption for that session is roughly 50 GB, which is very high.

This session's query should therefore be the point of concern and needs to be investigated.

                     seg32     seg33     seg34     seg35     seg36     seg37     seg38     seg39     seg41     seg46     seg47     seg49     seg54     seg55     seg62     seg63     Total
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[......]
con223240:idle      160.32    147.74    146.27    145.79    147.50    160.07      0.00    138.25      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00   1045.94
con223472:cmd11    6382.97   6313.14   6241.47   6392.58   6169.64   6368.42   6330.21   6203.20      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00  50401.64
[......]

-- Now, to identify the query, navigate to the master log and search for con223472 and the command "cmd11". Sometimes the command indicated by the report has actually failed (as in the example below); in that case check the previous command (i.e. cmd10).

2015-04-05 13:15:24.437496 AST,"yyyy","xxxx",p260492,th1474799360,"[local]",,2015-04-05 12:45:02 AST,117934601,con223472,cmd11,seg-1,,dx542903,x117934601,sx1,"LOG","00000","Dispatcher encountered connection error on seg32 slice2 sdw5.gphd.local:1026 pid=777162: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
",,,,,,,0,,,,

-- The 10th command (cmd10) is the cause of the problem:

2015-04-05 12:53:19.061529 AST,"yyyyy","xxxx",p260492,th1929869088,"10.103.5.55","22351",2015-04-05 12:53:12 AST,117919425,con223472,cmd10,seg-1,,dx529095,x117919425,sx1,"LOG","00000","statement: select xxxxx from zzzzz "
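
-- For reference, the relevant statements can be pulled from the master log with a filter such as the one below (file name taken from the example above):

[gpadmin@mdw.gphd.local]$ egrep "con223472" gpdb-2015-04-05_000000.csv | egrep "cmd10|cmd11" | grep "statement:"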

-- Once you find the query, get its explain plan and identify what is driving the high memory consumption (for example, missing statistics) so that the memory usage of the query can be reduced.
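
-- For example, the plan of the identified statement (redacted here, with a placeholder database name) could be captured with psql to look for memory-hungry operators or missing statistics:

[gpadmin@mdw.gphd.local]$ psql dbname -c "EXPLAIN select xxxxx from zzzzz;"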

Additional Notes

There are occurrences where a process consumed memory, failed with "Out of Memory", and released the memory, all within one minute.

When mem_watcher woke up after its 60-second sleep, it could not log that information because the process no longer existed on the cluster.

To capture more information on such processes, you can open the mem_watcher file, reduce the sleep interval in the line below to whatever value is convenient (say, capture every 10 seconds), and restart the daemon.

Please note: changing to a lower value will generate more files on disk (small KB-sized files), so ensure you have sufficient disk space, and change the value back once you have captured the information.

[.....]
def runPSProcess(sleepInt=60):
[.....]
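
To make the change in place, an edit along the lines of the following can be used; the path to the mem_watcher script is a placeholder, as above:

# lower the polling interval from 60 seconds to 10 seconds (placeholder script path)
sed -i 's/sleepInt=60/sleepInt=10/' /path/to/mem_watcher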

Comments

  • Vijayakumar Mani
    I'm getting an error while running memreport, could you please help?

    [gpadmin@mdw space]$ ls -lrt
    total 2052
    -rw------- 1 gpadmin gpadmin 258946 Jan 19 14:30 smdw.ps.out.gz
    -rw------- 1 gpadmin gpadmin 376526 Jan 19 14:30 sdw3.ps.out.gz
    -rw------- 1 gpadmin gpadmin 375416 Jan 19 14:30 sdw2.ps.out.gz
    -rwxrwxrwx 1 gpadmin gpadmin 303356 Jan 19 14:30 mdw.ps.out.gz
    -rw------- 1 gpadmin gpadmin 374865 Jan 19 14:30 sdw4.ps.out.gz
    -rw------- 1 gpadmin gpadmin 375033 Jan 19 14:30 sdw1.ps.out.gz

    [gpadmin@mdw space]$ /usr/local/pivotal-support_1.4/usr/local/pivotal-support/bin/memreport smdw.ps.out.gz
    >>>16:01:19:13:59:40<<<
    Traceback (most recent call last):
      File "/usr/local/pivotal-support_1.4/usr/local/pivotal-support/bin/memreport", line 407, in <module>
        writeFile('.', *sample)
      File "/usr/local/pivotal-support_1.4/usr/local/pivotal-support/bin/memreport", line 379, in writeFile
        ProcessListing.process(e)
      File "/usr/local/pivotal-support_1.4/usr/local/pivotal-support/bin/memreport", line 34, in process
        return PostmasterProcessListing(ps_dictionary)
      File "/usr/local/pivotal-support_1.4/usr/local/pivotal-support/bin/memreport", line 135, in __init__
        self.segID = int(ps_dictionary['COMMAND'].split()[-1])
    ValueError: invalid literal for int() with base 10: '-E'
  • Faisal Ali

    Hi,

    I think I know why the issue is occurring.

    The piece of code that is failing is

    self.segID = int(ps_dictionary['COMMAND'].split()[-1])

    The command line it is trying to read looks like the following, and it picks the last word, which it assumes is an integer:

    [gpadmin@gpdbsne pg_log]$ ps -eo cmd | grep silent
    /usr/local/greenplum-db-4.3.6.2/bin/postgres -D /data/primary/gp4.3.6.2_181220150358310 -p 30010 -b 2 -z 2 --silent-mode=true -i -M mirrorless -C 0
    /usr/local/greenplum-db-4.3.6.2/bin/postgres -D /data/primary/gp4.3.6.2_181220150358311 -p 30011 -b 3 -z 2 --silent-mode=true -i -M mirrorless -C 1
    /usr/local/greenplum-db-4.3.6.2/bin/postgres -D /data/master/gp4.3.6.2_18122015035831-1 -p 3005 -b 1 -z 2 --silent-mode=true -i -M master -C -1 -x 0 -E

    But the file you passed in contains a master host's ps output, where the postmaster command line does not end with an integer but with the literal -E, and hence the error:

    [gpadmin@mdw space]$ /usr/local/pivotal-support_1.4/usr/local/pivotal-support/bin/memreport smdw.ps.out.gz 

    It's a programming bug; thanks for bringing this to our notice, I will report it to the dev team.
