Queries failed due to the error message: "VM protect failed to allocate %d bytes from the system, VM Protect %d MB available", What does the error message mean and where can I check and understand the reason for the error?
What does "VM protect failed to allocate %d bytes from the system, VM Protect %d MB available" mean?
When the amount of virtual memory used by Greenplum queries is less than the limit defined by gp_vmem_protect_limit, and an attempt to allocate memory is made for a query but if the Operating System falls short of allocating memory even though after allocation the memory used by Greenplum will be still below the limit defined by gp_vmem_protect_limit, the error noted above is seen.
This suggests that the system does not have enough memory for Greenplum queries which can be due to misconfiguration, high memory usage of the servers by the non-GPDB processes, or some unaccounted memory consuming process.
Due to the system running off the memory, the database segment will experience a PANIC and will go into recovery mode to recover the failed transaction. This will cause all the existing queries to fail.
Are there some statement / or execution that can't be controlled via resource queue / gp_vmem_protect_limit or any other parameters?
Yes, There are certain memory allocations made for a query execution that are NOT controlled by resource queues / gp_vmem_protect_limit or statement_mem etc and this can consume memory directly from the OS, These memory allocations will primarily include memory used to hold the plan generated which will be dispatched for the query
Memory consumption due to query plan.
- On the master -> Uncompressed size of the query plan
- Total memory consumption on a server to handle in memory -> Uncompressed size of the query plan * Number of active primary segments
In order to view the plan size, you can set gp_log_gang GUC to debug, and review the logs to view the query plan size.
warehouse=# set gp_log_gang=debug; SET warehouse=# explain analyze select * from abc t1, abc t2 where t1.a=t2.a;
After you run the above query, logs will include an entry like below:
2015-03-11 06:03:20.288583 PDT,"bhuvnesh","warehouse",p5919,th-1598946860,"[local]",,2015-03-11 06:03:07 PDT,13258,con14,cmd3,seg-1,,dx52,x13258,sx1,"LOG","00000","Query plan size to dispatch: 3KB",,,,,,"explain analyze select * from abc t1, abc t2 where t1.a=t2.a ;",0,,"cdbdisp.c",4013,
The size of maximum query plan which will be dispatched can be limited by setting the value of gp_max_plan_size. If there is any query, whose uncompressed plan size will exceed the gp_max_plan_size, this query will not be dispatched to the segments / executed and an error will be reported similar to the error below.
warehouse=# set gp_max_plan_size='2kB'; SET warehouse=# explain analyze select * from abc t1, abc t2 where t1.a=t2.a; ERROR: Query plan size limit exceeded, current size: 3KB, max allowed size: 2KB HINT: Size controlled by gp_max_plan_size
The above message indicates, the plan for the query is 3KB but the max allowed plan is 2KB, this is also one method to identify the plan size of the query.
You may set a value of gp_max_plan_size depending on the average size of the plans used by the queries on the database to identify any potential queries whose requirement for memory due to plan size can impact heavily.
One must keep this aspect in mind during their analysis and is often a good idea to isolate if the OOM is not due to plan size.
Some basic troubleshooting steps
There are several factors which can result in OS OOM issues. Below is a list of checks which one must do during troubleshooting.
1. Validate the value of gp_vmem_protect_limit. gp_vmem_protect_limit maximum limit must be calculated based on the calculation as provided in Greenplum Documentation.
(SWAP + (RAM * vm.overcommit_ratio)) .9 / primary number_segments_per_server
Please note that in the above a multiplication factor of .9 is used which leaves 10% of virtual memory available for other processes ( like the kernel, segment mirrors etc ). However, if there are several non-GPDB processes, ex. java programs, other memory consuming processes, you must reduce .9 to lower values to allow more memory for other processes.
2. If at a point, number of active primaries on a segment was higher than the default number of primaries and gp_vmem_protect_limit is configured using a multiplication factor ~ .9 , the segment server will be limited in memory and might not be able to allocate required gp_vmem_protect_limit to the segments due to the temporary activation of some primaries ( for eg.s some of the primaries failed and mirrors is now acting as primaries on the server )
You can verify the gp_segment_configuration and gp_configuration_history to review the status of segments around the time of OOM.
select * from gp_segment_configuration where role<>preferred_role;
In this case, OOM will be expected and the immediate resolution will be to recover the segments and bring the segments to their preferred role.
If the servers have high memory modules, ex. 126 GB, 256 GB, 512 MB RAM or higher, one can revisit the value of gp_vmem_protect_limit to accommodate such situations.
3. Verify if there are other (non-GPDB processes) running on the master/segment server. It can happen that there are many memory consuming processes, ex, java processes, etc. which might have taken up the memory causing GPDB queries to fail. However, one will not be able to comment if there was anything else running around the time when OOM failures were observed since it's not recorded in logs, but the users who administer the system might have information on it, else monitoring needs to be scheduled.
You can use "ps/top" commands to verify the process occupying high memory. Example commands
ps -eo pid,pmem,rss,vsz,cmd --sort=vsz
top (then) F (then) > q (then) -> Enter (It will sort the output based on RSS memory utilization in descending order)
4. Verify using "sar -r" output the amount of memory free around the time OOM is reported. However, since the samples for sar memory are taken every 10 min, it may not have been captured if the memory consumption went high momentarily in between the sample.
sar -r -f /var/log/sa/saXX (where XX is the date, by default sar creates a file under /var/log/sa using the date ex. sar05, sar06, etc. You may be able to see files for the last 1 month).
The output will be similar to
12:00:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit 12:10:01 AM 33650484 32235996 48.93 207844 25584116 11850416 11.92
In the above output, kbmemfree is the amount of memory free in kb at 12.10.01 AM. This output does not consider SWAP space.
Based on the analysis, there could be the below possible paths
- There were some badly written queries which required tuning or updated statistics were not there. In this case, after applying Greenplum best practices the query should run successfully.
- There are other non-GPDB processes running on the system. In this case, the only option is to move those processes to some other servers and not have them co-located with the segments, or if there is a possibility to restrict the memory usage for those processes a limit must be imposed. Also, calculation of gp_vmem_protect_limit needs to be revisited.
- For some queries which are identified as having high memory usage based on the results of mem_watcher, gp_vmem_protect_limit is not respected as there are some unaccounted processes during query processing which are causing OS memory due to consumed and resulting in OOM. In this case, the issue should be reported to support.