The queries fail due to the error message: "VM Protect failed to allocate %d bytes, %d MB available", What does the error message mean and where can I check and understand the reason for the error?
In this document, we will take a look explaining the error message and some basic troubleshooting steps that can be taken.
What does the error "VM Protect failed to allocate %d bytes, %d MB available" mean?
gp_vmem_protect_limit parameter limits the amount of memory that can be allocated on the segments to queries, if the attempt to allocate memory to a query causes gp_vmem_protect_limit to exceed, the query will be canceled and the error message as noted above will be recorded in the database logs.
This is a self-defence mechanism introduced to avoid over-consumption of the memory by Pivotal Greenplum(GPDB) database queries and is not a bug.
What are some action that can be taken to troubleshoot the issue?
There can be multiple reasons because of which queries running on a system can hit such an error and we should first try to isolate if the issue seen is due to concurrency or not.
Steps which should be performed
-- 1. Analyze the master and segments logs for the required time window to isolate if the problem is query specific or most of the queries failed with the same error message.
Ex: egrep "Out of memory|VM Protect" gpdb-YYYY-MM-DD_nnnnnn.log
-- 2. If there is only a limited number (Ex. 1 or 2) of queries which failed due to the error message as per the logs.
Identify the query, and try to execute the queries on the database to identify if it's reproducible.
If the issue is reproducible, you can proceed further to extract answers to the below questions.
- Is it a new query or It has been running since long without issues?
- Has the data changed, ex: number of records in the source tables?
- Was there any database parameter changed recently or any database upgrade?
Below is the list of checks one should perform to identify why the query is consuming such high amount of memory and if an immediate resolution can be found.
2.a Verify if all the tables used in the query are analyzed.
2.b. Review the explain plan of the query to verify any abnormalities, specifically
- Nested Loops: If the explain plan suggest that the query is using Nested Loops, you can try disabling Nested Loops to force Hash join.
set enable_nestloop=off ; explain analyze <query>;
- Issues due to join order, try setting the below two parameter to lower value and retry the query
- Verify if there is any processing skew due to which only a few servers are loaded.
If using the above steps a quick resolution cannot be found, one has to look further to tune the query.
+ Isolate parts which are causing higher memory consumption and see if the query can be changed. Ex. create temporary tables to stage the data and then use simple queries. This will reduce the amount of memory required for query processing.
+ Verify if the join condition in the query uses same keys as that of distribution policy of the table.
+ Refer to the link "Guidelines to tune queries on GPDB" for examples on query tuning on GPDB.
-- 3. If there are several queries which failed or the issue could not be reproduced after running the queries then it requires more extensive exercise to identify the problem.
In most cases, if the tables are analyzed and the same queries are running successfully when run individually, such issues can occur due to higher concurrency of queries, i.e Number of queries running simultaneously on the system is high which is causing the memory requirements to hit gp_vmem_protect_limit. In such cases, one should follow the below path:
3.a. Consult with end users and identify if "ad-hoc" queries are allowed on the system or only "scheduled" loads are run on the database system. If "ad-hoc" queries are run there are high chances that some of the users might have triggered a bad query which caused the issue and its hard to predict a timeline for the same for reoccurrence, if "scheduled" loads are run usually the issue should occur around the fixed timeline.
3.b. Execute mem_watcher script to track the memory consumption of the queries on the segments.
- If "ad-hoc" queries are run on the system, you may keep the mem_watcher script running in the background
- If only "scheduled" queries/loads are run on the system, you may run the mem_watcher script for the duration of the batch
3.c. You can also set the value of gp_max_plan_size to avoid queries running out of memory due to higher memory requirement for handling the plan. To start with, you can set it to 200MB and later review if it needs to be increased or further tighten up.
How to set
gpconfig -c gp_max_plan_size -v "200MB" gpstop -u gpconfig -s gp_max_plan_size
Note: Amount of memory required to handle plans is not accounted under gp_vmem_protect_limit.
3.d. Once the issue has reoccurred while mem_watcher was running in the background, you need to stop mem_watcher and review the data collected after generating a report using memreport utility.
- Review the steps available at link Utility: mem_watcher & memreport
- Based on the output of memreport you need to decide if the memory consumption by the queries is acceptable and the issue is due to concurrency, or there are one or few queries which are occupying higher memory and need to be evaluated for further tuning.
3.e. If you are on GPDB 184.108.40.206 or above, you can discuss with the end users and work to the set the parameter "runaway_detector_activation_percent" to limit the excessive usage of gp_vmem_protect_limit and cancel queries which are consuming higher memory at that point of time. Those queries can be further reviewed to identify the cause of high memory usage. If the query do reach the limit you will encounter the error similar to the one mentioned here.
Please refer to the documentation for more details.
Note- Always validate if the tables are analyzed.
Based on the analysis, there could be the below possible paths:
1. There were some badly written queries which required tuning or updated statistics were not there. In this case, after applying GPDB best practices the query should run successfully.
2. All queries were tuned, but concurrency is the issue.
In this cases, there are a couple of options which are available and can be recommended to the end users.
- Reduce query concurrency using a resource queue
- Validate gp_vmem_protect_limit and increase it if the servers has capacity
- Use a session setting to reduce the statement_mem used by specific queries
- Decrease statement_mem at the database level
- Set the memory quota on a resource queue to limit the memory used by queries executed within the resource queue
- Increase memory on the host
- Decrease the number of segments per host in the cluster [Needs planning]
3. Might be an unexpected behavior In this case, please open ticket with support.