GemFire 7 and later
Visual Statistics Display (VSD) is a visual tool for analyzing GemFire statistics, and it is probably the most important GemFire tool to understand: it is used both when tuning GemFire and when troubleshooting most GemFire issues. The purpose of this article is to help users get started with VSD and understand how to use it to troubleshoot GemFire issues.
VSD works by reading GemFire statistics from the *.gfs archive files created by GemFire and rendering graphs of them for analysis. It is not a real-time monitoring tool like Pulse, so it lacks the real-time monitoring and alerting capabilities that online tools have. On the other hand, it is the most powerful tool for examining the state of a GemFire system, as it provides access to the very large set of statistics that GemFire collects, covering GemFire, Java, and OS parameters. No real-time monitoring tool can do that, because the volume of statistics GemFire collects is prohibitive for real-time collection in a distributed system.
Having a complete view into the state of a GemFire process is what makes VSD an indispensable forensic tool for performance analysis and for tracking down problems through offline analysis of statistics gathered by the cluster. It is also helpful whenever you need to verify the runtime state of a distributed system: for example, upon startup or data loading, to make sure that all the nodes are present and see one another, that all the entries are loaded and well balanced across the nodes, and that the JVM heaps have enough headroom.
The amount of statistics available for viewing in VSD can be overwhelming. This article will point out some of the most important statistics that are useful in verifying the state of a distributed system, including its configuration, resource usage, and throughput for different operations.
Getting Started with VSD
Since GemFire version 7.0, VSD is included with GemFire and is located in the tools subdirectory of the product directory tree. A brief user guide is included in the GemFire User's Guide.
An important prerequisite for VSD is that the collection of GemFire statistics be enabled at runtime. That is accomplished by setting the configuration properties:
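A minimal sketch of the relevant settings in gemfire.properties (the archive file name here is an example):

```properties
# Enable statistics sampling and choose an archive file for this member
statistic-sampling-enabled=true
statistic-archive-file=server1-stats.gfs
# Sampling interval in milliseconds (1000 is the default)
statistic-sample-rate=1000
```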
As the collection of statistics at the default sampling rate of one second does not affect performance, it should always be enabled during development, testing, and in production. Note that it is also possible to enable statistics without bringing down the GemFire cluster, using the gfsh "alter runtime" command.
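For example, statistics can be enabled on a running member with a gfsh command along these lines (the member name is an example, and option names may vary slightly between GemFire versions):

```
gfsh> alter runtime --member=server1 --enable-statistics=true --statistic-archive-file=server1-stats.gfs
```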
There is a special category of statistics called time-based statistics that can be very useful in troubleshooting and assessing performance of some GemFire operations, but they should be used with caution because their collection can affect performance. They can be enabled using the property:
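In gemfire.properties this is:

```properties
# Collect time-based statistics; adds timing overhead, so enable only when needed
enable-time-statistics=true
```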
Limit disk space usage
As with log files, it is important to configure statistics rolling to manage disk space usage. To set up rolling of statistics files, use the following parameters:
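For example, in gemfire.properties (both limits are in megabytes):

```properties
# Roll each archive at 100 MB and cap total archive disk usage at 1 GB
archive-file-size-limit=100
archive-disk-space-limit=1000
```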
This causes the gfs files to roll when they reach 100 MB and keeps the last 10 files, for a maximum of 1 GB of used disk space. Appropriate sizes vary from environment to environment; the goal is to strike the right balance between disk space usage, archiving, and easy handling of files.
Analyzing the Data
Once a distributed system is up and running, every GemFire instance creates its own statistics files. The best way to load these files into VSD is to copy all the stat files into one directory and pass them as parameters when launching VSD. For this to work, each server's statistics files must be named differently; using the host name plus the member name is a good practice.

An important note when comparing statistics to events in the GemFire logs: VSD shows times in the time zone of the machine running VSD, not the time zone in which the statistics and logs were created. Setting the time zone before launching VSD helps in interpreting data and correlating events with log entries. See this article for details on getting the time zone from the gfs files and on creating a script that launches VSD with the correct time zone.
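A sketch of this workflow, assuming a Unix-like shell with ssh access and a cluster running in the Europe/Berlin time zone (all host names and paths here are examples):

```
# Gather all archives into one directory, renamed by host and member
mkdir stats
scp host1:/opt/gemfire/server1/statArchive.gfs stats/host1-server1.gfs
scp host2:/opt/gemfire/server2/statArchive.gfs stats/host2-server2.gfs

# Launch VSD in the time zone where the statistics were created
TZ=Europe/Berlin vsd stats/*.gfs
```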
Once you have VSD running and statistics archives loaded, it will be populated with an overwhelming number of metrics.
Make sure that the statistics from all members cover the timeline of when the incident happened. You can verify this by opening a graph for any of the metrics and then selecting:
- Chart - Time Format - Month/day
The number of types and parameters in each section is large. Setting "Main - No Flatlines" helps by showing only those parameters whose values changed during the time span of the statistics file.
Overview of principal statistics
Begin by taking a look at the Quick Guide to Useful Statistics in the GemFire User's Guide. The following are additional checks to make:
Basic health check
Open the type StatSampler and the parameter delayDuration. This should be roughly a straight line at the configured sampling rate (1000 ms by default). If there are many deviations from the flat line, and they exceed the configured rate by more than 100%, the system is having trouble.
Another important statistic shown in StatSampler is jvmPauses. These are not necessarily full stop-the-world garbage collection pauses, but any resource shortage severe enough that the StatSampler cannot collect data on schedule. These events are also logged with the following message in the member logs:
[warning 2015/01/21 13:39:17.935 CET <Thread-6 StatSampler> tid=0x2e] Statistics sampling thread detected a wakeup delay of 3,173 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
LinuxSystemStats - ioWait
ioWait is another useful health indicator when persistence is in use. It is the percentage of CPU time spent waiting for I/O operations to complete. It should be below 10% on a healthy system.
Recommendation: use local disks for persistence instead of network storage. If network storage must be used, SAN is recommended over NFS.
distributionStats - nodes
Check whether any nodes go down or come up after system startup. This statistic shows the number of known nodes in the distributed system. If it is a flat line, this member was the last to come up; otherwise you will see a staircase-shaped graph.
distributionStats - replyWaitsInProgress
This can go up and down; it is a problem only if it does not come back down to zero. In that case the member is waiting for an acknowledgment from another member, so you should look for the member it is waiting for. If nodes are stuck at a non-zero value, you will need thread dumps from those members to figure out what is deadlocked.
ParNew collections should occur roughly once per second to once every 15 seconds, as a guideline. More than one ParNew collection per second is bad. Collection time should be a low percentage of total time.
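The guidance above assumes a CMS/ParNew collector configuration, which for GemFire servers typically looks something like the following JVM options (heap sizes and thresholds are illustrative examples, not recommendations):

```
-Xms8g -Xmx8g
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
```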
CMSOldGen - Heapmemory
Check the statistics currentMaxMemory and currentUsedMemory under CMSOldGen - Heapmemory. If currentUsedMemory just climbs continually and never drops, either no garbage is being created or something is broken in GC.
LinuxSystemStats - cpuActive
Check cpuActive against cpuUser and cpuSystem to determine whether the CPU is being used by GemFire or by a third-party process.
LinuxSystemStats - contextSwitches
It is a bad sign if CPU usage is high while contextSwitches is also high.
For example, if the system spends 750 milliseconds per second waiting on disk I/O on a host with 4 CPUs, roughly 20% of total CPU capacity is being spent on disk I/O (750 ms out of 4000 ms of CPU time per second). To check how many CPUs are on the system, look at vmStats - cpus.
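The arithmetic above can be sketched as follows (a hypothetical helper for illustration, not part of GemFire or VSD):

```python
def io_wait_share(io_wait_ms_per_sec: float, num_cpus: int) -> float:
    """Fraction of total CPU capacity spent waiting on disk I/O.

    io_wait_ms_per_sec: milliseconds of I/O wait observed per wall-clock second.
    num_cpus: total CPUs on the host (see vmStats - cpus).
    """
    # Each CPU contributes 1000 ms of capacity per second of wall-clock time.
    return io_wait_ms_per_sec / (1000.0 * num_cpus)

print(io_wait_share(750, 4))  # 0.1875, i.e., roughly 20%
```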
LinuxSystemStats - loadAverage
Shows how many threads are running or waiting to run, on average. Sustained values above the number of CPUs indicate CPU contention.
LinuxSystemStats - freeMemory
This gives an idea of how much you can increase the heap for GemFire. It should start at physicalMemory and go down. If it starts below physical memory, then processes other than the OS and GemFire are using memory. How many members are running on the same host or VM? Several members may be competing for memory.
DiskRegionStatistics - xxxxxCache
Show all regions, and turn off the legend (Chart - Show Legend) to keep the chart readable.
Note that when an entry is evicted to disk, the key and a reference to the value's location on disk remain in memory so that GemFire can retrieve the value from disk. If the keys are relatively large compared to the values, eviction will not free up much space.
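As a rough illustration of why key size matters (the 16-byte disk-pointer overhead is an assumption made for this sketch, not a GemFire figure):

```python
def freed_fraction(key_bytes: int, value_bytes: int, pointer_bytes: int = 16) -> float:
    """Approximate fraction of an entry's heap footprint freed by evicting
    its value to disk, given that the key and a disk pointer stay in memory."""
    before = key_bytes + value_bytes
    after = key_bytes + pointer_bytes
    return (before - after) / before

# Large value relative to the key: eviction frees most of the entry
print(round(freed_fraction(64, 4096), 2))  # 0.98
# Value about the same size as the key: little is freed
print(round(freed_fraction(64, 80), 2))    # 0.44
```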
CacheServerStats - currentClientConnections
Check this statistic across all members to see whether client load is evenly spread across the cluster.