Pivotal Knowledge Base

GemFire: Rebalance - Achieving and maintaining balanced PR data

Applies to

GemFire versions 7+

Purpose

The purpose of this article is to help customers be more proactive in keeping their distributed systems healthy by achieving and maintaining balanced data across all partitioned regions over time. Balanced data spreads the load of requests coming into the GemFire cluster across its members, which supports optimal performance.

Symptoms

When your partitioned region data is out of balance, you may encounter a variety of issues. You may run out of memory or hit configured eviction/critical thresholds. You may experience poor put performance in your environment. You may see CPU-related issues on a given node because it holds much more of the data than other nodes and therefore receives a higher than expected share of the incoming activity. Whatever the case may be, it may be worth analyzing the environment to determine whether your data is balanced as it should be. If it is not, this article describes how to take corrective action and how to be more proactive in preventing the situation from developing again in the cluster.

Description

The Rebalance command in GemFire is used to move your partitioned region data around the cluster to achieve balance for both primary and secondary buckets. Without such balance, performance of the distributed system can be negatively impacted, sometimes to a great degree, because requests overload the heavily loaded members while members with little data are underutilized in comparison.

Rebalance Command

One way to be more proactive in maintaining this data balance across the cluster is to perform a "rebalance --simulate" command. The output from this command will indicate how much data would be moved if a true rebalance were to be executed at that time. This simulation is non-invasive to your ongoing operations as no data movement occurs when using the simulate option for the rebalance command.

If the output indicates that little to no data would be moved, then your partitioned region data is likely well balanced. However, if the algorithm suggests that a high number of bytes would be moved (more than 10% of your data), then you may want to consider a rebalance during any available servicing window, or at least at some off-peak time.
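The 10% guideline above can be written down as a small check. This is only a sketch: the function names and the 0.10 default are illustrative, and it assumes you have already read the number of bytes that would be moved from the simulation output and know the total size of your partitioned region data from your own monitoring.

```python
def fraction_to_move(bytes_to_move: int, total_bytes: int) -> float:
    """Return the share of the data that a rebalance would transfer."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return bytes_to_move / total_bytes


def rebalance_suggested(bytes_to_move: int, total_bytes: int,
                        threshold: float = 0.10) -> bool:
    """True when more than `threshold` of the data would move.

    The 0.10 default mirrors the ">10% of your data" guideline; it is a
    rule of thumb, not a GemFire setting.
    """
    return fraction_to_move(bytes_to_move, total_bytes) > threshold


if __name__ == "__main__":
    # Hypothetical numbers: 150 MB would move out of 1000 MB in total.
    print(rebalance_suggested(150_000_000, 1_000_000_000))
```

A result of True only suggests that a rebalance is worth scheduling; whether the cluster can absorb the impact remains a separate decision.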

It is important to state that a real (non-simulation) rebalance can take some time and can impact the ongoing operations/transactions in your cluster as the buckets move around. You must determine whether your cluster can handle the impact of the rebalance on such activity. If you deem it necessary or more prudent to limit the scope, it is also possible to rebalance one region at a time.

Monitoring

It is possible to monitor each partitioned region for balance across the distributed system by using the RegionMXBean. Two of its methods can be used to determine whether the region is balanced across the cluster.

  • getPrimaryBucketCount
  • getBucketCount

The getPrimaryBucketCount() method returns the number of primary buckets on the given member for the given region. Executing this on all members should provide an indication of whether the primaries are balanced. If you add up all of the members' primary bucket counts for a given region, the sum should equal total-num-buckets, which, by default, is 113.

The getBucketCount() method returns the total number of buckets that exist on the given member for the given region, whether primary or secondary. If you add up all of the members' bucket counts for a given region, the sum should equal (total-num-buckets * (redundant-copies + 1)). For example, if you maintain 1 redundant copy, then the total bucket count across all members should be twice the total primary bucket count.
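The two sums described above can be checked with a few lines of arithmetic once the per-member values have been collected. The sketch below is illustrative only: the function names, member names, and the `max_spread` tolerance are assumptions, and the per-member counts are expected to be gathered separately (for example from getPrimaryBucketCount() and getBucketCount() on each member). The total bucket count is taken as the primaries plus one full set of buckets per redundant copy, which matches the "twice the primary count" case for one redundant copy. Treating a difference of more than one bucket between members as imbalance assumes equally sized members.

```python
from typing import Dict, List, Tuple


def expected_totals(total_num_buckets: int = 113,
                    redundant_copies: int = 0) -> Tuple[int, int]:
    """Return (expected primary sum, expected bucket sum) for one region."""
    expected_primaries = total_num_buckets
    # One primary plus `redundant_copies` secondary copies of every bucket.
    expected_buckets = total_num_buckets * (redundant_copies + 1)
    return expected_primaries, expected_buckets


def check_region(primary_counts: Dict[str, int],
                 bucket_counts: Dict[str, int],
                 total_num_buckets: int = 113,
                 redundant_copies: int = 0,
                 max_spread: int = 1) -> List[str]:
    """Return human-readable findings; an empty list means no issue found."""
    findings = []
    expected_primaries, expected_buckets = expected_totals(
        total_num_buckets, redundant_copies)
    actual_primaries = sum(primary_counts.values())
    actual_buckets = sum(bucket_counts.values())
    if actual_primaries != expected_primaries:
        findings.append(
            f"primary buckets: expected {expected_primaries}, found {actual_primaries}")
    if actual_buckets != expected_buckets:
        findings.append(
            f"total buckets: expected {expected_buckets}, found {actual_buckets}")
    # Hypothetical balance rule: members should differ by at most max_spread.
    for label, counts in (("primary", primary_counts), ("total", bucket_counts)):
        if counts:
            spread = max(counts.values()) - min(counts.values())
            if spread > max_spread:
                findings.append(
                    f"{label} bucket counts differ by {spread} between members")
    return findings


if __name__ == "__main__":
    # Hypothetical three-member cluster with 1 redundant copy.
    primaries = {"server1": 60, "server2": 30, "server3": 23}
    buckets = {"server1": 76, "server2": 75, "server3": 75}
    for finding in check_region(primaries, buckets, redundant_copies=1):
        print(finding)
```

If members are configured with different capacities, an even bucket count is not necessarily the target, so the tolerance should be adjusted or the spread check skipped.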

Show Metrics

It is possible to examine these values through gfsh as well, using the show metrics command:

gfsh>show metrics --categories=partition --region=region_name

The output of this command will show you the bucketCount and the primaryBucketCount, among other values such as the number of buckets without full redundancy. Please refer to the "Checking Redundancy in Partitioned Regions" section of our GemFire documentation using this link

VSD

Finally, our VSD tool can be used to examine the PartitionedRegionStats, which show the bucketCount and primaryBucketCount for any configured partitioned region. Instructions for using VSD can also be found in the documentation at the above link.
