GemFire 7 and later
The purpose of this article is to provide a quick overview of best practices for networking with GemFire in three important categories:
- Low latency
- High throughput
- Fault tolerance
Latency is the most common performance bottleneck for network-dependent systems like GemFire. Latency can be kept low by following these best practices:
- Keep all members of a GemFire distributed system and their clients on the same LAN and preferably on the same LAN segment. The goal is to place all GemFire cluster members and clients in close proximity to each other on the network. This not only minimizes propagation delays, it also serves to minimize other delays resulting from routing and traffic management. GemFire members are in constant communication and so even relatively small changes in network delays can multiply, impacting overall performance.
- Avoid encrypting traffic between cluster members unless it is required. Distributed systems like GemFire generate high volumes of network traffic, including a fair amount of system management traffic. Encrypting network traffic between the members of a GemFire cluster adds processing delays even when the traffic contains no sensitive data. As an alternative, consider encrypting only the sensitive data itself. If it is necessary to restrict access to data on the wire between GemFire members, consider placing the GemFire members in a separate network security zone that cordons off the GemFire cluster from other systems.
- Use the fastest link available. Bandwidth alone does not determine throughput but, all things being equal, a higher-speed link transmits more data in the same amount of time than a slower one. Distributed systems like GemFire move high volumes of traffic through the network and can benefit from having the highest-speed link available. While some GemFire customers with exacting performance requirements use InfiniBand network technology, which is capable of link speeds up to 40Gbps, 10GbE is sufficient for most applications and is generally recommended for production and performance/system testing environments. For development environments and less critical applications, 1GbE is often sufficient.
- Use IPv4. By default, GemFire uses Internet Protocol version 4 (IPv4). Testing with GemFire has shown that IPv4 provides better performance than IPv6.
GemFire systems are often called upon to handle extremely high transaction volumes and, as a consequence, move large amounts of traffic through the network. As a result, one of the primary design goals in architecting a GemFire solution is to maximize network throughput. This can be achieved in the following ways, all of which assume that TCP and IPv4 are being used:
Increasing TCP’s Initial Congestion Window allows TCP to transfer more data in the first round trip and significantly accelerates window growth, which is an especially critical optimization for bursty and short-lived connections. The parameter to control this is
net.ipv4.tcp_congestion_window, which defaults to 1. The recommendation is to increase it to 10. This is done on the network interface by adding a couple of lines like the following to
defrt=`ip route | grep "^default" | head -1`
ip route change $defrt initcwnd 10
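The two lines above can be wrapped in a small dry-run helper so the resulting command can be reviewed before it is applied. This is a minimal sketch, not part of GemFire: the function name build_initcwnd_cmd is hypothetical, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Build the "ip route change" command for a given default-route line.
# build_initcwnd_cmd is a hypothetical helper name used for illustration.
# Arguments: $1 = route line as printed by "ip route", $2 = window (default 10).
build_initcwnd_cmd() {
    route_line=$1
    cwnd=${2:-10}
    printf 'ip route change %s initcwnd %s\n' "$route_line" "$cwnd"
}

# Dry run: look up the first default route, if the ip tool is present,
# and print the command that would raise its initial congestion window.
if command -v ip >/dev/null 2>&1; then
    defrt=$(ip route | grep '^default' | head -1)
    if [ -n "$defrt" ]; then
        build_initcwnd_cmd "$defrt"
    fi
fi
```

Running the printed command requires root privileges. The setting applies to the route, so it is lost whenever the route is recreated, which is why the article places the lines in a startup file.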
Disabling TCP Slow-Start After Idle improves the performance of long-lived TCP connections that transfer data in bursts. Set the parameter
net.ipv4.tcp_slow_start_after_idle to 0 to disable it. By default, TCP starts with a single small segment and gradually increases the window by one segment at a time. This results in unnecessary slowness at the start of every request.
Enabling Window Scaling (RFC 1323) increases the maximum receive window size and allows high-latency connections to achieve better throughput. Set
to 1 to enable
Enabling TCP Low Latency effectively tells the operating system to sacrifice throughput for lower latency. For latency-sensitive workloads like GemFire, this is an acceptable tradeoff that can improve performance. Set
net.ipv4.tcp_low_latency to 1 to configure TCP to favor low latency over throughput.
Enabling TCP Fast Open (TFO) allows application data to be sent in the initial SYN packet in certain situations. TFO is a new optimization that requires support on both clients and servers and may not be available on all operating systems. By default, the
TCP_FASTOPEN feature is not enabled at runtime (unless it has been enabled in the sysctl.conf file). Set
net.ipv4.tcp_fastopen to 1 to enable it.
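The kernel parameters that are named explicitly above can be collected into a single sysctl fragment. The following is a minimal sketch that only prints the fragment. The function name gemfire_tcp_sysctl is hypothetical, the values are the ones given in this article, and parameters whose names are not spelled out in the text are deliberately left out.

```shell
#!/bin/sh
# Print a sysctl fragment with the TCP settings named in this article.
# Nothing is applied to the running kernel; the output is meant to be
# reviewed first. gemfire_tcp_sysctl is a hypothetical helper name.
gemfire_tcp_sysctl() {
    cat <<'EOF'
# Do not fall back to slow start after a connection has been idle
net.ipv4.tcp_slow_start_after_idle = 0
# Favor low latency over throughput
net.ipv4.tcp_low_latency = 1
# Enable TCP Fast Open (value as given in the article)
net.ipv4.tcp_fastopen = 1
EOF
}

gemfire_tcp_sysctl
```

One way to use it, assuming a Linux host and root privileges, is to redirect the output into a file and load that file with `sysctl -p <file>`. Whether each key exists and has an effect depends on the kernel version, so it is worth confirming each one with `sysctl <key>` afterwards.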
In addition, increasing the size of the transmit queue can also help TCP throughput. Add the following command to
/etc/rc.local to accomplish this.
/sbin/ifconfig eth0 txqueuelen 10000
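After the command has run, the queue length can be checked by reading the interface's entry in sysfs. The sketch below assumes a Linux host; the function names and the optional sysfs-root argument are hypothetical and exist only to keep the check easy to test.

```shell
#!/bin/sh
# Print the transmit queue length of an interface by reading sysfs.
# $1 = interface name, $2 = sysfs root (defaults to /sys/class/net).
txqueuelen_of() {
    iface=$1
    sysroot=${2:-/sys/class/net}
    cat "$sysroot/$iface/tx_queue_len"
}

# Report whether an interface meets a target queue length.
# $1 = interface, $2 = target (default 10000), $3 = sysfs root.
# Returns 0 if the target is met, 1 if it is not, 2 if the value
# could not be read (for example, the interface does not exist).
check_txqueuelen() {
    iface=$1
    target=${2:-10000}
    sysroot=${3:-/sys/class/net}
    current=$(txqueuelen_of "$iface" "$sysroot") || return 2
    if [ "$current" -ge "$target" ]; then
        echo "$iface: txqueuelen $current (ok)"
    else
        echo "$iface: txqueuelen $current (below $target)"
        return 1
    fi
}
```

For example, `check_txqueuelen eth0` reports on the interface used in the command above. On systems where ifconfig is not installed, `ip link set dev eth0 txqueuelen 10000` is the iproute2 equivalent.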
GemFire systems depend on network services, and network failures can have a significant impact on GemFire operations and performance. As a result, network fault tolerance is an important design goal for GemFire solutions.
- Use Mode 6 Network Interface Card (NIC) Bonding. NIC bonding involves combining multiple network connections in parallel in order to increase throughput and provide redundancy should one of the links fail.
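In the Linux bonding driver, mode 6 is the adaptive load balancing mode, named balance-alb. As a rough sketch of what creating such a bond with iproute2 might look like, the script below prints the commands without running them. The bond name bond0, the second interface eth1, the miimon value of 100 ms, and the function name print_bond_cmds are all assumptions made for illustration. Most distributions also have their own persistent network configuration, which is usually the better place for a production setup.

```shell
#!/bin/sh
# Print (do not run) the iproute2 commands for a mode 6 (balance-alb) bond.
# $1 = bond name, remaining arguments = member interfaces.
# print_bond_cmds is a hypothetical helper; miimon 100 is an assumed
# link-monitoring interval in milliseconds.
print_bond_cmds() {
    bond=$1
    shift
    echo "ip link add $bond type bond mode balance-alb miimon 100"
    for nic in "$@"; do
        # A member interface must be down before it can join the bond.
        echo "ip link set $nic down"
        echo "ip link set $nic master $bond"
    done
    echo "ip link set $bond up"
}

print_bond_cmds bond0 eth0 eth1
```

The printed commands require root privileges and will interrupt traffic on the member interfaces, so they are best reviewed and run from a console session rather than over the network being reconfigured.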