Pivotal GemFire versions 8.x and 9.x
The purpose of this article is to generate awareness about how critical it is to avoid at all cost the usage of manual-start=true, or manually stopping, a parallel gateway-sender. Long story short: both operations can cause catastrophic failures within a GemFire cluster, including data loss and thread exhaustion.
Whenever a parallel gateway-sender is attached to a region, an internal shadow region queue is created by GemFire on the member where the original region is hosted. This internal region is used as the backing data structure to queue and dequeue events that need to distributed by the gateway-sender. Its actual type depends on the distribution strategy configured for the gateway-sender: it will be created as REPLICATED when using serial distribution (parallel=false), and it will be created as PARTITIONED when using parallel distribution (parallel=true). Its attributes, on the other hand, will be configured based on the original region to which the gateway-sender is attached plus the configured options for the gateway-sender itself. Moreover, the shadow PARTITIONED region queue created when the gateway-sender is configured as parallel will always be co-located with the original PARTITIONED region to which the gateway-sender is attached.
On the other hand, and by default, when using PERSISTENT PARTITIONED regions, the full recovery does not occur until all co-located regions have been created and recovered from disk. This happens by design, primarily to avoid moving data in some of the co-located regions before all of the co-located regions have actually been recovered. Otherwise, the co-location might be broken and erratic behavior would happen.
That said, the gateway-sender is internally implemented in such a way that the recovery of the internal shadow region queue only happens during the start of the sender itself. If the parallel gateway-sender is configured as persistent with manual-start=true, the recovery of the internal shadow region queue won't happen until the user manually starts the sender, which also means that the original region data to which the gateway-sender is attached will be totally empty as well, at least until the gateway-sender starts and recovers it data.
To make things worse, every operation that relates to the original region in any way (QOL, get, put, etc.) will be blocked because the internal thread loading the region in memory hasn't finished its work yet. It won't finish until the gateway-sender actually starts and the co-located shadow region queue is recovered. Thus, with every new operation, a new thread will pile up behind the original thread trying to recover the region. This, ultimately, will cause the server to start rejecting connections due to thread exhaustion or max connections hit, "masking" the root cause of the problem even more.
In addition, a parallel gateway-sender must never be stopped if data-loss wants to be avoided. The API and gfsh command used to stop a gateway-sender stop it in just one member, which causes data loss because events to buckets in that member will be dropped by the stopped gateway-sender. The PARTITIONED region does not failover in this scenario since the member is still running. Instead, and to ensure that the remaining events are sent, the user must shut down the entire membership to ensure proper failover of the PARTITIONED region events. When a member of the stopped parallel gateway-sender is shut down, the other parallel gateway-sender members hosting the partition region become primary and deliver the remaining events. In addition, if the whole cluster is brought down after stopping an individual parallel gateway sender, then events queued on that gateway sender will be lost.
In summary, never stop a parallel gateway-sender and never use manual-start=true, this last property is already deprecated and will be removed in future releases.