CQs stop working after a while

Hi there.

I'm running GemFire 6.6.4 with 2 locators started with --peer=true --server=true; they start up one after the other.
I also have 2 replicate servers connected to these locators.

Recently we hit a strange issue: the CQs stop sending messages to the subscriber after a while. When we bounce the locators and re-execute the CQ, it works fine for hours and then stops again. When I run only one locator, it keeps working. I didn't find any specific errors, just this warning:
[warning 2014/09/03 02:43:29.539 EDT <locator request thread[2]> tid=0xf] Expected one of these: [class com.gemstone.org.jgroups.stack.GossipData] but received LocatorListRequest{group=null}

Does anyone know this issue?

LiangleiPan Answered

6 comments


I'm not sure that the warning you got has anything to do with the CQs stopping. I have seen several networks configured to automatically drop connections that have been only reading and never writing for a while. Maybe this is your issue. Depending on how your network is configured, you may need to enable TCP keepalives.
The gemfire.enableTcpKeepAlive system property prevents connections that appear idle from being
timed out (for example, by a firewall.) When configured to true, GemFire enables the SO_KEEPALIVE
option for individual sockets. This operating system-level setting allows the socket to send verification
checks (ACK requests) to remote systems in order to determine whether or not to keep the socket
connection alive.
The default for this is true, so if this is the issue, the property must have been changed to false in your gemfire.properties file.
We have seen some networks that are so aggressive about shutting down "idle" sockets that even with TCP keepalives enabled they still shut down, and the endpoint doesn't get any notification. You might ask your network administrator about this.
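
For readers unfamiliar with SO_KEEPALIVE, here is a minimal sketch of what enabling it on a single socket looks like with plain java.net.Socket. This is not GemFire's own code; the class and method names (KeepAliveSketch, connectWithKeepAlive) are made up for illustration, and the loopback listener exists only so the example has something to connect to.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration class, not part of GemFire.
final class KeepAliveSketch {

    /** Opens a client socket to host:port and switches SO_KEEPALIVE on. */
    static Socket connectWithKeepAlive(InetAddress host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        // Ask the operating system to probe the peer while the connection is idle.
        socket.setKeepAlive(true);
        return socket;
    }

    public static void main(String[] args) throws IOException {
        // Loopback listener on an ephemeral port, used only as a connection target.
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectWithKeepAlive(listener.getInetAddress(),
                                                  listener.getLocalPort())) {
            System.out.println("SO_KEEPALIVE enabled: " + client.getKeepAlive());
        }
    }
}
```

Note that the probe interval is set by the operating system, not by the Java API. On Linux the default idle time before the first probe is 7200 seconds, so a firewall with a shorter idle timeout can still drop the connection unless the OS setting is lowered.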

Michael Stolz 0 votes

I'm not sure why this issue would have anything to do with locators. Like Mike said, it sounds like maybe a firewall or TTL issue with the server-to-client connection. What minor version of 6.6.4 are you using? Both SO keep alive and server-to-client pinger were introduced in the 6.6.4 minor releases. The pinger is automatic every three minutes by default, but SO keep alive is not enabled by default in 6.6.4. Getting a thread dump of the server might tell us something (like whether the MessageDispatcher thread that sends messages to clients is stuck or has died unexpectedly). Server stats would also tell us if the queue contains any entries.
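
A full thread dump (for example from jstack) is the usual way to check this. As a rough sketch of the same idea from inside a JVM, the snippet below lists live threads whose name contains a given fragment, together with their state. The class name ThreadFinderSketch and the default fragment "Message Dispatcher" are assumptions for illustration; check a real thread dump for the exact thread names your GemFire version uses. The code only sees threads of the JVM it runs in, so it would have to run inside the server process to be useful.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of GemFire.
final class ThreadFinderSketch {

    /** Returns thread name -> state for every live thread whose name contains the fragment. */
    static Map<String, Thread.State> findThreads(String nameFragment) {
        Map<String, Thread.State> matches = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().contains(nameFragment)) {
                matches.put(t.getName(), t.getState());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The default fragment is an assumption; pass the real name as an argument.
        String fragment = args.length > 0 ? args[0] : "Message Dispatcher";
        Map<String, Thread.State> found = findThreads(fragment);
        if (found.isEmpty()) {
            System.out.println("No live thread name contains: " + fragment);
        }
        found.forEach((name, state) -> System.out.println(name + " -> " + state));
    }
}
```

An empty result would suggest the thread has died; a thread that stays in the same state across several checks may be stuck.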

Barry Oglesby 0 votes

Hi Michael, Barry.

Thanks a lot for your quick response.

We don't disable TCP keepalives in gemfire.properties.
The 2 replicate servers run on 2 machines with the same configuration.
We are using GemFire 6.6.3.4 for the Java client/server, and 3.5 for the .NET library.

We have a lot of CQs registered by Java clients, and they always receive their updates without problems. Only the .NET client has this issue.

We also have another cluster connected over the WAN (multi-site) with similar nodes and locators. It has no Java clients, only .NET clients, and they always work fine.

So there are a few things I don't understand:
1. The 2 sites have similar configuration. The site with the issue has both Java client CQs and .NET client CQs; the site with only .NET CQs has no problem.
2. Why does it work fine if I have only one locator enabled in the cluster?
3. Why does it start working again only after I restart the locators?

I made a config change to client-subscription on all servers at both sites, not sure whether it's relevant. I set the capacity very small to save memory:
<client-subscription eviction-policy="mem"
capacity="5" overflow-directory="./cacheStore/" />

Another thing related to CQs: I notice that the cache stays smaller when we don't have any CQs created, but once we have CQs the memory usage almost triples. I don't understand why creating CQs causes such a big increase in memory consumption.

I'll examine the JVM stack trace the next time it happens to see whether the message dispatcher has stopped working.
I will also examine the statistics in VSD, and might contact the SA to find out whether the network setup differs between the 2 sites.

LiangleiPan 0 votes

I think your best bet in this case is to file a support ticket. If you can provide artifacts (stats, logs, thread dumps, heap dumps/histograms), that would be great.

Barry Oglesby 0 votes

Hi Barry.

You mentioned that a minor version of 6.6.4 introduced the pinger. Which minor version do you mean?
I only have 6.6.4 in my VMware downloads list, with no minor versions listed. Can you please send the URL for that version?

thanks a lot

LiangleiPan 0 votes

The server-to-client ping task is actually in 6.6.4. It pings the client every 3 minutes by default if there is no activity between the server and client (no other messages sent to the client). This task is controlled by the following java system properties:

gemfire.serverToClientPingPeriod (default=60000ms) - how often the task runs
gemfire.serverToClientPingCounter (default=3) - how many task runs before a ping is sent

Unless the timeout for idle sockets is < 3 minutes, you shouldn't have to change these properties.
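
To make the arithmetic behind the 3-minute figure explicit: the idle time before a ping is the task period multiplied by the counter, so 60000 ms x 3 = 180000 ms. The sketch below only mirrors that calculation using the property names and defaults quoted above; it is not how GemFire reads these properties internally, and the class and method names (PingIntervalSketch, idleMillisBeforePing) are hypothetical.

```java
// Hypothetical illustration class, not part of GemFire.
final class PingIntervalSketch {

    // Property names and defaults as quoted in this thread.
    static final String PERIOD_PROP = "gemfire.serverToClientPingPeriod";
    static final String COUNTER_PROP = "gemfire.serverToClientPingCounter";

    /** Idle time in milliseconds before a ping is sent: task period times run count. */
    static long idleMillisBeforePing(long periodMillis, int counter) {
        return periodMillis * counter;
    }

    /** Same calculation, reading -D system properties and falling back to the defaults. */
    static long idleMillisBeforePing() {
        long period = Long.getLong(PERIOD_PROP, 60000L);
        int counter = Integer.getInteger(COUNTER_PROP, 3);
        return idleMillisBeforePing(period, counter);
    }

    public static void main(String[] args) {
        System.out.println("Idle time before a ping: "
                + idleMillisBeforePing() / 1000 + " s");
    }
}
```

If a firewall drops idle sockets after, say, 2 minutes, the same calculation shows that lowering the period to 20000 ms with the counter left at 3 would bring the ping interval down to 60 seconds.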

The feature that is not in 6.6.4 is the Client/Server TCP keep-alive property. It was introduced in 6.6.4.3. The Java system property called gemfire.enableCsTcpKeepAlive in 6.6.4 enables the socket's SO_KEEPALIVE option. It is false (disabled) by default in 6.6.4. To enable it, set it like this on the clients and servers:

-Dgemfire.enableCsTcpKeepAlive=true

Note: The property name and default value are different in 8.0.

You should upgrade to the latest 6.6.4 release. You'll have to go through support to get that.

Barry Oglesby 0 votes