Pivotal Knowledge Base

Spring Cloud Services Instance update fails due to Broker Worker /copy_bits failure

Environment 

Spring Cloud Services (SCS) 1.3, 1.4, 1.5

Symptom

When updating a Spring Cloud Services instance, for example a config server instance:

cf update-service config-server -c '{"git":{"uri": "https://xxxxx"}}'

The update fails after a few minutes, and the Spring Cloud Services Broker Worker reports a network I/O error for the /copy_bits operation against the Pivotal Cloud Foundry (PCF) API endpoint (https://api.SYSTEM_DOMAIN):

2018-04-24T01:32:15.21+0200 [APP/PROC/WEB/0] OUT 2018-04-23 23:32:15.209 ERROR [spring-cloud-service-broker-worker,
dede2c16d0d80aec,dede2c16d0d80aec,false] 15 --- [cTaskExecutor-2] i.p.s.s.messaging.RequestHandler : Error updating
service instance: org.springframework.web.client.ResourceAccessException: I/O error on POST request for
"https://api.SYSTEM_DOMAIN/v2/apps/6fbb83d8-36ad-46cb-a8e9-47d291553c9b/copy_bits": api.SYSTEM_DOMAIN:443
failed to respond; nested exception is org.apache.http.NoHttpResponseException:
api.SYSTEM_DOMAIN:443 failed to respond
2018-04-24T01:32:15.21+0200 [APP/PROC/WEB/0] OUT org.springframework.web.client.ResourceAccessException:
I/O error on POST request for "https://api.SYSTEM_DOMAIN/v2/apps/6fbb83d8-36ad-46cb-a8e9-47d291553c9b/copy_bits":
api.SYSTEM_DOMAIN:443 failed to respond; nested exception is org.apache.http.NoHttpResponseException:
api.SYSTEM_DOMAIN:443 failed to respond

Cause

This issue occurs in environments where the customer's Load Balancer in front of PCF is configured with an aggressive http-keep-alive timeout, such as a few seconds. In such environments, when the SCS Broker and Worker start a request to the PCF API endpoint over a connection taken from the connection pool, there is a high chance that the Load Balancer has already reset that connection because it sat idle longer than the timeout.

Usually, the SCS Broker Worker retries a failed request: it opens a new connection or immediately uses another connection from the pool, and the retry succeeds. The /copy_bits API request, however, is not retried, because this particular API endpoint generates heavy load on the Cloud Controller. In environments with an aggressive http-keep-alive, a failed /copy_bits request therefore surfaces as the exception shown above, and cf update-service fails.
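The difference between a retried request and a non-retried one such as /copy_bits can be illustrated with a small simulation. This is only a sketch of the behaviour described above, not SCS or Load Balancer code: the names PooledConnection, StaleConnectionError and request are hypothetical, and time is passed in as plain numbers of seconds.

```python
class StaleConnectionError(Exception):
    """Raised when a request is sent on a connection the Load Balancer already closed."""


class PooledConnection:
    """Hypothetical pooled connection that remembers when it was last used."""

    def __init__(self, now):
        self.last_used = now

    def send(self, now, lb_keep_alive):
        # Assumption: the Load Balancer drops any connection that has been
        # idle for longer than its http-keep-alive timeout.
        if now - self.last_used > lb_keep_alive:
            raise StaleConnectionError("failed to respond")
        self.last_used = now
        return "ok"


def request(pool, now, lb_keep_alive, retry):
    """Send one request, reusing a pooled connection if one is available."""
    conn = pool.pop() if pool else PooledConnection(now)
    try:
        result = conn.send(now, lb_keep_alive)
    except StaleConnectionError:
        if not retry:
            # Models a non-retried call such as /copy_bits:
            # the I/O error propagates to the caller.
            raise
        # Models the usual behaviour: retry once on a fresh connection.
        conn = PooledConnection(now)
        result = conn.send(now, lb_keep_alive)
    pool.append(conn)
    return result
```

With a connection that has been idle for 5 seconds and a keep-alive of 2 seconds, the call with retry=True succeeds on the second attempt, while the call with retry=False raises. Raising the keep-alive to 10 seconds lets the same non-retried call succeed.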

Resolution

Increasing the http-keep-alive timeout on the Load Balancer from 1 or 2 seconds to 10 seconds can significantly reduce the connection pool I/O errors.

Additional Information

The problem can affect any client application that uses a connection pool. Applications that do not use a connection pool are not impacted. Here, http-keep-alive refers to how long a connection remains open but idle between a response and the next request.
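To make the idle window concrete, the following toy pool tracks how long each connection has been idle and hands out only connections that are still inside that window. It is a minimal sketch for illustration, assuming the client knows a safe idle limit below the Load Balancer timeout. IdleAwarePool, max_idle and the connect callback are hypothetical names, and this is not how SCS or any particular HTTP client library implements pooling.

```python
import time


class IdleAwarePool:
    """Toy connection pool that discards connections idle for max_idle seconds or more."""

    def __init__(self, max_idle, clock=time.monotonic):
        self.max_idle = max_idle  # assumed to be below the Load Balancer keep-alive
        self.clock = clock        # injectable so the idle time can be simulated
        self._idle = []           # (connection, time it was returned to the pool)

    def release(self, conn):
        """Return a connection to the pool and record when it became idle."""
        self._idle.append((conn, self.clock()))

    def acquire(self, connect):
        """Return a recently used connection, or open a new one via connect()."""
        now = self.clock()
        while self._idle:
            conn, returned_at = self._idle.pop()
            if now - returned_at < self.max_idle:
                return conn  # still inside the keep-alive window
            # Idle too long: drop it (a real pool would also close the socket).
        return connect()
```

A client without a pool always takes the connect() path, which is why such applications do not see the error.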
