- Pivotal Cloud Foundry® (PCF) 1.9, 1.10
- IaaS: vSphere
- CFOps versions earlier than v3.1.7
A user ran a full backup of ERT. After the backup completed, none of the services monitored by "monit" on the Cloud Controller were started, which left PCF in an unusable state.
The following error was seen in the backup_log:
2017/08/21 05:45:50 E0821 05:45:50.121512 19139 createCliCommand.go:52] there was an error: failed calling ChangeJobState: failed calling http client: Put https://10.xx.xx.x:25555/deployments/cf-3dxxxxxxxxxxxxxx/jobs/cloud_controller/2?state=started: read tcp 10.xx.xxx.xxx:56560->10.xx.xx.xx:25555: read: connection timed out running backup on elastic-runtime tile:tile
The logs do not show which request is timing out. The following was observed:
- New connections to the BOSH Director always work.
- Small backups work (a test run shorter than 15 minutes completed without errors).
- TCP connections between the CFOps machine and the BOSH Director (nginx) are in an idle state.
We see two possible causes for this:
- CFOps reuses the same TCP connection, which times out while it sits idle during a long backup.
- CFOps opens a new connection, but the token it presents is old, and nothing about this is logged.
The fix is to create a new connection each time the CFOps machine needs to communicate with the BOSH Director.
The fix is included in CFOps v3.1.7: