Pivotal Cloud Foundry (PCF) Versions 1.7.x upgrade to 1.8.x
When upgrading from PCF 1.7 to 1.8, it is possible to see downtime because of a known issue with the Diego Brain VMs. This downtime can affect the platform in a few different ways. You may see one or more of these symptoms.
- GoRouters may lose their routes and applications may become inaccessible. Attempts to access applications on the platform then return 404 responses.
- Diego Brain VMs may lock up and become unresponsive. Attempts to SSH into the VMs will fail.
- You may see errors like `connect: no buffer space available` or `dial tcp: i/o timeout` in the Diego Brain's route_emitter logs.
- You might see errors like `TCP: out of memory -- consider tuning tcp_mem` on the Diego Brain VMs.
- BOSH may report being unable to communicate with the Diego Brain VMs. This could show up as a failure during the upgrade, as the resurrector recreating a Diego Brain VM, or as the job reporting as "unknown" when you run `bosh instances`.
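As a rough aid for spotting the log-based symptoms above, the following minimal Python sketch scans log lines for the error strings quoted in the list. The function name `find_symptoms` and the choice of substrings to match are assumptions made for illustration; this is not a Pivotal-provided tool.

```python
# Substrings taken from the error messages quoted in the symptom list.
SYMPTOM_SIGNATURES = (
    "no buffer space available",
    "i/o timeout",
    "TCP: out of memory",
)


def find_symptoms(lines):
    """Return (line_number, signature) pairs for lines that contain a known symptom."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SYMPTOM_SIGNATURES:
            if signature in line:
                hits.append((number, signature))
    return hits
```

You could pass it an open file handle for a copy of the route_emitter log, or a list of lines captured from a Diego Brain VM, and review any hits it returns.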
Part of the upgrade process is to update Diego's Bulletin Board System (BBS), and this happens before the Diego Brain VMs are upgraded. The issue starts because the Diego BBS in 1.8 has a new type of event in its event stream, and the clients on older Diego Brain VMs do not recognize this event type. The client on the Brain VM therefore concludes that there is an error with the stream and reconnects. Unfortunately, when this happens, the older version of the client does not clean up the previous connection. Over time the number of these connections grows, and eventually the VM's network stack slows down and stops working (this is the cause of most of the symptoms above). When this happens, the route-emitter process that runs on the Brain VM can no longer update application routes. Because the application routes stop being updated, they eventually time out and are pruned from the GoRouters. This is what causes app requests to return 404.
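The connection buildup described above can be illustrated with a toy Python model. This is only a sketch of the mechanism, not the actual Diego client code: the class names, the event type names (including `"crashed"`), and the `close_on_reconnect` switch are all hypothetical. The switch exists only to contrast a reconnect that leaks with one that cleans up; it does not represent how the 1.8 client behaves.

```python
# Hypothetical event type names that the older client understands.
KNOWN_EVENT_TYPES = {"created", "changed", "removed"}


class Connection:
    """Stand-in for a network connection to the event stream."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class EventClient:
    """Toy client that reconnects whenever it sees an unknown event type."""

    def __init__(self, close_on_reconnect):
        self.close_on_reconnect = close_on_reconnect
        self.connections = [Connection()]  # every connection ever opened

    def handle(self, event_type):
        if event_type in KNOWN_EVENT_TYPES:
            return
        # Unknown event: treated as a stream error, so the client reconnects.
        if self.close_on_reconnect:
            self.connections[-1].close()
        self.connections.append(Connection())

    def open_connections(self):
        return sum(1 for conn in self.connections if conn.open)


def simulate(events, close_on_reconnect):
    client = EventClient(close_on_reconnect)
    for event_type in events:
        client.handle(event_type)
    return client.open_connections()


events = ["created", "crashed", "changed", "crashed", "crashed"]
leaky_open = simulate(events, close_on_reconnect=False)    # old connections are never closed
cleanup_open = simulate(events, close_on_reconnect=True)   # old connection closed on each reconnect
```

In this model, each unrecognized event leaves one more open connection behind when nothing closes the previous one, which is the growth pattern the paragraph above describes.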
In most cases, this issue corrects itself. As the upgrade progresses, the Diego Brain VMs are upgraded. This clears out any stray connections and upgrades the client so that it understands the new event stream.
In some cases, it will take too long to upgrade the Diego Brain VMs and one or more of the VMs will get into a state where it's stuck and the upgrade fails. If this happens, you can unstick the VM after the installation has failed by running `bosh recreate` on it or by powering off the VM, running `bosh cck` and choosing the option to recreate the VM. Once the VM has been recreated, you can click Apply Changes in Ops Manager to restart the upgrade, which should now complete successfully.
The likelihood of this issue occurring is low but depends on a couple of conditions in the environment that's being upgraded. The first condition is the number of crashing applications on the platform. Applications in a crashed state trigger the new event type, which in turn triggers the reconnect. Thus, the more applications that are crashing, the more likely this problem is to occur.
The second condition is the amount of time between when the Diego BBS is upgraded and when the Diego Brains are upgraded. The more time between the two, the more likely you are to see this issue, because there is more time for reconnects to occur and thus for more stale connections to build up and cause problems.
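To see how the two conditions combine, here is a back-of-the-envelope estimate in Python. It assumes, as a simplification, that every crash event seen by an old client leaks exactly one connection. The function name and the sample rates are hypothetical and chosen only for illustration; they are not measurements from a real environment.

```python
def estimate_leaked_connections(crash_events_per_minute, minutes_between_upgrades):
    """Rough upper bound, assuming one leaked connection per crash event."""
    return crash_events_per_minute * minutes_between_upgrades


# Few crashing apps and a short gap between the BBS and Brain upgrades.
quiet_platform = estimate_leaked_connections(2, 30)

# Many crashing apps and a long gap between the two upgrades.
busy_platform = estimate_leaked_connections(20, 90)
```

Under these assumptions, the estimate scales with both the crash rate and the length of the gap, so reducing either one lowers the number of connections that can accumulate.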
Pivotal has taken steps to try to mitigate the cases where either of these conditions would trigger the issue and thus reduce the likelihood of it occurring. However, there may be variations that are unaccounted for, or other scenarios that trigger the same issue, and in those cases you might still see it occur.