Pivotal Cloud Foundry® 1.10 and above
Pivotal Cloud Foundry Runtime for Windows 1.12
Windows Diego cells can intermittently become unresponsive or enter a failing state, and the state of the cells is inconsistent from one cell to another.
This issue can appear intermittently and on only a subset of Windows cells in Runtime for Windows tile deployments, but it has the potential to disrupt every Windows cell in a deployment. When that happens, applications can no longer route traffic during and after a PCF upgrade (or another triggering event).
Applications hosted on Windows cells become unresponsive and do not recover during PCF upgrades or other loss-of-network-connectivity events because 127.0.0.1 (Consul) is dropped from the DNS resolvers list on the Windows hosts. When Cloud Foundry jobs cannot contact the Consul DNS, they cannot resolve
When the BOSH Director becomes unavailable or loses its connection with Windows VMs (stemcell 1200.6 or earlier), such as during a PCF upgrade, the BOSH Agent on BOSH-deployed Windows VMs (including Windows Diego cells) restarts. This is expected behavior, and it continues for as long as the agent cannot contact a Director. The BOSH Agent also backs off its restart timing exponentially, up to a maximum interval of 5 minutes between restarts, to minimize CPU load on the cell.
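The restart timing described above can be illustrated with a short Python sketch of capped exponential backoff. Only the 5-minute ceiling comes from this article; the initial interval of 1 second and the doubling factor are assumptions chosen for illustration, not values taken from the BOSH Agent.

```python
from itertools import islice

MAX_INTERVAL_SECONDS = 5 * 60  # 5-minute ceiling stated in this article


def restart_intervals(initial_seconds=1, factor=2, cap_seconds=MAX_INTERVAL_SECONDS):
    """Yield the wait before each successive restart, growing until the cap.

    initial_seconds and factor are illustrative assumptions, not BOSH Agent values.
    """
    interval = initial_seconds
    while True:
        yield min(interval, cap_seconds)
        # Grow the interval, but never let it exceed the cap.
        interval = min(interval * factor, cap_seconds)


if __name__ == "__main__":
    # Show the first few waits; once the cap is reached it repeats indefinitely.
    print(list(islice(restart_intervals(), 12)))
```

Under these assumed parameters the wait stops growing after a handful of restarts, so a long Director outage results in roughly one restart every 5 minutes.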
During this repeated-restart scenario, the BOSH Agent erroneously overwrote all DNS resolver entries in the OS with the list of cloud config resolvers, removing the necessary 127.0.0.1 value that Consul inserts during the Consul job's pre-start process. The BOSH Agent does not run this pre-start process again when it restarts, but the core issue lies in how the BOSH Agent overwrites the DNS resolver entries.
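The effect of the overwrite can be modeled in a few lines of Python. This is a simplified sketch, not the BOSH Agent's code: the function names and the 10.0.0.x resolver addresses are hypothetical, and only the 127.0.0.1 entry comes from this article.

```python
CONSUL_RESOLVER = "127.0.0.1"  # entry added by the Consul job's pre-start


def overwrite_resolvers(current, cloud_config):
    """Model the faulty behavior: the OS list is replaced wholesale.

    The current list is ignored entirely, so anything not present in the
    cloud config (including the Consul entry) is lost.
    """
    return list(cloud_config)


def consul_resolver_missing(resolvers):
    """Return True when the Consul resolver is absent from the list."""
    return CONSUL_RESOLVER not in resolvers


if __name__ == "__main__":
    # Hypothetical resolver lists for illustration only.
    after_pre_start = ["127.0.0.1", "10.0.0.2", "10.0.0.3"]
    cloud_config = ["10.0.0.2", "10.0.0.3"]

    after_restart = overwrite_resolvers(after_pre_start, cloud_config)
    print("missing after pre-start:", consul_resolver_missing(after_pre_start))
    print("missing after agent restart:", consul_resolver_missing(after_restart))
```

Because pre-start is not run again after an agent restart, nothing in this model restores the Consul entry once it has been overwritten.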
Because any loss of connectivity can trigger this issue, it can be caused not only by PCF upgrades but also by network events (such as router replacements), Director failures, increased ESX load, and possibly other events.
If you're unable to do that at this time, you can perform the following steps as a temporary workaround.
For stemcell versions 1200.6 and below, the IP address 127.0.0.1 can be added manually to the DNS resolvers to fix the issue. However, a later BOSH Agent restart will remove this IP address again, causing the same issue described above.
- Connect to each Windows cell either via your IaaS virtual console or via RDP.
- Edit the DNS configuration by navigating to Control Panel > Network and Internet > Network and Sharing Center.
- Under network connections, choose Ethernet.
- Choose Properties > TCP/IPv4 > Properties.
Note that 127.0.0.1 does not appear in the DNS list under the section “Use the following DNS server addresses”. This is the cause of the issue.
- Click on Advanced.
- In the Advanced TCP/IP Settings pane, click the DNS tab, then click Add.
- Enter 127.0.0.1 in the dialog box and click Add.
- Use the up/down arrows to move 127.0.0.1 into the first position.
- Click OK to close Advanced TCP/IP Settings.
- Click OK to close TCP/IPv4 Properties.
- Close Ethernet Properties to persist changes.
The fix has been applied. The BOSH jobs should eventually become healthy, after which apps can serve traffic and new apps can be cf pushed to the cells.
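The outcome of the manual steps above, with 127.0.0.1 present exactly once and in first position, can be expressed as a small Python sketch. The function name and the 10.0.0.x addresses are hypothetical; the sketch only computes the desired resolver order and does not change any Windows settings.

```python
def with_consul_first(resolvers, consul="127.0.0.1"):
    """Return a new list with the Consul resolver in first position.

    Any existing copies of the Consul entry are dropped so that it appears
    exactly once; the relative order of the other resolvers is preserved.
    """
    others = [address for address in resolvers if address != consul]
    return [consul] + others


if __name__ == "__main__":
    # Hypothetical resolver list as it might look after the agent overwrite.
    print(with_consul_first(["10.0.0.2", "10.0.0.3"]))
    # A list where the entry exists but is not first is simply reordered.
    print(with_consul_first(["10.0.0.2", "127.0.0.1", "10.0.0.3"]))
```

Comparing a cell's current resolver list with the result of this function is one way to tell whether the workaround is still in place or has been removed by a later BOSH Agent restart.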
For more information on how to connect to a Windows cell via RDP, please see the article "How to generate a randomized Administrator Password".