Pivotal Knowledge Base

Bosh Returns "Failed to Reserve IP" when Ops Manager is Configured with 3 AZs

Environment

  • Operations Manager 1.x, 2.0
  • Environments with 3 Availability Zones

Symptom

Bosh::Director::NetworkReservationNotEnoughCapacity: Failed to reserve IP for 'diego_cell/98cf716a-8e47-4d5d-8c8e-005b8572629c (10)' for manual network 'deployment-network': no more available
ERROR -- DirectorJobRunner: Failed to reserve IP for 'diego_cell/98cf716a-8e47-4d5d-8c8e-005b8572629c (10)' for manual network 'deployment-network': no more available

Cause

In this case there are enough available IPs in the configured CIDR; however, Operations Manager and the Bosh Director disagree on how instances are distributed across the defined availability zones.

RCA

NOTE: This description oversimplifies the problem to help explain why Operations Manager updates the cloud config to reserve most of the IP addresses, and to illustrate how the Bosh Director and Operations Manager can come to disagree about how many IP addresses are needed to satisfy the requested changes.

When Operations Manager prepares a deployment, it determines how many IP addresses are required from each AZ to complete the deployment. For example, with 3 AZs and 10 instances to deploy, Operations Manager could place the instances as follows.

 AZ1  AZ2  AZ3
 3    3    4

In this example Operations Manager will decide that AZ1 needs "192.168.1.4-192.168.1.255" reserved, allowing Bosh to use 192.168.1.1, 192.168.1.2, and 192.168.1.3 for the 3 VM instance assignments. The same is true for AZ2 and AZ3. Operations Manager could add the following reservations (see the cloud config sketch after the table).

 AZ   Range           Reservation
 AZ1  192.168.1.0/24  192.168.1.4-192.168.1.255
 AZ2  172.168.1.0/24  172.168.1.4-172.168.1.255
 AZ3  110.168.1.0/24  110.168.1.5-110.168.1.255
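
In the Bosh cloud config, these reservations appear as "reserved" ranges on each AZ's subnet of the deployment network. The following is a simplified sketch of what that could look like for the example above; fields such as cloud_properties, dns, and gateway are omitted here, and the full shape of a subnet entry is shown in the workaround below.

    networks:
    - name: deployment-network
      subnets:
      - azs:
        - AZ1
        range: 192.168.1.0/24
        reserved:
        - 192.168.1.4-192.168.1.255   # leaves .1-.3 for the 3 AZ1 instances
      - azs:
        - AZ2
        range: 172.168.1.0/24
        reserved:
        - 172.168.1.4-172.168.1.255
      - azs:
        - AZ3
        range: 110.168.1.0/24
        reserved:
        - 110.168.1.5-110.168.1.255   # leaves .1-.4 for the 4 AZ3 instances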

Now let's assume the operator scales the instance count up from 10 to 11. Operations Manager determines that the new instance will be placed in AZ1 and creates the cloud config as follows. Notice that only one additional IP, "192.168.1.4", is freed in AZ1, because Operations Manager expects Bosh to place the new instance there.

 AZ   Range           Reservation
 AZ1  192.168.1.0/24  192.168.1.5-192.168.1.255
 AZ2  172.168.1.0/24  172.168.1.4-172.168.1.255
 AZ3  110.168.1.0/24  110.168.1.5-110.168.1.255

During deployment, however, the Director decides to place the new instance in AZ2 instead of AZ1 and finds that there are no free IPs available there, resulting in the error condition observed.
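
When this error occurs, it can help to compare the IPs already in use in each AZ against the reserved ranges the Director currently holds. A sketch using the v2 CLI; "cf-xxx" is the placeholder deployment name used in the workaround below.

    # List VMs with their AZ and IP assignments to see which addresses are in use
    bosh -e env -d cf-xxx vms

    # Show the reserved ranges the Director currently has for each subnet
    bosh -e env cloud-config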

This problem most commonly surfaces during an upgrade, a scale up/down event, or the deployment of a new tile that uses a network shared with another tile.

In upgrade cases where no scaling changes were made, there can still be hidden instance changes between tile versions. Errands are one example: they can change between tile versions, and Bosh treats errands as instances that also need an IP reservation. These slight changes in the deployment can result in the disagreement described above between Operations Manager and the Bosh Director.
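
To see which errands a deployment currently defines, and therefore which additional instances may need IP reservations during an upgrade, the v2 CLI can list them. A sketch; "cf-xxx" is again a placeholder deployment name.

    # List the errands defined by the deployment
    bosh -e env -d cf-xxx errands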

Resolution

Operations Manager 2.1 no longer performs IP address management; instead, it lets Bosh decide how to assign IP addresses to VM instances, so upgrading to 2.1 or later resolves the issue.

Until an upgrade is possible, the following workarounds can be used.

Scaling Instance Counts

In small deployment scenarios, adjust the instance counts so that they are divisible by 3. This ensures that Operations Manager reserves enough IP addresses in all three AZs and Bosh does not run out. This may or may not work when two tiles share the same network; in that case the total number of instances across both tiles needs to be divisible by 3.
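
As a sanity check before adjusting counts, the instances of each deployment sharing the network can be listed with the v2 CLI. A sketch; both deployment names below are placeholders.

    # List the instances of each deployment that uses the shared network
    bosh -e env -d cf-xxx instances
    bosh -e env -d other-tile-yyy instances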

Manually Free up IPs

In more complex cases like the one described in the RCA section, we can work around the problem by fetching the cloud config from the Director and manually freeing up IPs in the AZs. In our simple example we will be very conservative and free up only one additional IP in AZ2. In larger environments it might make sense to free up 10 or more IPs in each AZ, depending on how many changes are in flight.

Note: The following procedure assumes Bosh CLI v2.
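
To confirm which CLI is installed, check its version:

    bosh --version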

  • Fetch the cloud config from the Bosh Director
    bosh -e env cloud-config > /tmp/cloud-config.yml
  • Locate the AZ2 subnet in the deployment network
    networks:
    - name: deployment-network
      subnets:
      - azs:
        - AZ2
        cloud_properties:
          subnet: subnet-2228c111
        dns:
        - 10.10.10.10
        gateway: 10.34.192.1
        range: 172.168.1.0/24
        reserved:
        - 172.168.1.4-172.168.1.255
        static: []
  • Modify the reserved range to free up one IP; in this case, change "172.168.1.4" to "172.168.1.5"
    networks:
    - name: deployment-network
      subnets:
      - azs:
        - AZ2
        cloud_properties:
          subnet: subnet-2228c111
        dns:
        - 10.10.10.10
        gateway: 10.34.192.1
        range: 172.168.1.0/24
        reserved:
        - 172.168.1.5-172.168.1.255
        static: []
  • Upload the changes to the Bosh Director
    bosh -e env update-cloud-config /tmp/cloud-config.yml
  • Manually run the deployment. Note that the deployment name "cf-xxx" can be found by running "bosh -e env deployments"
    bosh -e env -d cf-xxx deploy /var/tempest/workspaces/default/deployments/cf-xxx.yml 
  • When the deployment completes, run "Apply Changes" from Operations Manager.

Note: At the final step where you apply changes, it is important to understand that some network allocations might still fall out of sync between Operations Manager and Bosh. If this happens again, we may need to review the problem in more detail and re-run the workaround with additional cloud config changes.
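
One way to check for such drift after "Apply Changes" is to re-download the cloud config and compare it against the copy modified earlier; the file paths below are the ones used in the example above.

    # Fetch the cloud config again after "Apply Changes" and diff it
    bosh -e env cloud-config > /tmp/cloud-config-after.yml
    diff /tmp/cloud-config.yml /tmp/cloud-config-after.yml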
