Pivotal Knowledge Base

Virtual machines might hang after reboot or Bosh update

Environment

  • PCF 1.9, 1.10, 1.11

Symptom

After executing a BOSH deployment, some virtual machines may appear hung. Users will not be able to log in via SSH or the direct console. This issue is commonly observed during a new deployment of the ha_proxy job from the Pivotal RabbitMQ tile. Users may also observe this hung state after a service restart or reboot of a virtual machine that uses the Metron agent.

Common observable symptoms include:

  • Deployment hangs for several hours but never succeeds. The BOSH agent remains responsive on the impacted job, and the BOSH Director reports the job in a running state.
  • Deployment fails with the following error:
    Error 450002: Timed out sending `get_task'
  • bosh ssh fails with a useradd error:
    ubuntu@pivotal-ops-manager:~$ bosh ssh rabbitmq-haproxy/0
    
    Setting up ssh artifacts
    
    Director task 1193
    Error 450001: Action Failed ssh: Creating user: Shelling out to useradd: Running command: 'useradd -m -b /var/vcap/bosh_ssh -s /bin/bash bosh_60tkmg3s4', stdout: '', stderr: 'useradd: cannot lock /etc/passwd; try again later.
    ': exit status 1 

Cause 

There is a known race condition that causes the rsyslog daemon to hang on start, which blocks any application that attempts to write to /dev/log. For more detailed information, see GitHub issue #1188. Versions 8.21-8.22 are affected by this bug. Users can run the following command to see which version of rsyslog is installed:

~$ rsyslogd -v
rsyslogd 8.22.0, compiled with:
	PLATFORM:				x86_64-pc-linux-gnu
	PLATFORM (lsb_release -d):
	FEATURE_REGEXP:				Yes
	GSSAPI Kerberos 5 support:		No
	FEATURE_DEBUG (debug build, slow code):	No
	32bit Atomic operations supported:	Yes
	64bit Atomic operations supported:	Yes
	memory allocator:			system default
	Runtime Instrumentation (slow code):	No
	uuid support:				Yes
	Number of Bits in RainerScript integers: 64

See http://www.rsyslog.com for more information.
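The version check above can be scripted. The following is a minimal sketch, not an official tool: the function names is_affected_rsyslog and parse_rsyslog_version are hypothetical, and it assumes that every 8.21.x and 8.22.x release falls inside the affected range stated above.

```shell
# Hypothetical helper: succeeds (exit 0) if the version string is 8.21.x or 8.22.x.
# Assumption: all patch releases within 8.21-8.22 are affected.
is_affected_rsyslog() {
  case "$1" in
    8.21|8.21.*|8.22|8.22.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: reads `rsyslogd -v` output on stdin and prints the
# version found on the first line, e.g. "rsyslogd 8.22.0, compiled with:".
parse_rsyslog_version() {
  sed -n '1s/^rsyslogd  *\([0-9][0-9.]*\).*/\1/p'
}

# Only run the check where rsyslogd is actually installed.
if command -v rsyslogd >/dev/null 2>&1; then
  version=$(rsyslogd -v | parse_rsyslog_version)
  if is_affected_rsyslog "$version"; then
    echo "rsyslog $version is in the affected range"
  else
    echo "rsyslog $version is outside the affected range"
  fi
fi
```

Because the hang prevents logins, a check like this is only useful on VMs that are still reachable, for example before a planned reboot.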

Resolution

Workaround

When a VM is stuck in the hung state, a hard reboot of the virtual machine should restore access.

  1. Locate the VM ID (the CID column) by running this command:
    ~$ bosh vms --details
    
    Deployment 'p-rabbitmq-5f35e0b36d7b817ae30d'
    
    Director task 1232
    
    Task 1232 done
    
    +-----------------------------------------------------------+---------+-----+---------+---------------+-----------------------------------------+--------------------------------------+--------------+--------+
    | VM                                                        | State   | AZ  | VM Type | IPs           | CID                                     | Agent ID                             | Resurrection | Ignore |
    +-----------------------------------------------------------+---------+-----+---------+---------------+-----------------------------------------+--------------------------------------+--------------+--------+
    | rabbitmq-haproxy/0 (47828356-55fc-43fd-a98c-645b9e33c66b) | failing | az1 | small   | 10.193.90.104 | vm-6ab1c8c9-8f93-4aeb-8837-cfde43a4d63a | 0016905a-04f9-41bc-afc9-9945f595e755 | paused       | false  |
    +-----------------------------------------------------------+---------+-----+---------+---------------+-----------------------------------------+--------------------------------------+--------------+--------+
    
  2. Power cycle the VM from the IaaS console.
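For step 1, the CID can be pulled out of the table with a short filter. This is a sketch under assumptions: extract_cid is a hypothetical helper name, and it assumes the column order shown in the sample output above (CID as the sixth column). Other BOSH CLI versions may lay the table out differently.

```shell
# Hypothetical helper: reads `bosh vms --details` table output on stdin and
# prints the CID for every row whose VM column starts with the given job name.
# Assumption: columns are ordered as in the sample output (VM first, CID sixth).
extract_cid() {
  awk -F'|' -v job="$1" '
    NF > 6 {
      vm = $2; cid = $7            # $1 is the text before the first "|"
      gsub(/^ +| +$/, "", vm)      # trim padding around the cell values
      gsub(/^ +| +$/, "", cid)
      if (index(vm, job) == 1) print cid
    }'
}

# Example usage (commented out because it needs a targeted BOSH Director):
# bosh vms --details | extract_cid rabbitmq-haproxy/0
```

The printed value is the identifier to look up in the IaaS console before power cycling the VM.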

Attempting the deployment again may succeed. If it does, users can add the following directive to the rsyslog.conf file under /etc and restart rsyslog with the command “service rsyslog restart”. Note that any redeployment of the modified VM will discard this parameter change, and the hang condition could return.

global(processInternalMessages="on")
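One way to apply the directive without adding it twice is sketched below. The function name ensure_rsyslog_directive is hypothetical, the configuration file path is passed in by the caller rather than assumed, and appending the line at the end of the file is itself an assumption about placement.

```shell
# Hypothetical helper: appends the directive to the given config file unless
# an identical line is already present. The caller supplies the file path.
ensure_rsyslog_directive() {
  conf_file="$1"
  directive='global(processInternalMessages="on")'
  if grep -qxF "$directive" "$conf_file"; then
    echo "directive already present in $conf_file"
  else
    printf '%s\n' "$directive" >> "$conf_file"
    echo "directive appended to $conf_file"
  fi
}

# Example usage (commented out; run as root on the affected VM, then restart rsyslog):
# ensure_rsyslog_directive "<path to rsyslog.conf>"
# service rsyslog restart
```

As with the manual edit, the change lasts only until the VM is redeployed.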

Fix

  • This is fixed in Stemcell version 3421.18. The relevant commit can be found here.
  • RabbitMQ Tile version 1.8.14 includes a workaround that prevents this issue from surfacing during deployment of the ha_proxy job.

 

 
