Pivotal Knowledge Base

How to collect kernel crash dump on Pivotal Cloud Foundry Ubuntu VM

Environment

Pivotal Cloud Foundry (PCF) 1.10 and later

Purpose

A user wants to collect kernel crash dumps from PCF Ubuntu instances. 

Cause

The linux-crashdump package must be installed and configured before kernel crash dumps can be collected after a system crash. The swap size must also be set explicitly in the BOSH deployment manifest. Otherwise, the memory reserved by the crashdump module throws off the automatic swap size calculation, in which case the swap partition may not mount and PCF services may not start up.

Procedure

WARNING: This procedure loads a kernel crash dump module into the system, which could cause VMs to crash or become unresponsive. This procedure is not recommended for production environments, and customers should assume any risk involved. Pivotal will not provide support for this procedure.

Follow these steps for each VM on which you want to collect crash dumps:

  1. Log in as "ubuntu" to the Ops Manager VM.
  2. sudo -i
  3. cd /var/tempest/workspaces/default/deployments/
  4. make a backup copy of the manifest file - ex: 'cp cf-81912604505697bd91ff.yml cf-81912604505697bd91ff.yml.bak'
  5. vi <cf manifest file.yml> 
  6. find the section of the manifest for the VM, for example "diego_cell".
  7. add a "swap_size" declaration under "bosh:", which is under "env:", as follows:
     env:
       bosh:
         swap_size: <value>
     <value> is in MB and should be equal to or less than the default swap size.

    Guidelines for setting swap size: https://discuss.pivotal.io/hc/en-us/articles/221625507

  8. exit (to leave root)
  9. bosh login - use email "director" and director credentials from Ops Manager.
  10. Disable resurrection - ex. 'bosh vm resurrection diego_cell/0 off' - this will prevent bosh from recreating the VM in the event that it becomes unresponsive long enough to trigger a resurrection event. If bosh recreates the VM, the crash dump files will be lost.
  11. bosh deploy
  12. bosh ssh; choose the VM to log in to.
  13. sudo -i
  14. apt-get update
  15. apt-get install linux-crashdump
  16. vi /etc/default/kdump-tools
  17. change "USE_KDUMP=0" to "USE_KDUMP=1"; save changes
  18. vi /boot/grub/grub.conf
  19. append to the line starting with "kernel" the following " crashkernel=384M-2G:64M,2G-:128M"
  20. reboot the VM
  21. check to see that the crash dump is enabled
    cat /proc/cmdline (should see the "crashkernel" value that you added to grub.conf)
    cat /proc/sys/kernel/sysrq (value should be > 0)
    ls /var/crash (directory which is configured in /etc/default/kdump-tools for core files should have been created)
  22. test to see if kernel dump occurs on panic:
    echo c > /proc/sysrq-trigger
    This should cause a kernel panic and reboot. 
  23. when the system comes back up, log in again using 'bosh ssh'
  24. sudo -i
  25. ls -l /var/crash - there should be a new directory named for timestamp value, with 2 core files inside.
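
Steps 16 to 17 and the checks in step 21 could be scripted roughly as follows. This is a minimal sketch and not part of the original procedure: the function names are hypothetical, GNU sed (as shipped on Ubuntu) is assumed, and each function takes the file path as an optional argument so that it can be tried against a copy before touching the real file.

```shell
# Hypothetical helpers; these names are illustrative and not part of kdump-tools.

# Step 17: change USE_KDUMP=0 to USE_KDUMP=1 in the given config file
# (default: /etc/default/kdump-tools). A .bak copy is kept next to it.
enable_kdump() {
    conf="${1:-/etc/default/kdump-tools}"
    cp "$conf" "$conf.bak" && sed -i 's/^USE_KDUMP=0/USE_KDUMP=1/' "$conf"
}

# Step 21: succeed if the kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    cmdline="${1:-/proc/cmdline}"
    grep -q 'crashkernel=' "$cmdline"
}

# Step 21: succeed if the sysrq setting is greater than 0.
sysrq_enabled() {
    sysrq="${1:-/proc/sys/kernel/sysrq}"
    [ "$(cat "$sysrq")" -gt 0 ]
}
```

After the reboot in step 20, running 'has_crashkernel && sysrq_enabled && echo ready' as root would print "ready" only if both checks pass.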

To revert:

In Ops Manager, just click "Apply Changes". This will revert the PCF installation to its defaults.

Alternatively, using bosh on the command line:

  1. Log in as "ubuntu" to the Ops Manager VM.
  2. sudo -i
  3. cd /var/tempest/workspaces/default/deployments/
  4. Restore the backup CF manifest over the .yml file, e.g., 'cp cf-81912604505697bd91ff.yml.bak cf-81912604505697bd91ff.yml'
  5. exit (to leave root)
  6. bosh login - use email "director" and director credentials from Ops Manager.
  7. bosh deploy
  8. reactivate resurrection on each VM for which it was paused - ex. 'bosh vm resurrection diego_cell/0 on'
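
The manifest backup made during the procedure and the restore in step 4 above could be wrapped in two small helpers. This is only a sketch: the function names are hypothetical, and the refusal to overwrite an existing backup is an added safeguard that the original steps do not include.

```shell
# Hypothetical helpers for the manifest backup and restore steps.

# Copy <manifest> to <manifest>.bak, refusing to overwrite an earlier backup
# so that the pristine manifest is not replaced by an already edited one.
backup_manifest() {
    manifest="$1"
    if [ -e "$manifest.bak" ]; then
        echo "backup already exists: $manifest.bak" >&2
        return 1
    fi
    cp -p "$manifest" "$manifest.bak"
}

# Copy <manifest>.bak back over <manifest>.
restore_manifest() {
    manifest="$1"
    cp -p "$manifest.bak" "$manifest"
}
```

For example, 'backup_manifest cf-81912604505697bd91ff.yml' before editing and 'restore_manifest cf-81912604505697bd91ff.yml' when reverting, both run as root in the deployments directory.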

Caveats

If you make changes to your PCF setup using Ops Manager, it will overwrite the manifest file and recreate any VMs which were configured to create kernel core dumps. All customized changes will be lost. 

Additional Information

This procedure is based on the Ubuntu procedure: https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html

It adds an extra step: updating grub.conf to include the crashkernel parameter.

Additionally, it is necessary to set swap_size explicitly for BOSH when running this procedure. The bosh-agent calculates the swap partition size automatically from the machine's memory, and enabling crashdump takes memory away from the machine. When memory is removed but the same partition is used, the swap filesystem thinks it has more space than the disk actually provides.
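
To make the memory effect concrete, the sketch below works out how much memory the crashkernel value used in this procedure would reserve for a given amount of RAM. It assumes the range syntax from the kernel's kdump documentation, where each "range:size" pair applies when total RAM falls inside the range (start inclusive, end exclusive), and it takes 2G to mean 2048 MB. The function name is hypothetical, and the thresholds are hard-coded for this one value rather than parsed from it.

```shell
# Hypothetical helper: memory (in MB) reserved for the crash kernel under
#   crashkernel=384M-2G:64M,2G-:128M
# given the total RAM in MB as the first argument.
crashkernel_reserved_mb() {
    ram_mb="$1"
    if [ "$ram_mb" -ge 2048 ]; then
        echo 128    # range "2G-": 2 GB of RAM and above
    elif [ "$ram_mb" -ge 384 ]; then
        echo 64     # range "384M-2G": from 384 MB up to, but not including, 2 GB
    else
        echo 0      # below 384 MB no range matches, so nothing is reserved
    fi
}
```

Under these assumptions, a VM with 16384 MB of RAM would give up 128 MB to the crash kernel, which is the memory the automatic swap calculation no longer sees.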
