Pivotal Knowledge Base

Server becomes unresponsive after rebooting using "shutdown -r" command

Environment

  • DCA 2.0.1.0
  • DCA 2.0.2.0
  • Red Hat Enterprise Linux 6.1 (kernel-2.6.32-131.26.1.el6 and newer)
  • Red Hat Enterprise Linux 6.2 (kernel-2.6.32-220.4.2.el6 and newer)
  • Red Hat Enterprise Linux 6.3 (kernel-2.6.32-279 series)
  • Red Hat Enterprise Linux 6.4 (kernel-2.6.32-358 series)
  • Any Intel® Xeon® E5, Intel® Xeon® E5 v2, or Intel® Xeon® E7 v2 series processor
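
As a rough aid for matching a host against the list above, the following sketch prints the running kernel, CPU family/model, and uptime, and flags a host that is family 6, model 45 with 200 or more days of uptime. The function names and the 200-day threshold argument are illustrative assumptions; the check covers only the model named under Cause, not the E5 v2 or E7 v2 processors, and it does not compare the kernel version against the affected ranges.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any DCA tooling.

# Succeeds when the CPU is family 6, model 45 (arguments: family, model).
is_family6_model45() {
    [ "$1" = "6" ] && [ "$2" = "45" ]
}

# Succeeds when uptime (arg 1, whole seconds) is at least arg 2 days.
uptime_at_least_days() {
    [ "$1" -ge $(( $2 * 86400 )) ]
}

# Collect values from the running host; skipped when /proc is unavailable.
if [ -r /proc/cpuinfo ] && [ -r /proc/uptime ]; then
    family=$(awk -F': *' '/^cpu family/ {print $2; exit}' /proc/cpuinfo)
    model=$(awk -F': *' '/^model[ \t]*:/ {print $2; exit}' /proc/cpuinfo)
    up=$(cut -d. -f1 /proc/uptime)
    echo "kernel=$(uname -r) cpu_family=$family model=$model uptime_days=$(( up / 86400 ))"
    if is_family6_model45 "$family" "$model" && uptime_at_least_days "$up" 200; then
        echo "Host matches the CPU and uptime conditions; prefer a cold reboot."
    fi
fi
```

Compare the printed kernel version against the Environment list by hand before deciding how to reboot.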

Problem

-- After a warm reboot ("shutdown -r" or "reboot"), a server that had been running for a long time (200+ days) becomes unresponsive, hangs, or incurs a kernel panic.

-- The server responds to ping requests but does not respond to other connection requests, such as ssh.

-- /var/log/messages shows stack traces similar to the one below:

INFO: task bash:12543 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D 0000000000000012 0 12543 12542 0x00000084
ffff880c343b3ce8 0000000000000082 ffff880c343b3d98 ffffffffffffffe9
ffff880c343b3c88 ffffffffa00c9129 ffff880c343f4aa0 0000010100000015
ffff880c343f5058 ffff880c343b3fd8 000000000000fb88 ffff880c343f5058
Call Trace:
[<ffffffffa00c9129>] ? ext4_check_acl+0x29/0x90 [ext4]
[<ffffffffa008fbf0>] ? ext4_file_open+0x0/0x130 [ext4]
[<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0
[<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70
[<ffffffff81277379>] ? cpumask_next_and+0x29/0x50
[<ffffffff8150e683>] wait_for_common+0x123/0x180
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffff8150e79d>] wait_for_completion+0x1d/0x20
[<ffffffff8106513c>] sched_exec+0xdc/0xe0
[<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
[<ffffffff810095ea>] sys_execve+0x4a/0x80
[<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
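
To check whether a log contains reports like the one above, a minimal sketch such as the following can count the hung-task messages and list the tasks involved. The function names are hypothetical, and the pattern assumes the message wording shown in the trace above (including the 120-second timeout).

```shell
# Sketch only: function names are illustrative.

# Print the number of hung-task reports in a log file
# (default: /var/log/messages).
count_hung_tasks() {
    grep -c 'blocked for more than 120 seconds' "${1:-/var/log/messages}"
}

# Print the distinct "task pid" pairs named in hung-task reports.
list_hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\):\([0-9]*\) blocked for more than.*/\1 \2/p' \
        "${1:-/var/log/messages}" | sort -u
}
```

For example, running count_hung_tasks /var/log/messages as root on the affected node prints the number of matching lines.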

Cause

On Intel® Xeon® Processor E5 Family 6 Model 45 (also known as SandyBridge), the Time Stamp Counter (TSC) is not cleared by a warm reset. This is documented in the Intel® Xeon® Processor E5 Family Specification Update as erratum BT81.

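The symptom appears when this processor erratum is combined with the RHEL 6 kernel fix "[sched] x86: Avoid unnecessary overflow in sched_clock (...) [765720]". Kernels that include the fix are affected after a warm reboot, because the TSC still holds the value accumulated during the previous uptime.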
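
As a rough sanity check on the "200+ days" figure, the sketch below converts 2^54 nanoseconds into days. This rests on an assumption the article does not state: that the relevant limit is the range of a 64-bit cycles-to-nanoseconds conversion that uses a 10-bit scale factor (64 - 10 = 54 bits of nanoseconds). The function name is hypothetical.

```shell
# Sketch only: assumes a 2^bits nanosecond limit; not taken from this article.

# Print how many whole days fit into 2^bits nanoseconds (arg 1 = bits).
ns_limit_days() {
    echo $(( (1 << $1) / 1000000000 / 86400 ))
}

ns_limit_days 54
```

The result is a little over 208 days, which is consistent with the uptime at which the problem is reported.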

Workaround

If a server must be rebooted, perform a cold reboot (hard reset) instead of running "shutdown -r" or "reboot":

# ipmitool -I lan -H <node>-sp -U root -P sephiroth chassis power off

Wait about 30 seconds, then power the node back on:

# ipmitool -I lan -H <node>-sp -U root -P sephiroth chassis power on
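
The power-off, wait, power-on sequence above can be wrapped in a small script. This is a minimal sketch, not DCA tooling: the function names and the IPMI_USER, IPMI_PASS, and DRY_RUN variables are assumptions introduced for illustration, while the ipmitool options are the ones used in the commands above.

```shell
#!/bin/sh
# Sketch only: names and environment variables here are hypothetical.

# Run ipmitool against <node>-sp. With DRY_RUN=1 the command is only
# printed, with the password masked.
ipmi() {
    ipmi_node=$1; shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ipmitool -I lan -H ${ipmi_node}-sp -U ${IPMI_USER:-root} -P <hidden> $*"
    else
        ipmitool -I lan -H "${ipmi_node}-sp" -U "${IPMI_USER:-root}" \
            -P "${IPMI_PASS:-}" "$@"
    fi
}

# Power the node off, wait (default 30 seconds), then power it on.
# Arguments: node name, optional wait in seconds.
cold_power_cycle() {
    cycle_node=$1
    wait_secs=${2:-30}
    ipmi "$cycle_node" chassis power off || return 1
    sleep "$wait_secs"
    ipmi "$cycle_node" chassis power on
}
```

With DRY_RUN=1 set, cold_power_cycle <node> prints the two commands without contacting the service processor, which is a cheap way to review them before a real run.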

To work around the issue proactively, that is, when you have determined that a server node is susceptible to the bug, shut down the OS gracefully with the "shutdown -h now" command before power cycling, rather than powering the server off hard while the OS is still running.

Log in to the OS and run:

# shutdown -h now

Once the OS has shut down, run the ipmitool commands from above:

# ipmitool -I lan -H <node>-sp -U root -P sephiroth chassis power off

Wait about 30 seconds, then power the node back on:

# ipmitool -I lan -H <node>-sp -U root -P sephiroth chassis power on
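
One way to tell when the OS has finished shutting down is to poll the chassis power state through the service processor. The sketch below assumes that "shutdown -h now" leaves the chassis powered off and that ipmitool reports this as "Chassis Power is off"; if the node only halts and stays powered, the poll times out and the explicit power off command above still applies. The function names are hypothetical.

```shell
# Sketch only: helper names are illustrative.

# Succeeds when the given "chassis power status" output reports power off.
status_is_off() {
    case "$1" in
        *"Chassis Power is off"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll a status command until it reports power off or the tries run out.
# Arguments: max tries, seconds between tries, then the status command.
wait_for_power_off() {
    tries=$1; pause=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        if status_is_off "$("$@" 2>/dev/null)"; then
            return 0
        fi
        tries=$(( tries - 1 ))
        sleep "$pause"
    done
    return 1
}
```

For example, wait_for_power_off 30 10 ipmitool -I lan -H <node>-sp -U root -P <password> chassis power status checks every 10 seconds for up to 5 minutes.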

Solution

Upgrade the DCA to version 2.0.3.0 or later, which ships a kernel that is not affected by this issue.

For more detail, refer to the Red Hat KB article. Internal bug reference: DCA-7727
