Pivotal Knowledge Base

DCA V2 : kernel: [Hardware Error]: Machine check events logged

Goal

Understand what the following error message means and how to troubleshoot it.

xx xx xx:xx:xx xxxx kernel: [Hardware Error]: Machine check events logged 

- error logged in /var/log/messages


Environment

DCA V2
Red Hat Enterprise Linux 6.x
kernel-2.6.32-220.17.1.el6.x86_64
mcelog-1.0pre3_20110718-0.7.el6.x86_64

Solution

This message is harmless in the customer's hardware environment.
The customer monitors /var/log/messages, and the above message is picked up by that monitoring.

Because the message is harmless, the customer will ignore it and check /var/log/mcelog instead.

The customer would like to know whether detailed information is always recorded in /var/log/mcelog whenever the above message is logged in /var/log/messages.
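If monitoring moves from /var/log/messages to /var/log/mcelog, a small filter can separate corrected from uncorrected records. The following is a minimal sketch, not a supported tool. It assumes that each record starts with the "Hardware event. This is not a software error." line and carries a "TIME <epoch>" line and a "Corrected error" or "Uncorrected error" line, as in the samples in this article. The function names (split_records, summarize, scan) are hypothetical.

```python
import re

# Assumption: every mcelog record begins with this exact line.
RECORD_START = "Hardware event. This is not a software error."


def split_records(text):
    """Split mcelog text into records (lists of lines)."""
    records, current = [], None
    for line in text.splitlines():
        if line.strip() == RECORD_START:
            if current is not None:
                records.append(current)
            current = [line]
        elif current is not None:
            # Lines before the first record start are ignored.
            current.append(line)
    if current is not None:
        records.append(current)
    return records


def summarize(record):
    """Return (epoch_seconds, severity) for one record."""
    epoch, severity = None, "unknown"
    for line in record:
        stripped = line.strip()
        match = re.match(r"TIME (\d+)", stripped)
        if match:
            epoch = int(match.group(1))
        if stripped == "Uncorrected error":
            severity = "uncorrected"
        elif stripped == "Corrected error":
            severity = "corrected"
    return epoch, severity


def scan(path="/var/log/mcelog"):
    """Summarize every record found in the given mcelog file."""
    with open(path) as handle:
        return [summarize(r) for r in split_records(handle.read())]
```

Calling scan() returns one (epoch, severity) pair per record. The epoch values can then be compared by hand with the timestamps of the "Machine check events logged" lines in /var/log/messages.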

Troubleshooting

/var/log/mcelog

mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 12
MISC 4937e01c086 ADDR 17a142ba40
TIME 1431237188 Sun May 10 14:53:08 2015
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Eviction Error
STATUS 8c2000400007017a MCGSTATUS 0
MCGCAP 1000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45

In the above case the customer has used non-standard DIMMs in the cluster. The message can be ignored because the customer is aware of the issue.
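The flag lines in the record above (Corrected error, MCi_MISC register valid, MCi_ADDR register valid) correspond to the upper bits of the STATUS value. The sketch below decodes those bits using the IA32_MCi_STATUS layout from the Intel machine-check architecture documentation (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57). The function name decode_status is hypothetical and the sketch is not part of mcelog.

```python
# Bit positions in IA32_MCi_STATUS (Intel machine-check architecture).
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_status(status_hex):
    """Decode the flag bits and MCA error code of a STATUS value."""
    value = int(status_hex, 16)
    flags = [name for bit, name in STATUS_FLAGS if (value >> bit) & 1]
    return {
        "flags": flags,
        "uncorrected": "UC" in flags,
        "mca_code": value & 0xFFFF,  # low 16 bits hold the MCA error code
    }


# STATUS value from the record above.
print(decode_status("8c2000400007017a"))
```

For the value above, only VAL, MISCV and ADDRV are set, which agrees with the "Corrected error" line that mcelog printed.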

Resolution

  1. The "failed to prefill DIMM database from DMI data" line is a harmless warning. The DIMM database prefill relies on a specific non-standard format of the DIMM entries in the DMI BIOS tables. If the BIOS does not use this format, mcelog only discovers DIMMs as they report their first error (if the CPU reports DIMMs in machine check errors).
  2. This applies to mcelog running on Intel servers.
  3. mcelog gets the (socketid, channel, DIMM) information from the CPU and tries to translate it into a motherboard silkscreen label using SMBIOS. The label is then logged in the log file and in the in-memory accounting database.
  4. SMBIOS has no official, working way to do that translation, but on a Supermicro test system it was possible by matching the non-standard identifier. That is what mcelog is trying to do.

Note

Some cases where the message indicates a real problem:
1. DIMM failure
Below is an example of a DIMM failure reported in mcelog:

Hardware event. This is not a software error.
MCE 0
not finished?
CPU 7 BANK 5 TSC a4029f5662482
RIP !INEXACT! 10:ffffffff812ea691
MISC 20405ede86 ADDR 200ead3d80
TIME 1431734326 Fri May 15 19:58:46 2015
MCG status:RIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS fe000f0000010092 MCGSTATUS 5
MCGCAP 1000c14 APICID e SOCKETID 0
CPUID Vendor Intel Family 6 Model 45

2. Non-standard DIMMs used in the DCA
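For the DIMM failure record shown under item 1, the "MCG status:RIPV MCIP" and "MEMORY CONTROLLER RD_CHANNEL2_ERR" lines can be reproduced from the MCGSTATUS and STATUS values. The sketch below assumes the layouts in the Intel machine-check architecture documentation: MCGSTATUS bits 0 to 2 are RIPV, EIPV and MCIP, and memory controller error codes have the form 000F 0000 1MMM CCCC (MMM is the operation, CCCC is the channel). The function names are hypothetical.

```python
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]

# MMM field of a memory controller error code.
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mcgstatus(hex_value):
    """Return the names of the flags set in an MCGSTATUS value."""
    value = int(hex_value, 16)
    return [name for bit, name in MCG_FLAGS if (value >> bit) & 1]


def decode_memory_controller_code(status_hex):
    """Return (operation, channel) or None if not a memory controller code."""
    code = int(status_hex, 16) & 0xFFFF
    # Ignore bit 12 (the F bit) and require the 0000 0000 1xxx xxxx pattern.
    if code & 0xEF80 != 0x0080:
        return None
    operation = MEM_OPS.get((code >> 4) & 0x7, "reserved")
    channel = code & 0xF
    # A channel field of 1111 means the channel is not specified.
    return operation, (None if channel == 0xF else channel)


# Values from the DIMM failure record.
print(decode_mcgstatus("5"))
print(decode_memory_controller_code("fe000f0000010092"))
```

The same function returns None for the STATUS value of the harmless cache eviction record, because its error code does not match the memory controller pattern.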
