Pivotal Knowledge Base

Virtual Disk (VD) Reports Critical Status, but all Physical Disks (PD) Report OK Status

Environment 

Product Version
DCA V1

Purpose

Virtual Disk (VD) reports a Critical status.

Name           | Status  | State| Encrypted| Layout| Media| Read Policy        | Write Policy| Stripe Element Size
Virtual Disk 0 | Critical| Ready| No       | RAID-5| HDD  | Adaptive Read Ahead| Write Back  | 128 KB
Virtual Disk 1 | Ok      | Ready| No       | RAID-5| HDD  | Adaptive Read Ahead| Write Back  | 128 KB
Virtual Disk 2 | Ok      | Ready| No       | RAID-5| HDD  | Adaptive Read Ahead| Write Back  | 128 KB
Virtual Disk 3 | Ok      | Ready| No       | RAID-5| HDD  | Adaptive Read Ahead| Write Back  | 128 KB

Although the VD reports a Critical status, all physical disks (PDs) report an Ok status and an Online state.

ID    | Status      | Name                | State | Power Status| Failure Predicted| Capacity                      | Hot Spare| Vendor ID
0:0:0 | Ok          | Physical Disk 0:0:0 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:1 | Ok          | Physical Disk 0:0:1 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:2 | Ok          | Physical Disk 0:0:2 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:3 | Ok          | Physical Disk 0:0:3 | Online| Spun Up     | Yes              | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:4 | Ok          | Physical Disk 0:0:4 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:5 | Ok          | Physical Disk 0:0:5 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:6 | Ok          | Physical Disk 0:0:6 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:7 | Ok          | Physical Disk 0:0:7 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:8 | Ok          | Physical Disk 0:0:8 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:9 | Ok          | Physical Disk 0:0:9 | Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:10| Ok          | Physical Disk 0:0:10| Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
0:0:11| Ok          | Physical Disk 0:0:11| Online| Spun Up     | No               | 558.38 GB (599550590976 bytes)| No       | DELL(tm)
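The symptom shown in the two tables above (a Critical VD while every PD is Ok and Online) can be checked mechanically. The following is a minimal Python sketch, assuming the pipe-delimited layout used in this article; the function names and the shortened sample tables are illustrative only and are not part of any Dell or Pivotal tool.

```python
def parse_pipe_table(text):
    """Parse a pipe-delimited report table into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    headers = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        rows.append(dict(zip(headers, cells)))
    return rows


def critical_vds_with_healthy_pds(vd_text, pd_text):
    """Return names of Critical VDs, but only if every PD is Ok and Online."""
    vds = parse_pipe_table(vd_text)
    pds = parse_pipe_table(pd_text)
    all_pds_ok = all(p["Status"] == "Ok" and p["State"] == "Online" for p in pds)
    if not all_pds_ok:
        return []  # a failed PD explains the VD status; not the case in this article
    return [v["Name"] for v in vds if v["Status"] == "Critical"]


# Shortened samples in the same layout as the tables above (assumed format).
VD_SAMPLE = """\
Name           | Status  | State
Virtual Disk 0 | Critical| Ready
Virtual Disk 1 | Ok      | Ready
"""

PD_SAMPLE = """\
ID    | Status | Name                | State
0:0:0 | Ok     | Physical Disk 0:0:0 | Online
0:0:1 | Ok     | Physical Disk 0:0:1 | Online
"""

print(critical_vds_with_healthy_pds(VD_SAMPLE, PD_SAMPLE))  # prints ['Virtual Disk 0']
```

A non-empty result indicates the situation this article covers; the controller logs should then be checked for puncture or uncorrectable medium error events.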

In addition, SCSI medium errors are reported in the /var/log/messages file.

Cause

The Virtual Disk status is set to "Critical" while the Physical Disks remain "Ok" when one of the following events appears in the controller log files:

  • Event Description: Puncturing bad block on PD
  • Event Description: Background Initialization detected uncorrectable multiple medium errors
  • Event Description: Patrol Read found an uncorrectable medium error on PD
  • Event Description: Consistency Check detected uncorrectable multiple medium errors
  • Event Description: Double media errors found!

A typical snippet from LSI logs will look similar to the following:

01/04/14 23:40:30: EVT#25896-01/04/14 23:40:30: 97=Puncturing bad block on PD 03(e0x20/s3) at 6cacde5
01/04/14 23:40:42: EVT#25900-01/04/14 23:40:42: 97=Puncturing bad block on PD 04(e0x20/s4) at 6cad505
01/04/14 23:40:43: Unrecoverable medium error during delayed write : Puncturing the parity arm P : 1 at LBA : 6cad505
01/04/14 23:40:43: EVT#25901-01/04/14 23:40:43: 97=Puncturing bad block on PD 01(e0x20/s1) at 6cad505

Another example from LSI logs will look similar to the following:

71=Uncorrectable medium error logged for VD 01/1 at 1bf5e605 (on PD 04(e0x20/s4) at 6cad505)
01/04/14 23:41:38: EVT#25932-01/04/14 23:41:38: 271=Uncorrectable medium error logged for VD 01/1 at 1bf6a0ee (on PD 04(e0x20/s4) at 6cafaee)
01/04/14 23:41:40: EVT#25937-01/04/14 23:41:40: 271=Uncorrectable medium error logged for VD 01/1 at 1bf6a0ef (on PD 04(e0x20/s4) at 6cafaef)
01/04/14 23:41:42: EVT#25942-01/04/14 23:41:42: 271=Uncorrectable medium error logged for VD 01/1 at 1bf6a0f1 (on PD 04(e0x20/s4) at 6cafaf1)
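To see which physical disks are affected, the log excerpts above can be summarized per PD. The following Python sketch assumes log lines shaped like the examples in this article; the marker strings are taken from the event descriptions listed earlier, and the regular expression for the PD field is an assumption based only on these samples. Firmware versions may word events differently.

```python
import re
from collections import Counter

# Lowercased fragments of the event descriptions listed in the Cause section.
MARKERS = (
    "puncturing bad block",
    "uncorrectable medium error",
    "uncorrectable multiple medium errors",
    "double media errors found",
)

# Matches fields such as "PD 04(e0x20/s4)" as seen in the sample lines (assumed format).
PD_RE = re.compile(r"PD\s+[0-9a-fA-F]+\([^)]*\)")


def count_events_by_pd(log_text):
    """Count puncture / uncorrectable-error events per physical disk."""
    counts = Counter()
    for line in log_text.splitlines():
        lowered = line.lower()
        if not any(marker in lowered for marker in MARKERS):
            continue
        match = PD_RE.search(line)
        counts[match.group(0) if match else "unknown PD"] += 1
    return counts


SAMPLE_LOG = """\
01/04/14 23:40:42: EVT#25900-01/04/14 23:40:42: 97=Puncturing bad block on PD 04(e0x20/s4) at 6cad505
01/04/14 23:40:43: EVT#25901-01/04/14 23:40:43: 97=Puncturing bad block on PD 01(e0x20/s1) at 6cad505
01/04/14 23:41:38: EVT#25932-01/04/14 23:41:38: 271=Uncorrectable medium error logged for VD 01/1 at 1bf6a0ee (on PD 04(e0x20/s4) at 6cafaee)
"""

for pd, n in sorted(count_events_by_pd(SAMPLE_LOG).items()):
    print(pd, n)
```

Events on more than one PD in the same RAID group are consistent with the multiple-drive media error scenario described under Additional Information.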

A snippet of the 'omreport system alertlog': 

--
Severity : Critical
ID : 2273
Date and Time : Tue Jun 24 13:01:53 2014
Category : Storage Service
Description : A block on the physical disk has been punctured by the controller: Physical Disk 0:0:1 Controller 0, Connector 0
--
Severity : Critical
ID : 2273
Date and Time : Tue Jun 24 13:01:53 2014
Category : Storage Service
Description : A block on the physical disk has been punctured by the controller: Physical Disk 0:0:1 Controller 0, Connector 0
--
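The alert log excerpt above can be filtered for puncture alerts in a similar way. This Python sketch assumes the "Key : Value" record layout and the "--" separators shown in the snippet, and uses alert ID 2273 only because that is the ID shown in this article; the function names are hypothetical.

```python
import re

DISK_RE = re.compile(r"Physical Disk (\d+:\d+:\d+)")


def parse_alertlog(text):
    """Split alert log text into records (dicts), one per alert."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "--" or not line:  # record separator (assumed: "--" or blank line)
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(" : ")  # split on the first " : " only
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def punctured_disks(records, alert_id="2273"):
    """Return the set of physical disk IDs named in alerts with the given ID."""
    disks = set()
    for rec in records:
        if rec.get("ID") != alert_id:
            continue
        match = DISK_RE.search(rec.get("Description", ""))
        if match:
            disks.add(match.group(1))
    return disks


SAMPLE_ALERTS = """\
--
Severity : Critical
ID : 2273
Date and Time : Tue Jun 24 13:01:53 2014
Category : Storage Service
Description : A block on the physical disk has been punctured by the controller: Physical Disk 0:0:1 Controller 0, Connector 0
--
"""

records = parse_alertlog(SAMPLE_ALERTS)
print(punctured_disks(records))  # prints {'0:0:1'}
```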

Resolution

To recover from a punctured block, the only option is to recreate the RAID group. The recommended method is to run a slow initialization, which zeroes out the disks and regenerates parity, so that all bad block entries are removed from the bad block table.

Please contact EMC/Pivotal Technical Support for the recovery procedure.   

Additional Information

1. Why does the fix for this issue require rebuilding the RAID group?

A punctured RAID group or an "Uncorrectable" medium error means that the data in the bad block cannot be recovered. In short, this is a data loss event. The best path of recovery is to rebuild the affected RAID group.

2. Will running the consistency check repair the "punctured" bad block?

No. A "punctured" bad block means that the data in that block is lost and cannot be recovered. If the bad block does not contain any data, it may be possible to zero out that block, but this is a tedious and time-consuming process; it is faster to rebuild the RAID group and run a full database recovery.

3. What causes a "punctured" block?

When a Patrol Read or a Rebuild operation encounters a media error on the source drive, it punctures a block on the target drive to prevent the use of the data with invalid parity. Any subsequent read operation to the "punctured" block completes with an error. 

Usually, this happens when multiple drives in the same RAID group have media errors. When one of the disks is replaced, the block regenerated during the rebuild is corrupted because a media error on a source drive leaves RAID 5 without enough valid data and parity to generate a good block.
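The reason a second media error is fatal in RAID 5 can be illustrated with XOR parity. The Python sketch below is a simplified model, not the controller's actual implementation: it uses one stripe with three tiny data blocks and one parity block, and ignores parity rotation and stripe element size. With one block missing, the XOR of the survivors returns the lost block; with two missing, the survivors only yield the XOR of the two lost blocks, so neither can be regenerated.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus parity (values are arbitrary examples).
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# Case 1: one block lost (data[1]). XOR of the survivors regenerates it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# Case 2: two blocks lost (data[1] and data[2], e.g. a failed drive plus a
# media error on a source drive). The survivors only give data[1] XOR data[2].
combined = xor_blocks([data[0], parity])
assert combined == xor_blocks([data[1], data[2]])
assert combined != data[1] and combined != data[2]

print("single loss recoverable; double loss leaves only the XOR of the lost blocks")
```

In the second case the controller has no valid value to write to the target drive, which is why it punctures the block instead of writing data with invalid parity.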

Comments

  • Kushal Choubay

    There is another scenario in which the Vdisk shows Critical: when "Virtual Disk Bad Blocks : Yes" is reported.

    Resolution ( need Nikhil to verify )

    (1) check the logs
    (2) run VD consistency check
    (3) replace all affected disk(s)
    (4) run the VD consistency check and the badblocks
    (5) Only if (4) returns clean and the VD state is still in Critical and Virtual Disk Bad Blocks is still set to Yes, 'omconfig storage vdisk action=clearvdbadblocks controller=0 vdisk=X (where X is the VD ID)' can be run to reset the flag. If (4) still returns errors, whole server w/ disks needs to be planned.

  • Josh Loar
    Kushal, I agree with steps 1 through 4. But we do not execute the action in step 5. That action only suppresses the issue instead of fixing it. Vdisk bad blocks will require a vdisk wipe (zero out) and rebuild as stated in the article.