Can I prevent a Linux server from locking up/spewing console errors when a hard drive fails?

I have a Linux server (CentOS 5.5) that has two identical IDE hard drives. I’ve used software RAID (mdadm) to create mirrors for each filesystem, so that either hard drive could fail and no data would be lost.

Today one of my hard drives failed. The whole point of RAID should be to allow the system to keep running when this happens; but what happened instead was that the console began spewing the same 4 lines over and over:

hdb: task_out_intr: status=0x61 { DriveReady DeviceFault Error }
hdb: task_out_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
ide0: reset: success

Due to the high rate of errors being produced, the console was unusable. I was able to SSH in, but the first command I tried just hung. I SSH’ed in again and tried to reboot, but that hung as well. Ultimately I had to physically reset the machine.

I know how to remove the failed drive from the MD and replace it, etc. But having the machine lock up and become unusable in this situation seems to defeat the whole point of having RAID mirrors in the first place.

Is this just the way the Linux kernel always behaves in this situation? Or is there some way to configure the kernel so that when a hard drive fails, it rate-limits the errors being produced, and doesn’t prevent the machine from being used and cleanly rebooted?

Answer

I haven’t run into this myself, but since you’re using software RAID, it’s possible that the failing disk is interfering with I/O on the disk controller, which would explain the other symptoms, such as commands locking up.

The data should be intact (unless it’s corrupted, in which case you have duplicated corruption). If the drive itself failed you should be able to power down, remove the bad drive, power back up and hopefully things will come back online with a broken mirror set.
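Once the machine is back up, it is worth confirming which arrays actually came up degraded. Here is a minimal sketch in Python that parses the text of /proc/mdstat and lists arrays with fewer active members than expected. The sample text, the array names (md0, md1) and the partition names (hda1, hdb1, and so on) are made up for illustration and are not taken from the asker's machine; the helper name degraded_arrays is also just an assumption of this sketch.

```python
import re

# Array header line, e.g. "md0 : active raid1 hda1[0] hdb1[1](F)".
ARRAY_RE = re.compile(r"^(md\d+)\s*:\s*(.*)$")
# Member counters, e.g. "[2/1] [U_]" (2 members expected, 1 active).
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# A member the kernel has flagged as faulty, e.g. "hdb1[1](F)".
FAILED_RE = re.compile(r"(\S+?)\[\d+\]\(F\)")

# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 hda2[0] hdb2[1]
      78043648 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hdb1[1](F)
      104320 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded array to the members marked (F) on its header line."""
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            failed = FAILED_RE.findall(header.group(2))
            continue
        status = STATUS_RE.search(line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = failed
            current = None
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))  # prints {'md0': ['hdb1']}
```

On a real system you would pass in open("/proc/mdstat").read() instead of SAMPLE. An array that is degraded because a member has already been removed (rather than flagged faulty) shows up with an empty list.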

It sounds to me like the controller isn’t handling this particular kind of failure well. Take out the bad drive: leaving it in does you no good and may be causing more harm.

Attribution
Source: Link, Question Author: bjnord, Answer Author: Bart Silverstrim
