Koozali.org formerly Contribs.org

Raid1 failure

Offline tlicht

  • **
  • 56
  • Imagination is more important than knowledge.
Raid1 failure
« on: February 03, 2008, 09:08:42 AM »
Hi,

SME 6.5 is not happy with one of the raid1 disks:

Code: [Select]
no spare disk to reconstruct array! -- continuing in degraded mode
Can I just install (=connect the hardware) another disk and hope that SME partitions, formats and incorporates it in the raid?

If so, can I just remove the problem disk when SME has finished with the above?

Thank you for invaluable advice.  8)

Re: Raid1 failure
« Reply #1 on: February 03, 2008, 12:31:32 PM »
Hi,
I guess, for SME 6, you have to manage the stuff yourself:
 - create the same partition layout on your new disk ("sfdisk -d /dev/hda > bar" to dump the current table and "sfdisk /dev/hdc < bar" to apply it to the new disk)
 - add the new partitions to your raid devices (raidhotadd /dev/mdXXX /dev/hdXXX - there is also raidhotremove; cross-check the syntax)
 - let everything get back to normal.

You might search the forum for more detailed answers

[edit] I cannot remember how sme6.5 raid was managed, with raidhotadd (like sme6 and prior) or mdadm (like sme7 and later); I checked and I found both on my (noraid, only one disk :oops:) 6.5 server.
Maybe http://wiki.contribs.org/Raid#Raid_Notes will be more accurate. Apologies, I don't have the answer (I think I would go with mdadm, but this is not my server, I can't make the choice ...)[/edit]
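For reference, the sequence described above can be sketched as a dry-run script that only prints the commands to run. The device names and the md0/partition pairing below are examples, not taken from the poster's system; check /proc/mdstat for the real pairings before running anything.

```shell
# Dry-run sketch of the repair sequence on a raidtools (2.4-kernel) system.
# Nothing here touches a disk: the commands are only printed for review.
# Device names are examples -- substitute the real good/new disks.
plan_rebuild() {
    good=$1   # surviving disk, e.g. /dev/hda
    new=$2    # replacement disk, e.g. /dev/hdc
    echo "sfdisk -d $good > partitions.out"   # dump the good disk's table
    echo "sfdisk $new < partitions.out"       # replay it onto the new disk
    echo "raidhotadd /dev/md0 ${new}5"        # then re-attach each raid partition
}
plan_rebuild /dev/hda /dev/hdc
```

Run the printed commands by hand once you have confirmed which partitions belong to which md device.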

G.
« Last Edit: February 03, 2008, 01:53:19 PM by Gaston94 »

Offline tlicht

Re: Raid1 failure
« Reply #2 on: February 03, 2008, 08:00:07 PM »
Thanx Gaston94,

Sounds not too hard  :eek:
but.... I now managed to find another 120GB disk which is not from the same manufacturer. Naturally the original and the replacement disks are slightly different in size....
hda has 15017 cylinders
hdc has 14593 cylinders

Is there a way to non-destructively shrink the original disk to 14593 cylinders?
In the PC world there is Partition Magic.... could that be used for this too?

Cheers  8-)

Offline raem

  • *
  • 3,972
Re: Raid1 failure
« Reply #3 on: February 04, 2008, 07:34:05 AM »
tlicht

The most important thing you should do FIRST is make a full system backup. At least now you have a working intact system with data, albeit in single drive degraded mode.

You can add a larger disk to the degraded array (and just sacrifice its extra space) but not a smaller disk.
Are you saying your working disk is larger than the spare (120Gb) disk you want to add?
 

There was a RAID recovery Howto by Darrell May but it's not on his contribs site anymore. My copy appears to be corrupted so I will have to search through my archive backups for it.

Here is a different but very old Howto from 2001 (extracts), which should still be applicable to sme6.x


If your disks are IDE, ensure both disks are configured as the master device, one on each of the IDE buses.

Ensure both disks have the same geometry. (Ideally the exact same make and model of drive will be used.)

Be aware that there is currently no automatic failure detection.


2. Detecting a RAID array failure

2.1. Failure Modes
There are two types of failure that may occur with respect to Software RAID. The first is a single disk failure, in which the array is "degraded", and the second is multiple disk failure, in which both drives have failed. This HOWTO will not get into detail about multiple disk failure, but will detail how to repair the more common failure, a degraded array.

2.2. Detection of RAID array failure
Currently there exists no automatic detection for degraded arrays. It is up to the administrator to prepare a script to monitor the array, or perform regular checkups to verify a healthy system.

There are three ways that an administrator can detect a degraded array. One is by reviewing the system log /var/log/messages for kernel warnings, and another is by viewing the "kernel ring buffer" with /bin/dmesg. The best way to query the status of the array is by examining the file /proc/mdstat.

2.2.1. /proc/mdstat
To view what the file /proc/mdstat contains, 'cat' the file:


    [root@vmware-pkn2]# cat /proc/mdstat
    Personalities : [raid1]
    read_ahead 1024 sectors
    md2 : active raid1 hdd1[1] hda1[0] 264960 blocks [2/2] [UU]
    md0 : active raid1 hdd5[1] hda5[0] 15936 blocks [2/2] [UU]
    md1 : active raid1 hdd6[1] hda6[0] 9735232 blocks [2/2] [UU]
    unused devices: <none>


 


The contents of /proc/mdstat as shown above indicate a healthy disk array. There are two indicators. The first is the presence of two disk partitions per raid device. For example, the raid device md2 is an active raid1 array containing the two hard disk partitions, hdd1 and hda1.

The second indicator is the [UU] bit. A degraded array, or an array that is in the process of being built will contain [U_].

Compare the contents of /proc/mdstat from a machine with a degraded disk array:

    [root@vmware-pkn2]# cat /proc/mdstat
    Personalities : [raid1]
    read_ahead 1024 sectors
    md2 : active raid1 hda1[0] 264960 blocks [2/1] [U_]
    md0 : active raid1 hda5[0] 15936 blocks [2/1] [U_]
    md1 : active raid1 hda6[0] 9735232 blocks [2/1] [U_]
    unused devices: <none>


 

Note only hda partitions are present in the raid partitions.
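The [UU]/[U_] convention above lends itself to a one-line check. The following is a sketch of a hypothetical helper (not part of SME) that prints the name of any md device whose status flags contain an underscore; on a live system you would feed it /proc/mdstat.

```shell
# Print the name of each md array whose status flags (e.g. [U_]) show a
# missing member. Reads mdstat-formatted text on stdin; the field layout
# follows the 2.4-kernel /proc/mdstat shown above.
degraded_arrays() {
    awk '/^md/ && /\[[U_]*_[U_]*\]/ { print $1 }'
}

# on a live system: degraded_arrays < /proc/mdstat
```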

2.2.2. The system log (/var/log/messages)
A scan (or grep) of the system log /var/log/messages will also provide clues that an array has failed:

    [root@vmware-pkn2]# grep -e raid -e degraded /var/log/messages
    May 16 13:02:59 vmware-pkn2-55b1SL kernel: raid1: device hda5 operational as mirror 0
    May 16 13:02:59 vmware-pkn2-55b1SL kernel: raid1: md0, not all disks are operational -- trying to recover array
    May 16 13:02:59 vmware-pkn2-55b1SL kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode

 

The first kernel warning indicates that disk hdd has failed, and that disk hda is still operational.

2.2.3. The kernel ring buffer (/bin/dmesg)
When the output of /bin/dmesg is examined, a degraded array is easily detectable. grep through the log for 'degraded' to detect a failing RAID array.

    [root@vmware-pkn2]# /bin/dmesg | grep degraded
    md2: no spare disk to reconstruct array! -- continuing in degraded mode
    md0: no spare disk to reconstruct array! -- continuing in degraded mode

 

2.3. Simulating a degraded RAID array
To simulate a degraded array, power down your server and unplug one of the two drives. Reboot the server and examine /proc/mdstat. Follow the instructions in Section 3, "Repairing a degraded RAID array", to repair the array.

3. Repairing a degraded RAID array
If you have detected that one of your RAID disks has failed, you can refer to the following instructions in order to repair the array.

At a convenient time, shutdown your SME Server and replace the faulty disk. The new disk should have the same geometry as both the old disk and the current working disk. Boot the server.

Switch to a login prompt (press Alt+F2 if you are viewing the console) and login as root.

Partition the new disk. It should be partitioned exactly the same as the other disk. Use the following command to determine the current partition details for the working disk /dev/hdd:

   fdisk -l /dev/hdd

 

You should see details similar to:

   
   Disk /dev/hdd: 64 heads, 63 sectors, 1015 cylinders
   Units = cylinders of 4032 * 512 bytes

      Device Boot    Start       End    Blocks   Id  System
   /dev/hdd1   *         1       131    264064+  fd  Linux raid autodetect
   /dev/hdd2           132      1015   1782144    5  Extended
   /dev/hdd5           132       137     12064+  fd  Linux raid autodetect
   /dev/hdd6           138      1015   1770016+  fd  Linux raid autodetect

 

Set up the identical partitions on /dev/hda using the command

   fdisk /dev/hda

 

Use the fdisk -l command to double check to make sure the partitions are exactly the same as those on the working disk, /dev/hdd.

Determine which partitions have been mirrored. Look at the file /proc/mdstat, where you should see something like this (note that this file is from a working system and not one that has failed):

   # cat /proc/mdstat
   Personalities : [raid1]
   read_ahead 1024 sectors
   md2 : active raid1 hdd1[1] hda1[0] 264000 blocks [2/2] [UU]
   md0 : active raid1 hdd5[1] hda5[0] 11968 blocks [2/2] [UU]
   md1 : active raid1 hdd6[2] hda6[0] 1769920 blocks [2/2] [UU]
   unused devices: <none>

 

This file indicates that you have three "meta-devices" that are mirrored:

md0 - using hdd5 and hda5

md1 - using hdd6 and hda6

md2 - using hdd1 and hda1

Re-attach the partitions from the new disk to the RAID devices:

   /sbin/raidhotadd /dev/md0 /dev/hda5
   /sbin/raidhotadd /dev/md1 /dev/hda6
   /sbin/raidhotadd /dev/md2 /dev/hda1

 

You can see the progress of the raid resynchronization by examining /proc/mdstat. The following example output shows that both /dev/md0 and /dev/md2 are fully synchronized and /dev/md1 is 58% synchronized.

   # cat /proc/mdstat
   Personalities : [raid1]
   read_ahead 1024 sectors
   md2 : active raid1 hdd1[1] hda1[0] 264000 blocks [2/2] [UU]
   md0 : active raid1 hdd5[1] hda5[0] 11968 blocks [2/2] [UU]
   md1 : active raid1 hdd6[2] hda6[0] 1769920 blocks [2/1] [U_] recovery=58% finish=2.6min
   unused devices: <none>
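Rather than re-reading the whole file, a script polling the rebuild can pull out just the resync figure. This is a sketch against the 2.4-kernel mdstat format shown above; the recovery= token does not sit at a fixed field, so it is safer to grep for it directly.

```shell
# Extract the 'recovery=NN%' token from mdstat-formatted input, if present.
recovery_percent() {
    grep -o 'recovery=[0-9]*%' | head -n1
}

# on a live system: recovery_percent < /proc/mdstat
```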

 
...

Offline raem

Re: Raid1 failure
« Reply #4 on: February 04, 2008, 07:56:44 AM »
tlicht

Another/better option may be to upgrade to sme7.3.

Remove the current working disk (see wiki article referred to below for steps to do first).
Install new matching disks and do a sme7.3CD install in RAID1 mode.
Connect and mount the original disk.
Then use the Restore from disk function to restore directly from the old sme6.5 disk, taking care to follow all recommendations re changes to make/contribs to uninstall/custom templates to remove etc.
See
http://wiki.contribs.org/UpgradeDisk

You then have a sme7.3 system that supports menu-selectable array rebuilding, plus lots of improved features, and all your data is still intact on the old drive (just in case).

...

Offline tlicht

Re: Raid1 failure
« Reply #5 on: February 04, 2008, 09:24:55 PM »
Thank you really Ray,

Your extensive description has taught me a lot! As I wrote in another thread, I managed to restore the original disk; I hope it will last for some time still. During my endeavors I found that the power supply seemed to be a little shaky, so maybe that was the cause of the disk problem.
The reason I didn't go for 7.3 is that I am not certain how the stricter password policy would affect the transferred server. In 6.5 the users were allowed to use three-character passwords without the OS protesting - and most of them do have such short pws. Another problem was that I didn't manage to make a proper backup. I could copy all the ibays to another computer but the real export from the server manager halted somewhere at 90% - no idea why.

Offline raem

Re: Raid1 failure
« Reply #6 on: February 04, 2008, 10:59:20 PM »
tlicht

Quote
I could copy all the ibays to another computer but the real export from the server manager halted somewhere at 90% - no idea why.

Probably due to the 2Gb limit/issue.
You need to use one of the backup contribs eg backup2ws for sme6.x, and split the backup file into parts of less than 2Gb. It's all done in the server manager panel that comes with the contrib.

See
http://distro.ibiblio.org/pub/linux/distributions/smeserver/contribs/dmay/smeserver/6.x/contrib/backup2/
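For anyone without the contrib, the underlying trick is just split(1) and cat. The sketch below demonstrates the round trip on a small stand-in file; on a real server the input would be the multi-Gb backup archive and the chunk size something like -b 1900m to stay under the 2Gb limit.

```shell
# Demonstrate splitting a backup into parts and reassembling it.
# 'backup.tgz' is a stand-in file created here only for the demo.
printf 'stand-in backup payload' > backup.tgz
split -b 8 backup.tgz backup.part.        # real use: split -b 1900m ...
cat backup.part.* > backup-restored.tgz   # glob order restores part order
cmp -s backup.tgz backup-restored.tgz && echo "round trip OK"
```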

If/when you do upgrade to sme7.x, then change your user passwords first, and then the upgrade will go smoothly.
I suggest you use the Restore from Disk method, to a freshly installed sme7.x, as that way you get the improved drive partition arrangement and can do the semi auto drive array rebuild routine if and when required. The old partitions in sme6.x do not support this, and if you do a straight upgrade of the existing machine, you are still stuck with the old partitions.

...

Offline warren

  • *
  • 283
Re: Raid1 failure
« Reply #7 on: February 06, 2008, 09:14:11 AM »
Quote
There was a RAID recovery Howto by Darrell May but it's not on his contribs site anymore. My copy appears to be corrupted so I will have to search through my archive backups for it.

Copy of the HowTo in question:

Quote
RAID1 Recovery HowTo
 Suitable for: e-smith 4.1.2/Mitel SME5
 
Author:  Darrell May
Contributor:   

Problem:  You want to easily recover from a RAID1 failure.

Solution:  Implement the steps outlined in the RAID1 Monitor HowTo.  Next follow these steps:
 

--------------------------------------------------------------------------------

 
STEP 1:  Backup your computer!

I can not stress this point strongly enough.  Your first priority on a failed RAID1 system should be to perform an immediate backup.

So, DO IT NOW!

[root@myezserver /root]# /sbin/e-smith/backup
 

--------------------------------------------------------------------------------

 
STEP 2: Power down, replace the failed drive, power up.

First, before we continue: for testing purposes only, this is how to completely erase a drive:

[root@myezserver /root]# dd if=/dev/zero of=/dev/hdb
 
This will write zeroes across the entire /dev/hdb drive.  Remember, for all command-line entries in this HowTo, to substitute your correct /dev/hdX where:
/dev/hda = primary master
/dev/hdb = primary slave
/dev/hdc = secondary master
/dev/hdd = secondary slave
 

--------------------------------------------------------------------------------

 
Step 3: Recover the partition information and use this information to quickly prepare the replacement drive.

[root@myezserver /root]# cat /root/raidmonitor/sfdisk.out
# partition table of /dev/hda
unit: sectors
 
/dev/hda1 : start=       63, size=  530082, Id=fd, bootable
/dev/hda2 : start=   530145, size=39487770, Id= 5
/dev/hda3 : start=        0, size=       0, Id= 0
/dev/hda4 : start=        0, size=       0, Id= 0
/dev/hda5 : start=   530208, size=   32067, Id=fd
/dev/hda6 : start=   562338, size=39455577, Id=fd
# partition table of /dev/hdb
unit: sectors
 
/dev/hdb1 : start=       63, size=  530082, Id=fd, bootable
/dev/hdb2 : start=   530145, size=39487770, Id= 5
/dev/hdb3 : start=        0, size=       0, Id= 0
/dev/hdb4 : start=        0, size=       0, Id= 0
/dev/hdb5 : start=   530208, size=   32067, Id=fd
/dev/hdb6 : start=   562338, size=39455577, Id=fd
Cut and paste your correct # partition table of /dev/hdX.  In my case I am replacing /dev/hdb so this is the information I need to transfer into a file for quick import:

[root@myezserver /root]# pico hdb.out

which should now contain the following entries:

# partition table of /dev/hdb
unit: sectors

/dev/hdb1 : start=       63, size=  530082, Id=fd, bootable
/dev/hdb2 : start=   530145, size=39487770, Id= 5
/dev/hdb3 : start=        0, size=       0, Id= 0
/dev/hdb4 : start=        0, size=       0, Id= 0
/dev/hdb5 : start=   530208, size=   32067, Id=fd
/dev/hdb6 : start=   562338, size=39455577, Id=fd

Next perform the partition table import using the sfdisk command as shown below:

[root@myezserver /root]# sfdisk /dev/hdb < hdb.out
Checking that no-one is using this disk right now ...
OK

Disk /dev/hdb: 2491 cylinders, 255 heads, 63 sectors/track
Old situation:
Empty

New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End  #sectors  Id  System
/dev/hdb1   *        63    530144    530082  fd  Linux raid autodetect
/dev/hdb2        530145  40017914  39487770   5  Extended
/dev/hdb3             0         -         0   0  Empty
/dev/hdb4             0         -         0   0  Empty
/dev/hdb5        530208    562274     32067  fd  Linux raid autodetect
/dev/hdb6        562338  40017914  39455577  fd  Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
 

--------------------------------------------------------------------------------

 
STEP 4:  Review your last known good RAID configuration:

[root@myezserver /root]# /usr/local/bin/raidmonitor -v

ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda1[0] 264960 blocks [2/1] [U_]
md0 : active raid1 hda5[0] 15936 blocks [2/1] [U_]
md1 : active raid1 hda6[0] 19727680 blocks [2/1] [U_]
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[1] hda6[0] 19727680 blocks [2/2] [UU]
unused devices: <none>
 

--------------------------------------------------------------------------------

 
STEP 5:  Add your newly prepared and correctly partitioned hard drive into the RAID1 array, using the information above as your guide:

[root@myezserver /root]# /sbin/raidhotadd /dev/md2 /dev/hdb1
[root@myezserver /root]# /sbin/raidhotadd /dev/md0 /dev/hdb5
[root@myezserver /root]# /sbin/raidhotadd /dev/md1 /dev/hdb6
 

--------------------------------------------------------------------------------

 
STEP 6:  Use raidmonitor to watch the recovery process.  Note this information will also be e-mailed to root every 15 min. until the recovery is completed.

[root@myezserver /root]# /usr/local/bin/raidmonitor -v

ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[0] 19727680 blocks [2/1] [U_] recovery=5% finish=10.0min
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[1] hda6[0] 19727680 blocks [2/2] [UU]
unused devices: <none>
 

--------------------------------------------------------------------------------

 
STEP 7:  Recover and restore the last known good master boot record (MBR) onto the drive you just replaced:
[root@myezserver /root]# /sbin/lilo -C /root/raidmonitor/lilo.conf -b /dev/hdb
 

--------------------------------------------------------------------------------

 
STEP 8:  Shutdown the server, reboot and test the RAID functions

If you have the time, you should test the RAID functionality to make sure the server will boot under simulated hdd failures.

start by booting with both drives attached
power down, disconnect one of the drives, power up, check boot
power down, reconnect the drive, power up and rebuild the array as above repeating steps 5 and 6 only
power down, disconnect the other drive, power up, check boot
power down, reconnect the drive, power up and rebuild the array as above repeating steps 5 and 6 only
OK, now you can confidently say you're ready for anything.  Remember, if anything goes wrong here, you simply reconnect all the hardware, perform a fresh RAID install and then restore from your backup tape.  You did perform STEP 1, correct?
 

--------------------------------------------------------------------------------

 
STEP 9:  When all looks well, re-initialize raidmonitor:
[root@myezserver /root]# /usr/local/bin/raidmonitor -iv
 

--------------------------------------------------------------------------------

 
STEP 10: Go have a drink.  Job well done ;->
 

--------------------------------------------------------------------------------

 

 
 

Offline raem

Re: Raid1 failure
« Reply #8 on: February 06, 2008, 09:59:46 AM »
warren

Thanks for that.

There was a companion RAID Monitor HOWTO by Darrell May as well.
My copy is also corrupted, so could you post a copy of that too.
Thanks.
...

Offline warren

Re: Raid1 failure
« Reply #9 on: February 06, 2008, 12:06:19 PM »
Quote
There was a companion RAID Monitor HOWTO by Darrell May as well.
My copy is also corrupted, so could you post a copy of that too.
Thanks.


Ray see Below :  8-)

Quote
RAID1 Monitor HowTo
 Suitable for: e-smith 4.1.2/Mitel SME5
Author:  Darrell May
Contributor:   
Problem:  You want to enable automatic monitoring of your RAID1 array.

Solution:  dmc-mitel-raidmonitor-0.0.1-5.noarch.rpm
 

--------------------------------------------------------------------------------
 
STEP 1:  install the rpm
[root@e-smith]# rpm -ivh dmc-mitel-raidmonitor-0.0.1-5.noarch.rpm
 

--------------------------------------------------------------------------------
 
STEP 2:  initialize and activate raidmonitor
[root@e-smith]# /usr/local/bin/raidmonitor -iv

A report similar to the following will print to the display:

...................
RAID Monitor Report
...................

Current /proc/mdstat configuration saved in /root/raidmonitor/mdstat:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[1] hda6[0] 19727680 blocks [2/2] [UU]
unused devices: <none>

Current partition configuration saved in /root/raidmonitor/sfdisk.out:

# partition table of /dev/hda
unit: sectors

/dev/hda1 : start=       63, size=  530082, Id=fd, bootable
/dev/hda2 : start=   530145, size=39487770, Id= 5
/dev/hda3 : start=        0, size=       0, Id= 0
/dev/hda4 : start=        0, size=       0, Id= 0
/dev/hda5 : start=   530208, size=   32067, Id=fd
/dev/hda6 : start=   562338, size=39455577, Id=fd
# partition table of /dev/hdb
unit: sectors

/dev/hdb1 : start=       63, size=  530082, Id=fd, bootable
/dev/hdb2 : start=   530145, size=39487770, Id= 5
/dev/hdb3 : start=        0, size=       0, Id= 0
/dev/hdb4 : start=        0, size=       0, Id= 0
/dev/hdb5 : start=   530208, size=   32067, Id=fd
/dev/hdb6 : start=   562338, size=39455577, Id=fd

Current cron entry is:

15 * * * * root /usr/local/bin/raidmonitor -v

Stopping crond:                                            [   OK   ]
Starting crond:                                            [   OK   ]
 

--------------------------------------------------------------------------------
 
STEP 3: Ok what's next?
For usage simply enter raidmonitor without args:

[root@e-smith /root]# /usr/local/bin/raidmonitor
usage: /usr/local/bin/raidmonitor [-iv]
-i initialize raidmonitor
-v verify RAID configuration matches and displays ALARM message if mismatch found

Let me explain what we are doing here.

/root/raidmonitor/mdstat is a snapshot of your last known good /proc/mdstat RAID config.
raidmonitor is now running as a cron job and checks every 15 minutes to compare if the current /proc/mdstat matches the last known good snapshot /root/raidmonitor/mdstat.
If there is any change in /proc/mdstat, then your RAID has experienced a problem and root is e-mailed an !ALARM! RAID configuration problem error message.  Here is an example:
ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda1[0] 264960 blocks [2/1] [U_]
md0 : active raid1 hda5[0] 15936 blocks [2/1] [U_]
md1 : active raid1 hda6[0] 19727680 blocks [2/1] [U_]
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[1] hda6[0] 19727680 blocks [2/2] [UU]
unused devices: <none>
As you can see, this shows in our current configuration that we have lost our second RAID1 ide drive [2/1] [U_].  Note if we had lost our first drive instead, the status would read [2/1] [_U].
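The comparison raidmonitor performs can be sketched with plain diff. The file names below are stand-ins for the demo (the contrib keeps its snapshot in /root/raidmonitor/mdstat and mails root from cron instead of echoing):

```shell
# Simulate a saved known-good snapshot and a degraded current state, then
# run the kind of comparison raidmonitor does every 15 minutes from cron.
cat > mdstat.good <<'EOF'
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
EOF
cat > mdstat.now <<'EOF'
md0 : active raid1 hda5[0] 15936 blocks [2/1] [U_]
EOF
if ! diff -q mdstat.good mdstat.now >/dev/null; then
    echo "ALARM! RAID configuration problem"
fi
```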

On to recovery.....
 

Offline tlicht

Re: Raid1 failure
« Reply #10 on: February 06, 2008, 12:21:28 PM »
Thanx Warren,

I'll keep a copy of this on my now extremely safe ;-) SME server for future need.

Offline raem

Re: Raid1 failure
« Reply #11 on: February 06, 2008, 02:20:59 PM »
warren

Thanks also Warren
...