Koozali.org: home of the SME Server

raid error

djnapkin

raid error
« on: September 26, 2006, 05:47:30 AM »
Hi,
I am a relative newbie to SME. I have a server that is sending this message to the admin account every 15 minutes:

Subject: Cron <root@"servername"> /usr/local/bin/raidmonitor -v

Message:
 
ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hde3[0]
      264960 blocks [2/1] [U_]
     
md1 : active raid1 hde2[0]
      116848704 blocks [2/1] [U_]
     
md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hde3[0] hdg3[1]
      264960 blocks [2/2] [UU]
     
md1 : active raid1 hde2[0] hdg2[1]
      116848704 blocks [2/2] [UU]
     
md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>
As I understand it, the [U_] marks the arrays that are no longer synced?
How can I fix it?

I have tried:
- a reboot
- the /sbin/raidhotremove .../raidhotadd procedure I found in other posts, but it refuses, saying the HDD is in use.

Can anyone give me some advice and a step-by-step that a relative newb could follow?

Thanks in advance.
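
For anyone reading along, the [U_] marker can be picked out mechanically. What follows is only a minimal sketch, assuming mdstat-style text like the listing above: each "U" in the last bracket is a member that is up and each "_" is a missing one. The helper name degraded_arrays is made up for illustration and is not part of raidmonitor.

```shell
# Sketch: list md arrays whose member map (for example [U_]) shows a missing member.
# degraded_arrays is a hypothetical helper; it reads mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # remember the array named on the header line
        /blocks/ && /\[[U_]*_[U_]*\]/ {          # status line with at least one "_" in the map
            print name
        }
    '
}

# On a live system (assumption: the kernel exposes /proc/mdstat):
#   degraded_arrays < /proc/mdstat
```

Fed the "Current configuration" from the alarm mail, this would name md2 and md1 and skip md0.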

Gaston94

Re: raid error
« Reply #1 on: September 26, 2006, 02:43:31 PM »
Hi,
Quote from: "djnapkin"

md2 : active raid1 hde3[0]
      264960 blocks [2/1] [U_]
     
md1 : active raid1 hde2[0]
      116848704 blocks [2/1] [U_]

As you say, it means that two of your RAID devices are broken: each is running on only one of its two disks.

You should not have got the "in use" message. Could you confirm that you typed the following commands:
Code:

raidhotadd /dev/md2 /dev/hdg3
raidhotadd /dev/md1 /dev/hdg2
It's easy to try to remove the "correct" disk instead of the faulty one by mistake.
You do not have to remove hdg3 and hdg2 from the RAID devices first, as they are no longer members.

Rgds
G.
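
The two commands above come from comparing the "last known good" listing with the current one. Here is a hedged sketch of that comparison, assuming both listings have been saved to files. The helper name suggest_hotadd and the file arguments are hypothetical, and it only prints the commands, it never runs them.

```shell
# Sketch: print (not run) raidhotadd commands for partitions that appear in a
# "last known good" mdstat listing but are absent from the current one.
# suggest_hotadd is a hypothetical helper; both arguments are files of mdstat-style text.
suggest_hotadd() {
    good=$1
    current=$2
    awk '
        /^md[0-9]+ :/ {
            for (i = 5; i <= NF; i++) {             # fields 5 and up are the members
                dev = $i
                sub(/\[.*/, "", dev)                # hdg3[1] or hdg2[1](F) -> hdg3 or hdg2
                if (FNR == NR) want[$1 " " dev] = 1 # first file: expected members
                else           have[$1 " " dev] = 1 # second file: members still listed
            }
        }
        END {
            for (k in want) if (!(k in have)) {
                split(k, p, " ")
                print "raidhotadd /dev/" p[1] " /dev/" p[2]
            }
        }
    ' "$good" "$current" | sort
}
```

A member that is still listed but flagged (F) counts as present here, so it is not suggested; such a member has to be hot-removed before it can be added back.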

raem

Re: raid error
« Reply #2 on: September 26, 2006, 05:00:39 PM »
djnapkin

There may be no real drive problem, just a glitch.
You could try re-initializing raidmonitor:
/usr/local/bin/raidmonitor -iv

Refer to step 9 in Darrell May's RAID recovery howto in the contribs area:
http://mirror.contribs.org/smeserver/contribs/dmay/smeserver/5.x/contrib/raidmonitor/raid-recovery-howto.html

If that doesn't work, then I would run a disk checking test on your hdg drive
...

djnapkin

raid error
« Reply #3 on: September 27, 2006, 09:10:21 AM »
Thank you both for your help.

I ran the raidhotadd commands as per Gaston94's suggestion. I must have been typing the wrong syntax originally. This time it worked and I watched it rebuild the array. However, it is now coming up with this:

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdg3[1] hde3[0]
      264960 blocks [2/2] [UU]

md1 : active raid1 hdg2[1](F) hde2[0]
      116848704 blocks [2/1] [U_]

md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hde3[0] hdg3[1]
      264960 blocks [2/2] [UU]

md1 : active raid1 hde2[0] hdg2[1]
      116848704 blocks [2/2] [UU]

md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>

Does this mean one of the HDDs is dodgy?
Any suggestions?

Gaston94

raid error
« Reply #4 on: September 27, 2006, 12:06:32 PM »
Hi djnapkin,
you can run the following commands again:
Code:
raidhotremove /dev/md1 /dev/hdg2
raidhotadd /dev/md1 /dev/hdg2

This will remove the partition from the RAID device and add it again.
Monitor the RAID rebuild with cat /proc/mdstat.
Should it come back as failed, you really might consider changing your disk :(

I would also recommend looking at the SMART information for this disk (smartctl -a /dev/hdg).

Rgds
Gaston
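
When watching the rebuild, it helps to know which state the array is in at a glance. Below is a rough sketch that classifies one array from mdstat-style text. The helper name array_state is hypothetical, and the "recovery =" / "resync =" progress-line format is an assumption about how this kernel reports a rebuild, since no such line appears in the listings posted here.

```shell
# Sketch: classify one array from mdstat-style text read on stdin.
# array_state is a hypothetical helper. The progress-line pattern
# ("recovery =" or "resync =") is an assumption, not taken from this thread.
array_state() {
    awk -v want="$1" '
        /^md[0-9]+ :/ {
            cur = $1
            if (cur == want) { found = 1; if ($0 ~ /\(F\)/) failed = 1 }
        }
        cur == want && /(recovery|resync) *=/        { rebuilding = 1 }
        cur == want && /blocks/ && /\[[U_]*_[U_]*\]/ { degraded = 1 }
        END {
            if (!found)          print "unknown"
            else if (failed)     print "failed"
            else if (rebuilding) print "rebuilding"
            else if (degraded)   print "degraded"
            else                 print "ok"
        }
    '
}

# On a live system:
#   array_state md1 < /proc/mdstat
```

A member flagged (F) takes priority, because that is the case where the disk itself becomes the suspect.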

djnapkin

raid error
« Reply #5 on: September 28, 2006, 05:44:17 AM »
Once again Gaston94, you are entirely correct. raidmonitor now reads:

ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdg3[1] hde3[0]
      264960 blocks [2/2] [UU]
     
md1 : active raid1 hdg2[1] hde2[0]
      116848704 blocks [2/2] [UU]
     
md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hde3[0] hdg3[1]
      264960 blocks [2/2] [UU]
     
md1 : active raid1 hde2[0] hdg2[1]
      116848704 blocks [2/2] [UU]
     
md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>

How come hde3/hdg3 and hde2/hdg2 have swapped around?
Am I safe to reinitialise raidmonitor?

Gaston94

raid error
« Reply #6 on: September 28, 2006, 10:50:03 AM »
Hi,
Quote from: "djnapkin"

How come hde3/hdg3 and hde2/hdg2 have swapped around?
Am I safe to reinitialise raidmonitor?

The last time you rebooted your system, hdg was not OK, so it did not come back as a member of the
RAID config. At that point you would have seen from mdstat that the md array was built from
only one disk. Something like:
Code:
md2 : active raid1 hde3[0]
      264960 blocks [2/1] [U_]
(I don't remember the exact display, but there is only one disk on the array's line.)

Then you rebuilt your RAID array and your mdstat information got updated accordingly (from left to right).
To me it's just a cosmetic issue (I guess the previous order might come back after a reboot).
And from past experience, you can safely run:
Code:
raidmonitor -iv

G.
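
To check that the only difference really is the member order, both listings can be normalized before comparing them. This is only a sketch of that idea, not a description of how raidmonitor compares configurations; the helper name normalize_members is made up.

```shell
# Sketch: rewrite each array header line with its members in sorted order so that
# "hdg3[1] hde3[0]" and "hde3[0] hdg3[1]" compare equal. Other lines pass through.
# normalize_members is a hypothetical helper reading mdstat-style text on stdin.
normalize_members() {
    awk '
        /^md[0-9]+ :/ {
            n = 0
            for (i = 5; i <= NF; i++) m[++n] = $i    # collect the member fields
            for (i = 2; i <= n; i++) {               # insertion sort by member name
                v = m[i]
                for (j = i - 1; j >= 1 && m[j] > v; j--) m[j + 1] = m[j]
                m[j + 1] = v
            }
            line = $1 " " $2 " " $3 " " $4
            for (i = 1; i <= n; i++) line = line " " m[i]
            print line
            next
        }
        { print }
    '
}
```

If the normalized "current" and "last known good" listings are identical, the alarm is down to ordering alone.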

djnapkin

raid error
« Reply #7 on: September 29, 2006, 03:53:20 AM »
It has failed again with:

ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdg3[1] hde3[0]
      264960 blocks [2/2] [UU]
     
md1 : active raid1 hdg2[1](F) hde2[0]
      116848704 blocks [2/1] [U_]
     
md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hde3[0] hdg3[1]
      264960 blocks [2/2] [UU]
     
md1 : active raid1 hde2[0] hdg2[1]
      116848704 blocks [2/2] [UU]
     
md0 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>

The messages log also shows this:

Sep 28 18:04:27 servername kernel: hdg: 0 bytes in FIFO
Sep 28 18:04:27 servername kernel: hdg: timeout waiting for DMA
Sep 28 18:04:27 servername kernel: hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 28 18:04:27 servername kernel: hdg: dma_intr: error=0x40 { UncorrectableError }, LBAsect=359749, high=0, low=359749, Sector=150904
Sep 28 18:04:27 servername kernel: end_request: I/O error, dev 22:02 (hdg), sector 150904
Sep 28 18:04:27 servername kernel: raid1: Disk failure on hdg2, disabling device.
Sep 28 18:04:27 servername kernel: ^IOperation continuing on 1 devices
Sep 28 18:04:27 servername kernel: raid1: hdg2: rescheduling block 150904
Sep 28 18:04:27 servername kernel: md: updating md1 RAID superblock on device
Sep 28 18:04:27 servername kernel: md: (skipping faulty hdg2 )
Sep 28 18:04:27 servername kernel: md: hde2 [events: 0000001d]<6>(write) hde2's sb offset: 116848704
Sep 28 18:04:27 servername kernel: md: recovery thread got woken up ...
Sep 28 18:04:27 servername kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Sep 28 18:04:27 servername kernel: md: recovery thread finished ...
Sep 28 18:04:27 servername kernel: raid1: hde2: redirecting sector 150904 to another mirror

I guess the HDD is no good. I'll try replacing it and let you know here how it goes. Thanks again for all your help.
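
For reference, the line in that log that names the culprit is the "raid1: Disk failure on ..." one. A small sketch for pulling those out of a log follows, assuming the same message format as above; the helper name failed_members is hypothetical.

```shell
# Sketch: print each partition that the raid1 driver reported as failed.
# failed_members is a hypothetical helper; it reads syslog-style text on stdin and
# assumes the "raid1: Disk failure on <partition>, disabling device." wording shown above.
failed_members() {
    sed -n 's/.*raid1: Disk failure on \([a-z0-9]*\),.*/\1/p' | sort -u
}

# On a live system (assumption: kernel messages go to /var/log/messages):
#   failed_members < /var/log/messages
```

Repeated failures of the same partition collapse into one line, which makes a recurring offender such as hdg2 easy to spot.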

raem

raid error
« Reply #8 on: September 29, 2006, 04:01:36 AM »
djnapkin

> It has failed again .... I guess the HDD is no good.

I did suggest that a couple of days ago, but you chose not to follow my advice.
 
> You could try re-initializing the raidmonitor
> /usr/local/bin/raidmonitor -iv
> If that doesn't work, then I would run a disk checking test on your hdg drive
...

djnapkin

raid error
« Reply #9 on: September 29, 2006, 04:40:48 AM »
Quote
I did suggest that a couple of days ago but you chose not to follow my advice

> You could try re-initializing the raidmonitor
> /usr/local/bin/raidmonitor -iv
> If that doesn't work, then I would run a disk checking test on your hdg drive


My understanding was that re-initializing raidmonitor just overwrites the last known good config with the current config, so it would not resolve the actual problem, only make it less obvious. If I am mistaken about that, I am sorry for not following your advice earlier.

raem

raid error
« Reply #10 on: September 29, 2006, 05:32:00 AM »
djnapkin

You ignored the part where I said:
"I would run a disk checking test on your hdg drive"
...