Koozali.org: home of the SME Server

Hard drive failure - can’t boot!

FreakWent
Hard drive failure - can’t boot!
« on: January 13, 2019, 03:36:19 AM »
Hello everyone,

I have a server with twin primary drives in a standard RAID-1 setup, but one has failed.

“No problem” I smugly shrug to myself, “I’ll just boot the other drive!”

However, it seems that this doesn’t have grub on it.  Assuming that the drives are synced correctly, is this normal, expected behaviour? I certainly didn’t expect it!

Can I boot a USB stick and easily fix this with a few grubby commands or is something worse likely here?

Thanks in advance for any help, I’m dead in the water until I can get this fixed!

FreakWent
Re: Hard drive failure - can’t boot!
« Reply #1 on: January 13, 2019, 04:48:31 AM »
If it helps:

 - booting arch from a USB, then fdisk -l shows all the mountable raid partitions (ext3), if the working and failing drives are connected.
 - booting arch from a USB, then fdisk -l shows all the raid partitions (Linux raid auto), if only the working drive is connected.
 - booting from the working disk gives a flashing cursor, no other output.

I'd be happy to just start poking the drive with grub and mdadm if I had another copy handy, but I don't; nor do I have a spare drive big enough to 'dd' this drive onto.

I'm assuming that something in grub or the raid config is choosing to boot the busted drive by UUID, and that this data exists on the working drive as well.  If so, I might be able to just change the UUID and I'm all set.

As before, any reply is welcome, even if it's just mocking laughter....

ReetP
Re: Hard drive failure - can’t boot!
« Reply #2 on: January 13, 2019, 12:34:14 PM »
Rule 1 with raid issues is don't start messing about unless you a) have damn good backups and b) know what you are doing.

Grub is a bit odd in the way it installs itself and I don't pretend to be able to explain it here and now... it knows which number drive it is installed on and its relationship to the other. Have a read on the wiki.

If you remove the bad drive your server may not boot. Leave it in place and it should, even with the bad drive.

Remember if tinkering that you have a raid boot partition as well as a raid main partition so you have got to be careful.

The simple answer, which you are going to have to address anyway, is to get another drive and install it. The good drive still has all your data.

Put it in the same place/drive number as the bad one.

SME should boot, pick it up, add it, and re-mirror your data. That will be your easiest route. No tinkering required.

Get a second one to sit on the shelf for future issues.
...
1. Read the Manual
2. Read the Wiki
3. Don't ask for support on Unsupported versions of software
4. I have a job, wife, and kids and do this in my spare time. If you want something fixed, please help.

Bugs are easier than you think: http://wiki.contribs.org/Bugzilla_Help

If you love SME and don't want to lose it, join in: http://wiki.contribs.org/Koozali_Foundation

janet
Re: Hard drive failure - can’t boot!
« Reply #3 on: January 13, 2019, 05:46:13 PM »
FreakWent

You need to test both drives with the drive manufacturer's disk testing software.
Typically you would download it & create a boot CD/DVD disk or boot USB, & then boot up your server to that boot device.

It is possible that both disks are faulty in different ways, so test both before doing anything.

If you have one good disk, then there are instructions in wiki/forums re how to install grub, if that is your only problem.
Please search before asking, an answer may already exist.
The Search & other links to useful information are at top of Forum.

Jean-Philippe Pialasse (aka Unnilennium)
Re: Hard drive failure - can’t boot!
« Reply #4 on: January 16, 2019, 02:27:00 PM »
You can use the SME install disk to repair the MBR / install grub on the good drive.

It is all in the wiki.


A warning, from my own experience: it can be very easy to choose the wrong disk as the good one.
First, as janet pointed out, both could be damaged; and even if not, since they are the same age the second one will likely fail soon.
Second, sometimes it is the more damaged disk that was kept in the array. I once mounted the disk I thought was in the best shape, and it was in fact the one that had been out of sync for the longest time. I was glad to have a backup then; otherwise I would have lost a week or so of changes, because the faulty drive had been discarded before I realised.

So as soon as you are able to boot from each disk, and have them labelled so you can tell them apart, first look at the messages log or another recognisable file to see which one was running in production most recently. Sometimes this is the bootable one, sometimes the other.
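
For anyone wanting to script that comparison, here is a minimal Python sketch. It assumes each disk's copy of the log has been made readable (mounted read-only) at a path of your choosing; the /mnt/disk_a and /mnt/disk_b mount points and the newest_copy helper are hypothetical names, not anything SME provides. It only compares file modification times, so treat the result as a hint and still read the last entries of each log yourself.

```python
import os
from datetime import datetime, timezone


def newest_copy(paths):
    """Return (path, mtime) for the most recently modified existing file."""
    stamped = [(p, os.path.getmtime(p)) for p in paths if os.path.exists(p)]
    if not stamped:
        raise FileNotFoundError("none of the given log copies exist")
    return max(stamped, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical mount points: one per disk, each mounted read-only.
    candidates = [
        "/mnt/disk_a/var/log/messages",
        "/mnt/disk_b/var/log/messages",
    ]
    try:
        path, mtime = newest_copy(candidates)
    except FileNotFoundError as exc:
        print(exc)
    else:
        when = datetime.fromtimestamp(mtime, tz=timezone.utc)
        print(f"most recently written: {path} ({when:%Y-%m-%d %H:%M} UTC)")
```

The disk whose log was written last is probably the one that stayed in the array, which may not be the one that currently boots.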


Side note, what is the history of this server? SME should install grub on both disks; if it does not, we need to fix that, and to do so we need to be able to reproduce your issue.

ReetP
Re: Hard drive failure - can’t boot!
« Reply #5 on: January 16, 2019, 03:34:06 PM »
Following up on my earlier post... if I remember correctly.

When grub installs it uses references to physical drives. The first drive is hd0 and its first partition is (hd0,0), as legacy grub counts partitions from zero.

If your 1st drive fails do not just disconnect it and expect the server to boot as it most likely won't.... the 2nd drive will become the 1st drive and grub can then get in a muddle with drive numbering.

Either leave the old drive connected and hopefully the BIOS will at least know there is a drive there and number them accordingly and it will run with the array degraded, or the better option is to replace it with a new drive.

(I got in a knot like this myself once)

If it still won't boot with either the old drive in place or a new drive in place then, as per JP's suggestion, you will need to use the repair option from the install disk.

Quote
- booting arch from a USB, then fdisk -l shows all the mountable raid partitions (ext3), if the working and failing drives are connected.

So an emergency boot disk can read all the partitions on both drives?

So exactly what error did you have on the first failed boot? Just a flashing cursor, or had you disconnected the first drive? Any other errors at all - eg smart failures, bios errors at startup etc?

I'd suggest not to touch mdadm yet - that could just exacerbate the issue. If you can get it to boot on one drive, it will run as a degraded array - not good but will allow you to work, add another drive to the array etc.
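
Checking whether the array is running degraded does not change anything on disk, since it only means reading /proc/mdstat. Below is a rough Python sketch of that check. SAMPLE_MDSTAT is made-up illustrative text in the usual mdstat layout, not output from the server in this thread, and degraded_arrays is a hypothetical helper name.

```python
import re

# Illustrative sample only; not output from the server in this thread.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 sda2[0]
      488279488 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status string shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status reads [UU] when healthy, [U_] or [_U] when a member is missing.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a running system you would pass in the contents of /proc/mdstat instead of the sample text.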

Jean-Philippe Pialasse (aka Unnilennium)
Re: Hard drive failure - can’t boot!
« Reply #6 on: January 16, 2019, 07:22:51 PM »
Usually you configure both MBRs with (hd0,0), so that either drive can still boot after the other is removed.
https://wiki.contribs.org/Raid:Manual_Rebuild#HowTo:_Write_the_GRUB_boot_sector


On that point the wiki grub page is wrong, as it suggests sdb should use (hd1,0), which would fail to boot on a single drive once the first HDD is removed.

Also, the new disk has to be attached as the 2nd one on first boot ;) when adding it.
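
To make the drive-numbering point concrete, here is a deliberately simplified Python model. It is not how grub is implemented. It assumes the BIOS numbers the connected disks hd0, hd1, ... in order, boots the first one, and that the boot code on each disk looks for the rest of grub on whatever drive number was recorded at install time. The function names and the two layouts are made up for illustration.

```python
def bios_numbering(connected_disks):
    """Map grub drive names to disks in the order the BIOS presents them."""
    return {f"hd{i}": disk for i, disk in enumerate(connected_disks)}


def can_boot(connected_disks, mbr_target):
    """
    Simplified model: the BIOS boots the first connected disk, and the boot
    code in that disk's MBR looks for the rest of grub on the drive number
    recorded at install time (mbr_target[disk]).
    """
    if not connected_disks:
        return False
    drives = bios_numbering(connected_disks)
    boot_disk = connected_disks[0]
    wanted = mbr_target.get(boot_disk)  # None means no boot code on this disk
    return wanted in drives


# Both MBRs written so that each disk refers to itself as hd0.
both_hd0 = {"sda": "hd0", "sdb": "hd0"}
# Second disk written as hd1, the layout described as wrong above.
second_hd1 = {"sda": "hd0", "sdb": "hd1"}

if __name__ == "__main__":
    for label, layout in (("both hd0", both_hd0), ("sdb as hd1", second_hd1)):
        print(label, "| only sdb connected ->", can_boot(["sdb"], layout))
```

In this model the "both hd0" layout still boots with only sdb connected, while the "sdb as hd1" layout does not, because a lone disk is always hd0 and there is no hd1 left to point at.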

ReetP
Re: Hard drive failure - can’t boot!
« Reply #7 on: January 16, 2019, 10:30:03 PM »
Always throws me.... !