Hi guys

I've got one drive in a raid1 going south and am looking for a sanity
check before I start working on it. I haven't done this enough to be
confident with it and this box is in use, so I really want to get it
right the first time without a lot of downtime. Long post, specific
questions at end.

It's a generic PC running Debian testing, with RAID1 for both / and
swap. I've got two identical Maxtor 40G's in it on hda and hdc. Both
drives are physically about two years old; one was unused until this
installation about three weeks ago and the other had seen little use.

Here's a slice of menu.lst

## ## End Default Options ##

title           Debian GNU/Linux, kernel 2.6.8-1-386 
root            (hd0,0)
kernel          /boot/vmlinuz-2.6.8-1-386 root=/dev/md0 ro 
initrd          /boot/initrd.img-2.6.8-1-386
savedefault
boot


Here's /proc/mdstat

mercury:~# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 hda2[0] hdc2[1]
      1952384 blocks [2/2] [UU]
      
md0 : active raid1 hda1[2](F) hdc1[1]
      37109376 blocks [2/1] [_U]
      
unused devices: <none>

As you can see,    / is on hda1 (borked) and hdc1,
                swap is on hda2          and hdc2.
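
For anyone who'd rather not eyeball it, the failed members can be pulled
out of that output with a little awk. This is only a sketch:
failed_members is a made-up name, and it assumes the usual mdstat layout
where a failed member carries an (F) suffix.

```shell
# failed_members: hypothetical helper. Reads mdstat-format text on stdin
# and prints "<array> <member>" for every member flagged (F).
failed_members() {
    awk '/^md[0-9]+ :/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the "[2](F)" suffix
                print $1, dev
            }
    }'
}

# usage:
#   failed_members < /proc/mdstat
```

Fed the output above, it prints "md0 hda1" and nothing for md1.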
                

I hadn't yet put grub on the MBR of hdc, so I finally got around to
it:

grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
                
                
Output looked normal, but after quitting and restarting grub, I get:

grub> find /boot/grub/stage1
 (hd0,0)

grub> 

so I'm not really sure if I have grub on hdc's MBR or not. I've been
referring to:

http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/014331.html

and it makes me think I should also be seeing (hd1,0) in the output.
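
One rough check that doesn't depend on grub's own device map: GRUB
legacy's stage1 carries the literal string "GRUB" in its error messages,
so that string should show up in the first sector of any disk it was
installed to. A minimal sketch, where mbr_has_grub is a hypothetical
helper name; finding the string only suggests stage1 was written there,
not that the box will actually boot from that disk:

```shell
# mbr_has_grub: hypothetical helper. Reads the first 512-byte sector of the
# given device (or image file) and succeeds if the string "GRUB" is in it.
mbr_has_grub() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB
}

# usage (as root):
#   mbr_has_grub /dev/hdc && echo "stage1 string found on hdc"
```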


The article above also mentions using sfdisk to back up and restore
partition tables along the lines of:

  # sfdisk -d /dev/hda > /raidinfo/partitions.hda

and, on the new drive:

  # sfdisk /dev/hda < /raidinfo/partitions.hda

When I try that, I get a device busy message, but the mdadm man page says:

  "-r, --remove  remove listed devices.  They must not be active.  i.e. they
    should be failed or spare devices."

so I'm not sure if it's possible/advisable to make it unbusy.
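
To make the question concrete, the sequence might look something like
the following, written as a dry run that only prints the commands. It
assumes hda is the disk being replaced, hdc is the good one, and the
md0/md1 layout shown above. replace_plan is a made-up name, and with IDE
drives the machine would presumably need to be powered down for the
physical swap between the remove and sfdisk steps.

```shell
# replace_plan: hypothetical helper that only PRINTS commands; nothing is run.
# $1 = good disk (e.g. hdc), $2 = disk being replaced (e.g. hda)
replace_plan() {
    good=$1 bad=$2
    # Drop both partitions of the bad disk from their arrays so it is not busy.
    echo "mdadm /dev/md0 --fail /dev/${bad}1 --remove /dev/${bad}1"
    echo "mdadm /dev/md1 --fail /dev/${bad}2 --remove /dev/${bad}2"
    # Copy the partition layout from the good disk onto the replacement.
    echo "sfdisk -d /dev/${good} | sfdisk /dev/${bad}"
    # Add the new partitions back; md then resyncs them in the background.
    echo "mdadm /dev/md0 --add /dev/${bad}1"
    echo "mdadm /dev/md1 --add /dev/${bad}2"
}

replace_plan hdc hda
```

Printing rather than running keeps it safe to paste on a live box and
makes it easy to review the order before doing anything destructive.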


Here are the questions:

1) Do I have grub on the MBR of hdc? I have a grub boot floppy so I
can probably recover if I'm wrong.

2) Is there a way to use sfdisk as indicated above while the system is
live? If not, I can always partition manually.

3) hda is not dead - is there anything else I should try before
complete replacement?

4) Smartmon has been throwing warnings on _both_ drives... the case is
large, cool to the touch, plenty of airspace around each drive...
anyone care to comment on bad luck vs other possible causes?

Thanks in advance,

Steve