From iznogoud at nobelware.com Tue Jan 2 22:06:36 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Wed, 3 Jan 2018 04:06:36 +0000
Subject: [tclug-list] raid log?
In-Reply-To:
References:
Message-ID: <20180103040636.GA13293@nobelware.com>

> ah, my colleague finally clarified, a resync, not a rebuild, was noticed by
> nagios, glad of that, and i know i'd see it as it happens in /proc/mdstat,
> but surprised to see nothing in /var/log/messages, it looks like
> centos6/mdadm don't log such things?

It should be logged in messages. A resync is reason enough to worry about
hardware issues, and it is really just about hardware issues. I noticed a
"correction" go down on a RAID6 once. A drive had dropped a sector (remapped
it), and that certainly triggered it. A full-blown resync I can only think of
as being worse:
https://unix.stackexchange.com/questions/153243/raid-resyncing-automatically

I am paranoid enough to trigger resyncs by hand, as I had indicated before.
But, apparently, packaged system setups do it too:
https://serverfault.com/questions/255544/reason-for-automatic-raid-resync

You can get emails of events with this:

  mdadm --monitor /dev/md0 -m myusername at mailserver

You can put this in /etc/rc.d/rc.local or a similar place, or just run it
manually with 'at now' from the command line as root.

I did a lot of hacking to test the capabilities of Linux's md RAID. The most
hard-core hack was making 7 sparse files, attaching each of them to a
loopback block device (/dev/loopN), making a RAID5 out of them, and filling
it with a single file that was a pattern of data. Then I wrote a little C
program to go and modify specific regions of one file, altering the data on
the device. I expected read/write operations on the RAID "array" of files to
trigger errors and rebuild events. They did not. Does not inspire
confidence... does it? When I find some free time I will do some more
testing.
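That experiment is easy to reproduce at small scale. Below is a minimal
shell sketch of the same idea; the device names, sizes, and the use of 3
members (rather than 7) are illustrative choices, not the original setup:

  # Build a throwaway RAID5 from sparse files on loopback devices.
  mkdir -p /tmp/mdtest && cd /tmp/mdtest
  for i in 0 1 2; do
      dd if=/dev/zero of=disk$i.img bs=1 count=0 seek=256M  # sparse file
      losetup /dev/loop$i disk$i.img                        # attach loopback
  done
  mdadm --create /dev/md9 --level=5 --raid-devices=3 \
      /dev/loop0 /dev/loop1 /dev/loop2

  # Wait for the initial build to finish before tampering:
  while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 5; done

  # Corrupt one member behind md's back, well past the metadata at the front:
  dd if=/dev/urandom of=disk1.img bs=1M seek=64 count=1 conv=notrunc

  # As observed above, ordinary reads do not flag this. An explicit scrub
  # will at least count the mismatched stripes:
  echo check > /sys/block/md9/md/sync_action
  cat /sys/block/md9/md/mismatch_cnt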
From tclug1 at whitleymott.net Wed Jan 3 20:33:18 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Wed, 3 Jan 2018 20:33:18 -0600
Subject: [tclug-list] raid log?
Message-ID:

> > surprised to see nothing in /var/log/messages, it looks like
> > centos6/mdadm don't log such things?
>
> It should be logged in messages...
>
> You can get emails of events with this:
>   mdadm --monitor /dev/md0 -m myusername at mailserver
> You can put this in /etc/rc.d/rc.local...

or -y to enable mdadm's use of syslog. so apparently unless one of those is
done, nothing is logged by default; indeed there's no mention in
/var/log/messages. the mdadm philosophy leans toward nagios: expect nothing
until you code in the checks/reports you want. nagios reported:

On Sat, Dec 30, 2017 at 9:34 AM, fo4 nagios wrote:

> ***** Nagios *****
>
> Notification Type: PROBLEM
>
> Service: nrpe mdstat
> Host:
> Address:
> State: WARNING
>
> Date/Time: Sat Dec 30 09:34:34 CST 2017
>
> Additional Info:
>
> WARNING - Checked 6 arrays, resync : 2.9%

On Sat, Dec 30, 2017 at 11:24 AM, fo4 nagios wrote:

> ***** Nagios *****
>
> Notification Type: PROBLEM
>
> Service: nrpe mdstat
> Host:
> Address:
> State: WARNING
>
> Date/Time: Sat Dec 30 11:24:05 CST 2017
>
> Additional Info:
>
> WARNING - Checked 6 arrays, resync : 97.7%

which doesn't tell me which array(s) re-sunk.

From iznogoud at nobelware.com Thu Jan 4 10:39:15 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Thu, 4 Jan 2018 16:39:15 +0000
Subject: [tclug-list] raid log?
In-Reply-To:
References:
Message-ID: <20180104163915.GA26612@nobelware.com>

You mean they "re-synched" and nothing sunk, hopefully!

Look in /etc/mdadm.conf for the following:

  # When used in --follow (aka --monitor) mode, mdadm needs a
  # mail address and/or a program. This can be given with "mailaddr"
  # and "program" lines so that monitoring can be started using
  #   mdadm --follow --scan & echo $! > /run/mdadm/mon.pid
  # If the lines are not found, mdadm will exit quietly
  MAILADDR iznogoud at bigpapa
  #PROGRAM /usr/sbin/handle-mdadm-events

The last line is something you can build yourself, but I think the monitoring
capability of mdadm running on its own will suffice. You can put it in
/etc/rc.d/rc.local or similar for your distro to launch at boot.

A personal preference of mine for servers I run is to launch services from
the command line manually when I boot them (does not happen often). I throw
an mdadm --monitor with an 'at now' command. The idea is that if the
monitoring process fails, I will get an email from the cron-job termination.
Again, these are personal preferences, and some admins may disagree with my
practices.

I really want to find some time to play with Linux md and experiment with
RAID and networked RAID.

From iznogoud at nobelware.com Thu Jan 4 10:41:18 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Thu, 4 Jan 2018 16:41:18 +0000
Subject: [tclug-list] raid log?
In-Reply-To: <20180104163915.GA26612@nobelware.com>
References: <20180104163915.GA26612@nobelware.com>
Message-ID: <20180104164118.GB26612@nobelware.com>

One more thing. Is "Factor of 4" your new business or an old one? I looked
at the websites you made and I liked the work.

From tclug1 at whitleymott.net Thu Jan 4 21:02:17 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Thu, 4 Jan 2018 21:02:17 -0600
Subject: [tclug-list] raid log?
Message-ID:

> Look in /etc/mdadm.conf for the following:
>
> # When used in --follow (aka --monitor) mode, mdadm needs a
> # mail address and/or a program. This can be given with "mailaddr"
> MAILADDR iznogoud at bigpapa

hmm, so i gather the usual mode of mdadm isn't monitor mode. perhaps that
could explain why mine didn't email me on the 30th when an array did a
resync. or perhaps the array that did a resync isn't listed in the
mdadm.conf here because it was created since the original install. all
arrays here work, even tho some aren't listed in mdadm.conf, so clearly the
content of mdadm.conf isn't too important for day-to-day operation. mail to
root should forward to me:

> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md0 level=raid1 num-devices=2 UUID=bdfc0b00:4d977c92:2037e6c5:95497a22
> ARRAY /dev/md1 level=raid1 num-devices=2 UUID=2d9b0c63:d8a6542b:d18bb064:57c4b05c
> ARRAY /dev/md2 level=raid1 num-devices=2 UUID=79149f44:17af39a9:0e898977:5f8ede87
> ARRAY /dev/md3 level=raid1 num-devices=2 UUID=d353936a:1327da51:aec60b11:7c3a52de

> when I boot them (does not happen often)

must be your setup doesn't judge each newly released kernel to be relevant
for your site's security.

> One more thing. Is "Factor of 4" your new business or an old one? I looked
> at the websites you made and I liked the work.

i'm just the nerd contracted to shepherd the OS/updates/backups/security,
you can see about the real actors at FactorOf4.net, they are good guys
indeed.

--
this concludes test 42 of big bang inflation dynamics. in the advent of an
actual universe, further instructions will be provided.
000000000000000000000042
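Since nothing is mailed or logged until a monitor is actually running, a
quick check-and-start covers both. A minimal sketch, assuming a CentOS
6/7-style box; the 60-second polling delay is an arbitrary choice:

  # Start a monitor daemon that logs events to syslog (-y / --syslog) and
  # mails the MAILADDR from mdadm.conf, unless one is already running:
  pgrep -f 'mdadm --monitor' > /dev/null || \
      mdadm --monitor --scan --syslog --daemonise --delay=60

  # Generate a harmless TestMessage event for each array to confirm the
  # mail/syslog plumbing actually works:
  mdadm --monitor --scan --oneshot --test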
From tclug1 at whitleymott.net Thu Jan 11 17:07:37 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Thu, 11 Jan 2018 17:07:37 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

my setup is sda and sdb are partitioned identically, boot is raid1/ext3,
swap is raid1, root is raid1/lvm/ext4, each raid1 has a partition on sda and
a partition on sdb.

if sda goes south, will centos7 still boot? it won't work unless the sdb
mbr points to the sdb member of /boot, which means the sdb mbr would not be
an exact copy of the sda mbr.

i kind of expect it to be more simpleminded and less friendly, such that it
will only work if sda is removed and sdb becomes addressable as sda.

or perhaps the mbr is smarter than i expect and looks for a UUID, in which
case it would work either way.

has anyone actually tried this in centos7?

From tclug1 at whitleymott.net Fri Jan 12 16:59:01 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Fri, 12 Jan 2018 16:59:01 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

i just enlarged the sda and sdb partitions underneath /boot, which broke
their raid1 association, and then learned at grub rescue that the
centos7/grub2 mbr indeed looks for the raid device, not just an underlying
partition. so i wonder, if sda were dead, whether it might still just work
if all i do is tell the bios to boot sdb.

On Thu, Jan 11, 2018 at 8:10 PM, Nathan O'Brennan wrote:

> I did try this in Centos 7. My sda failed and I had to do some messing to
> get it to boot as degraded, however it was possible. If I remember
> correctly I had to switch sata cables so sdb looked like sda, and I had
> just happened to manually install the boot onto both mbrs. In the end I
> put a third drive in just for the OS and boot so I could keep my raid
> volumes completely separated.
>
> In a pinch you can also boot using Kali or something similar and manually
> mount the degraded array if you need to pull data off.
>
> On 2018-01-11 16:07, gregrwm wrote:
>
> > my setup is sda and sdb are partitioned identically, boot is raid1/ext3,
> > swap is raid1, root is raid1/lvm/ext4, each raid1 has a partition on sda
> > and a partition on sdb.
> >
> > if sda goes south, will centos7 still boot? it won't work unless the sdb
> > mbr points to the sdb member of /boot, which means the sdb mbr would not
> > be an exact copy of the sda mbr.
> >
> > i kind of expect it to be more simpleminded and less friendly such that
> > it will only work if sda is removed and sdb becomes addressable as sda.
> >
> > or perhaps the mbr is smarter than i expect and looks for a UUID, in
> > which case it would work either way.
> >
> > has anyone actually tried this in centos7?

--
this concludes test 42 of big bang inflation dynamics. in the advent of an
actual universe, further instructions will be provided.
000000000000000000000042
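Nathan's "install the boot onto both mbrs" detail is the part that can be
prepared ahead of a failure. A minimal sketch, assuming BIOS/MBR booting on
CentOS 7; the device names are illustrative:

  # Put GRUB2 boot code in both drives' MBRs. grub2-install embeds the
  # md-raid support it needs, so either MBR should be able to locate the
  # RAID1 /boot via its own surviving member.
  grub2-install /dev/sda
  grub2-install /dev/sdb

  # Sanity check: both first sectors should now identify as boot sectors.
  dd if=/dev/sda bs=512 count=1 2>/dev/null | file -
  dd if=/dev/sdb bs=512 count=1 2>/dev/null | file -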
From iznogoud at nobelware.com Sun Jan 14 13:48:39 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Sun, 14 Jan 2018 19:48:39 +0000
Subject: [tclug-list] boot with sda dead?
In-Reply-To:
References:
Message-ID: <20180114194839.GA26377@nobelware.com>

This response will not be helpful. (What an ironic start to the year!)

> i just enlarged the sda and sdb partitions underneath /boot, which broke
> their raid1 association, and then learned at grub rescue that the
> centos7/grub2 mbr indeed looks for the raid device, not just an underlying
> partition. so i wonder, if sda were dead, whether it might still just work
> if all i do is tell the bios to boot sdb.

Regarding "enlarging": there is no breaking of the RAID association. A given
software RAID setup done with Linux "md" (multiple devices) will read the
header block of the partition/file/drive and determine its association. If
that is inconsistent in size, it may reject the member. But regardless, you
can attach any type of device to it (one that is "greater than or equal to"
in size) and it will rebuild the component it needs to get to a non-degraded
state. Of course you need to be careful and know what you are doing, so that
you do not make unreasonable demands of md...

GRUB is very smart. It is like a fully functional OS. Just as the kernel can
use specific RAID devices via the UUID, so can GRUB. But that does not mean
you should let it decide what you want to have happen... The two-stage
booting procedure of GRUB can be edited as you go; I am not a GRUB expert,
so I keep a lot of notes on the commands I do not remember offhand. As a
helper, I keep README files on the boot (say "/boot") partitions with
descriptions, partition tables, and lots of procedures, complete with GRUB
commands, so that I can bootstrap myself easily if I need to work with
failed drives. The idea is that you can list (ls) and read (cat) files from
GRUB, so you can recover a system without needing another functioning system
with a browser open. (Years of experience working with failed drives
speak...)

From tclug1 at whitleymott.net Mon Jan 15 12:00:35 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Mon, 15 Jan 2018 12:00:35 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

> > i just enlarged the sda and sdb partitions underneath /boot, which broke
> > their raid1 association
>
> Regarding "enlarging": there is no breaking of the RAID association. A
> given software RAID setup done with Linux "md" (multiple devices) will
> read the header block of the partition/file/drive and determine its
> association.

from what i read now i should have used mdadm --grow. but i enlarged the
partitions. then neither grub nor linux could find the array. the enlarged
partitions were fine but the raid1 array had vanished.

From iznogoud at nobelware.com Mon Jan 15 16:02:34 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Mon, 15 Jan 2018 22:02:34 +0000
Subject: [tclug-list] boot with sda dead?
In-Reply-To:
References:
Message-ID: <20180115220234.GA18792@nobelware.com>

> from what i read now i should have used mdadm --grow. but i enlarged the
> partitions. then neither grub nor linux could find the array. the
> enlarged partitions were fine but the raid1 array had vanished.

I see you generally insist on playing the dangerous game of learning on
"production" setups, as in, you try your experiments on hardware with data
that you care about. "Sad."

How about you take my earlier suggestion and create a bunch of software RAID
devices with sparse files and loopback devices, and learn how to play with
those before trying any of this on live data? Roughly speaking...

1. use "dd" to create sparse files (files that are of a specific size but
   have no data in them); look it up on the interwebz
2. associate each file with a virtual block device (loopback device) using
   "losetup" and the like (see man pages)
3. build a sandbox array to play with, using "mdadm", like this:
     mdadm --create --verbose /dev/md0 --level=mirror \
         --raid-devices=2 /dev/loop0 /dev/loop1 \
         --spare-devices=2 /dev/loop3 /dev/loop4
4. make a file system of this... beast, and mount it:
     mkfs.xfs /dev/md0
     mount /dev/md0 /mnt/tmp
     echo "A file of CRAP" > /mnt/tmp/dummy_file
5. try to degrade it and rebuild it:
     mdadm /dev/md0 --fail /dev/loop1
     mdadm /dev/md0 --remove /dev/loop1
     mdadm /dev/md0 --add /dev/loop1
   (Look in /var/log/... for logged events.)
6. "grow" the array without touching the filesystem
7. grow the (XFS) filesystem to match the array size; see the sketch just
   below

Take notes in the process!
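Steps 6 and 7 map onto two commands. A minimal sketch, assuming the sandbox
array above is still mounted at /mnt/tmp and its members now have room to
spare; --size=max is one common choice, and an explicit size in KiB also
works:

  # Grow the md array to use all available space on its members; the
  # filesystem on top is untouched at this point.
  mdadm --grow /dev/md0 --size=max

  # Then grow XFS, online, to match the new array size:
  xfs_growfs /mnt/tmp

  # Confirm the new component size and array state:
  cat /proc/mdstat
  mdadm --detail /dev/md0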
From tclug1 at whitleymott.net Mon Jan 15 18:43:20 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Mon, 15 Jan 2018 18:43:20 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

> > from what i read now i should have used mdadm --grow. but i enlarged the
> > partitions. then neither grub nor linux could find the array. the
> > enlarged partitions were fine but the raid1 array had vanished.
>
> I see you generally insist on playing the dangerous game of learning on
> "production" setups, as in, you try your experiments on hardware with data
> that you care about.

you make unkind assumptions. the box at issue already failed. spurious
reboots. all data was both fully backed up and migrated to working servers
as a matter of standard op procedure. i'm about trying to diagnose it. i
wanted to create a diagnostics partition so the manufacturer diagnostics
could write a progress log that was fetchable even after a spurious reboot.
if the box died further it would be no real loss, but for both my
convenience and learning i tried to do what was needed without killing the
OS, and succeeded.

From iznogoud at nobelware.com Tue Jan 16 07:59:46 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Tue, 16 Jan 2018 13:59:46 +0000
Subject: [tclug-list] boot with sda dead?
In-Reply-To:
References:
Message-ID: <20180116135946.GA21058@nobelware.com>

> you make unkind assumptions. the box at issue already failed. spurious
> reboots. all data was both fully backed up and migrated to working servers
> as a matter of standard op procedure. i'm about trying to diagnose it.

Ah, got it. I went back to your Jan 11 message where you were asking if
anyone had done blah... I thought this was continuing in that spirit.
Sounds like you got a handle on things now.