From iznogoud at nobelware.com Tue Jan 2 22:06:36 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Wed, 3 Jan 2018 04:06:36 +0000
Subject: [tclug-list] raid log?
In-Reply-To:
References:
Message-ID: <20180103040636.GA13293@nobelware.com>

> ah, my colleague finally clarified, a resync, not a rebuild, was noticed by
> nagios, glad of that, and i know i'd see it as it happens in /proc/mdstat,
> but surprised to see nothing in /var/log/messages, it looks like
> centos6/mdadm don't log such things?

It should be logged in messages. A resync is reason enough to worry about
hardware issues, and it is really just about hardware issues. I noticed a
"correction" go down on a RAID6 once. A drive had dropped a sector (remapped
it), and that certainly triggered it. A full-blown resync I can only think of
as being worse:
https://unix.stackexchange.com/questions/153243/raid-resyncing-automatically

I am paranoid enough to trigger resyncs by hand, as I had indicated before.
But, apparently, packaged system setups do it too:
https://serverfault.com/questions/255544/reason-for-automatic-raid-resync

You can get emails of events with this:

  mdadm --monitor /dev/md0 -m myusername at mailserver

You can put this in /etc/rc.d/rc.local or a similar place, or just run it
manually with 'at now' from the command line as root.

I did a lot of hacking to test the capabilities of Linux's md RAID. The most
hard-core hack was making 7 sparse files, attaching each of them to a
loopback block device (/dev/loopN), making a RAID5 out of them, and filling
it with a single file that was a pattern of data. Then I wrote a little C
program to go and modify specific regions of one file, altering the data on
the device. I expected read/write operations on the RAID "array" of files to
trigger errors and rebuild events. They did not. Does not inspire
confidence... does it? When I find some free time I will do some more
testing.
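That experiment is easy to reproduce at small scale. Below is a minimal
shell sketch of the same idea; the device names, sizes, and the use of 3
members (rather than 7) are illustrative choices, not the original setup:

  # Build a throwaway RAID5 from sparse files on loopback devices.
  mkdir -p /tmp/mdtest && cd /tmp/mdtest
  for i in 0 1 2; do
      dd if=/dev/zero of=disk$i.img bs=1 count=0 seek=256M  # sparse file
      losetup /dev/loop$i disk$i.img                        # attach loopback
  done
  mdadm --create /dev/md9 --level=5 --raid-devices=3 \
      /dev/loop0 /dev/loop1 /dev/loop2

  # Wait for the initial build to finish before tampering:
  while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 5; done

  # Corrupt one member behind md's back, well past the metadata at the front:
  dd if=/dev/urandom of=disk1.img bs=1M seek=64 count=1 conv=notrunc

  # As observed above, ordinary reads do not flag this. An explicit scrub
  # will at least count the mismatched stripes:
  echo check > /sys/block/md9/md/sync_action
  cat /sys/block/md9/md/mismatch_cnt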
From tclug1 at whitleymott.net Wed Jan 3 20:33:18 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Wed, 3 Jan 2018 20:33:18 -0600
Subject: [tclug-list] raid log?
Message-ID:

> > surprised to see nothing in /var/log/messages, it looks like
> > centos6/mdadm don't log such things?
>
> It should be logged in messages...
>
> You can get emails of events with this:
>   mdadm --monitor /dev/md0 -m myusername at mailserver
> You can put this in /etc/rc.d/rc.local...

or -y to enable mdadm's use of syslog. so apparently unless one of those is
done, nothing is logged by default; indeed there's no mention in
/var/log/messages. the mdadm philosophy leans toward nagios: expect nothing
until you code in the checks/reports you want. nagios reported:

On Sat, Dec 30, 2017 at 9:34 AM, fo4 nagios wrote:

> ***** Nagios *****
>
> Notification Type: PROBLEM
>
> Service: nrpe mdstat
> Host:
> Address:
> State: WARNING
>
> Date/Time: Sat Dec 30 09:34:34 CST 2017
>
> Additional Info:
>
> WARNING - Checked 6 arrays, resync : 2.9%

On Sat, Dec 30, 2017 at 11:24 AM, fo4 nagios wrote:

> ***** Nagios *****
>
> Notification Type: PROBLEM
>
> Service: nrpe mdstat
> Host:
> Address:
> State: WARNING
>
> Date/Time: Sat Dec 30 11:24:05 CST 2017
>
> Additional Info:
>
> WARNING - Checked 6 arrays, resync : 97.7%

which doesn't tell me which array(s) re-sunk.

From iznogoud at nobelware.com Thu Jan 4 10:39:15 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Thu, 4 Jan 2018 16:39:15 +0000
Subject: [tclug-list] raid log?
In-Reply-To:
References:
Message-ID: <20180104163915.GA26612@nobelware.com>

You mean they "re-synched" and nothing sunk, hopefully!

Look in /etc/mdadm.conf for the following:

  # When used in --follow (aka --monitor) mode, mdadm needs a
  # mail address and/or a program. This can be given with "mailaddr"
  # and "program" lines so that monitoring can be started using
  #   mdadm --follow --scan & echo $! > /run/mdadm/mon.pid
  # If the lines are not found, mdadm will exit quietly
  MAILADDR iznogoud at bigpapa
  #PROGRAM /usr/sbin/handle-mdadm-events

The last line is something you can build yourself, but I think the monitoring
capability of mdadm running on its own will suffice. You can put it in
/etc/rc.d/rc.local or similar for your distro to launch at boot.

A personal preference of mine for servers I run is to launch services from
the command line manually when I boot them (does not happen often). I throw
an mdadm --monitor with an 'at now' command. The idea is that if the
monitoring process fails, I will get an email from the cron-job termination.
Again, these are personal preferences, and some admins may disagree with my
practices.

I really want to find some time to play with Linux md and experiment with
RAID and networked RAID.

From iznogoud at nobelware.com Thu Jan 4 10:41:18 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Thu, 4 Jan 2018 16:41:18 +0000
Subject: [tclug-list] raid log?
In-Reply-To: <20180104163915.GA26612@nobelware.com>
References: <20180104163915.GA26612@nobelware.com>
Message-ID: <20180104164118.GB26612@nobelware.com>

One more thing. Is "Factor of 4" your new business or an old one? I looked
at the websites you made and I liked the work.

From tclug1 at whitleymott.net Thu Jan 4 21:02:17 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Thu, 4 Jan 2018 21:02:17 -0600
Subject: [tclug-list] raid log?
Message-ID:

> Look in /etc/mdadm.conf for the following:
>
> # When used in --follow (aka --monitor) mode, mdadm needs a
> # mail address and/or a program. This can be given with "mailaddr"
> MAILADDR iznogoud at bigpapa

hmm, so i gather the usual mode of mdadm isn't monitor mode. perhaps that
could explain why mine didn't email me on the 30th when an array did a
resync. or perhaps the array that did a resync isn't listed in the
mdadm.conf here because it was created since the original install. all
arrays here work, even tho some aren't listed in mdadm.conf, so clearly the
content of mdadm.conf isn't too important for day-to-day operation. mail to
root should forward to me:

> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md0 level=raid1 num-devices=2 UUID=bdfc0b00:4d977c92:2037e6c5:95497a22
> ARRAY /dev/md1 level=raid1 num-devices=2 UUID=2d9b0c63:d8a6542b:d18bb064:57c4b05c
> ARRAY /dev/md2 level=raid1 num-devices=2 UUID=79149f44:17af39a9:0e898977:5f8ede87
> ARRAY /dev/md3 level=raid1 num-devices=2 UUID=d353936a:1327da51:aec60b11:7c3a52de

> when I boot them (does not happen often)

must be your setup doesn't judge each newly released kernel to be relevant
for your site's security.

> One more thing. Is "Factor of 4" your new business or an old one? I looked
> at the websites you made and I liked the work.

i'm just the nerd contracted to shepherd the OS/updates/backups/security,
you can see about the real actors at FactorOf4.net, they are good guys
indeed.

--
this concludes test 42 of big bang inflation dynamics. in the advent of an
actual universe, further instructions will be provided.
000000000000000000000042
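Since nothing is mailed or logged until a monitor is actually running, a
quick check-and-start covers both. A minimal sketch, assuming a CentOS
6/7-style box; the 60-second polling delay is an arbitrary choice:

  # Start a monitor daemon that logs events to syslog (-y / --syslog) and
  # mails the MAILADDR from mdadm.conf, unless one is already running:
  pgrep -f 'mdadm --monitor' > /dev/null || \
      mdadm --monitor --scan --syslog --daemonise --delay=60

  # Generate a harmless TestMessage event for each array to confirm the
  # mail/syslog plumbing actually works:
  mdadm --monitor --scan --oneshot --test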
From tclug1 at whitleymott.net Thu Jan 11 17:07:37 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Thu, 11 Jan 2018 17:07:37 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

my setup is sda and sdb are partitioned identically, boot is raid1/ext3,
swap is raid1, root is raid1/lvm/ext4, each raid1 has a partition on sda and
a partition on sdb.

if sda goes south, will centos7 still boot? it won't work unless the sdb
mbr points to the sdb member of /boot, which means the sdb mbr would not be
an exact copy of the sda mbr.

i kind of expect it to be more simpleminded and less friendly, such that it
will only work if sda is removed and sdb becomes addressable as sda.

or perhaps the mbr is smarter than i expect and looks for a UUID, in which
case it would work either way.

has anyone actually tried this in centos7?

From tclug1 at whitleymott.net Fri Jan 12 16:59:01 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Fri, 12 Jan 2018 16:59:01 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

i just enlarged the sda and sdb partitions underneath /boot, which broke
their raid1 association, and then learned at grub rescue that the
centos7/grub2 mbr indeed looks for the raid device, not just an underlying
partition. so i wonder, if sda were dead, whether it might still just work
if all i do is tell the bios to boot sdb.

On Thu, Jan 11, 2018 at 8:10 PM, Nathan O'Brennan wrote:

> I did try this in Centos 7. My sda failed and I had to do some messing to
> get it to boot as degraded, however it was possible. If I remember
> correctly I had to switch sata cables so sdb looked like sda, and I had
> just happened to manually install the boot onto both mbrs. In the end I
> put a third drive in just for the OS and boot so I could keep my raid
> volumes completely separated.
>
> In a pinch you can also boot using Kali or something similar and manually
> mount the degraded array if you need to pull data off.
>
> On 2018-01-11 16:07, gregrwm wrote:
>
> > my setup is sda and sdb are partitioned identically, boot is raid1/ext3,
> > swap is raid1, root is raid1/lvm/ext4, each raid1 has a partition on sda
> > and a partition on sdb.
> >
> > if sda goes south, will centos7 still boot? it won't work unless the sdb
> > mbr points to the sdb member of /boot, which means the sdb mbr would not
> > be an exact copy of the sda mbr.
> >
> > i kind of expect it to be more simpleminded and less friendly such that
> > it will only work if sda is removed and sdb becomes addressable as sda.
> >
> > or perhaps the mbr is smarter than i expect and looks for a UUID, in
> > which case it would work either way.
> >
> > has anyone actually tried this in centos7?

--
this concludes test 42 of big bang inflation dynamics. in the advent of an
actual universe, further instructions will be provided.
000000000000000000000042
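Nathan's "install the boot onto both mbrs" detail is the part that can be
prepared ahead of a failure. A minimal sketch, assuming BIOS/MBR booting on
CentOS 7; the device names are illustrative:

  # Put GRUB2 boot code in both drives' MBRs. grub2-install embeds the
  # md-raid support it needs, so either MBR should be able to locate the
  # RAID1 /boot via its own surviving member.
  grub2-install /dev/sda
  grub2-install /dev/sdb

  # Sanity check: both first sectors should now identify as boot sectors.
  dd if=/dev/sda bs=512 count=1 2>/dev/null | file -
  dd if=/dev/sdb bs=512 count=1 2>/dev/null | file -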
From iznogoud at nobelware.com Sun Jan 14 13:48:39 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Sun, 14 Jan 2018 19:48:39 +0000
Subject: [tclug-list] boot with sda dead?
In-Reply-To:
References:
Message-ID: <20180114194839.GA26377@nobelware.com>

This response will not be helpful. (What an ironic start to the year!)

> i just enlarged the sda and sdb partitions underneath /boot, which broke
> their raid1 association, and then learned at grub rescue that the
> centos7/grub2 mbr indeed looks for the raid device, not just an underlying
> partition. so i wonder, if sda were dead, whether it might still just work
> if all i do is tell the bios to boot sdb.

Regarding "enlarging": there is no breaking of the RAID association. A given
software RAID setup done with Linux "md" (multiple devices) will read the
header block of the partition/file/drive and determine its association. If
that is inconsistent in size, it may reject the member. But regardless, you
can attach any type of device to it (one that is "greater than or equal to"
in size) and it will rebuild the component it needs to get to a non-degraded
state. Of course you need to be careful and know what you are doing, so that
you do not make unreasonable demands of md...

GRUB is very smart. It is like a fully functional OS. Just as the kernel can
use specific RAID devices via the UUID, so can GRUB. But that does not mean
you should let it decide what you want to have happen... The two-stage
booting procedure of GRUB can be edited as you go; I am not a GRUB expert,
so I keep a lot of notes on the commands I do not remember offhand. As a
helper, I keep README files on the boot (say "/boot") partitions with
descriptions, partition tables, and lots of procedures, complete with GRUB
commands, so that I can bootstrap myself easily if I need to work with
failed drives. The idea is that you can list (ls) and read (cat) files from
GRUB, so you can recover a system without needing another functioning system
with a browser open. (Years of experience working with failed drives
speak...)

From tclug1 at whitleymott.net Mon Jan 15 12:00:35 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Mon, 15 Jan 2018 12:00:35 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

> > i just enlarged the sda and sdb partitions underneath /boot, which broke
> > their raid1 association
>
> Regarding "enlarging": there is no breaking of the RAID association. A
> given software RAID setup done with Linux "md" (multiple devices) will
> read the header block of the partition/file/drive and determine its
> association.

from what i read now i should have used mdadm --grow. but i enlarged the
partitions. then neither grub nor linux could find the array. the enlarged
partitions were fine but the raid1 array had vanished.

From iznogoud at nobelware.com Mon Jan 15 16:02:34 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Mon, 15 Jan 2018 22:02:34 +0000
Subject: [tclug-list] boot with sda dead?
In-Reply-To:
References:
Message-ID: <20180115220234.GA18792@nobelware.com>

> from what i read now i should have used mdadm --grow. but i enlarged the
> partitions. then neither grub nor linux could find the array. the
> enlarged partitions were fine but the raid1 array had vanished.

I see you generally insist on playing the dangerous game of learning on
"production" setups, as in, you try your experiments on hardware with data
that you care about. "Sad."

How about you take my earlier suggestion and create a bunch of software RAID
devices with sparse files and loopback devices, and learn how to play with
those before trying any of this on live data? Roughly speaking...

1. use "dd" to create sparse files (files that are of a specific size but
   have no data in them); look it up on the interwebz
2. associate each file with a virtual block device (loopback device) using
   "losetup" and the like (see man pages)
3. build a sandbox array to play with, using "mdadm", like this:
     mdadm --create --verbose /dev/md0 --level=mirror \
         --raid-devices=2 /dev/loop0 /dev/loop1 \
         --spare-devices=2 /dev/loop3 /dev/loop4
4. make a file system of this... beast, and mount it:
     mkfs.xfs /dev/md0
     mount /dev/md0 /mnt/tmp
     echo "A file of CRAP" > /mnt/tmp/dummy_file
5. try to degrade it and rebuild it:
     mdadm /dev/md0 --fail /dev/loop1
     mdadm /dev/md0 --remove /dev/loop1
     mdadm /dev/md0 --add /dev/loop1
   (Look in /var/log/... for logged events.)
6. "grow" the array without touching the filesystem
7. grow the (XFS) filesystem to match the array size; see the sketch just
   below

Take notes in the process!
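Steps 6 and 7 map onto two commands. A minimal sketch, assuming the sandbox
array above is still mounted at /mnt/tmp and its members now have room to
spare; --size=max is one common choice, and an explicit size in KiB also
works:

  # Grow the md array to use all available space on its members; the
  # filesystem on top is untouched at this point.
  mdadm --grow /dev/md0 --size=max

  # Then grow XFS, online, to match the new array size:
  xfs_growfs /mnt/tmp

  # Confirm the new component size and array state:
  cat /proc/mdstat
  mdadm --detail /dev/md0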
From tclug1 at whitleymott.net Mon Jan 15 18:43:20 2018
From: tclug1 at whitleymott.net (gregrwm)
Date: Mon, 15 Jan 2018 18:43:20 -0600
Subject: [tclug-list] boot with sda dead?
Message-ID:

> > from what i read now i should have used mdadm --grow. but i enlarged the
> > partitions. then neither grub nor linux could find the array. the
> > enlarged partitions were fine but the raid1 array had vanished.
>
> I see you generally insist on playing the dangerous game of learning on
> "production" setups, as in, you try your experiments on hardware with data
> that you care about.

you make unkind assumptions. the box at issue already failed. spurious
reboots. all data was both fully backed up and migrated to working servers
as a matter of standard op procedure. i'm about trying to diagnose it. i
wanted to create a diagnostics partition so the manufacturer diagnostics
could write a progress log that was fetchable even after a spurious reboot.
if the box died further it would be no real loss, but for both my
convenience and learning i tried to do what was needed without killing the
OS, and succeeded.

From iznogoud at nobelware.com Tue Jan 16 07:59:46 2018
From: iznogoud at nobelware.com (Iznogoud)
Date: Tue, 16 Jan 2018 13:59:46 +0000
Subject: [tclug-list] boot with sda dead?
In-Reply-To:
References:
Message-ID: <20180116135946.GA21058@nobelware.com>

> you make unkind assumptions. the box at issue already failed. spurious
> reboots. all data was both fully backed up and migrated to working servers
> as a matter of standard op procedure. i'm about trying to diagnose it.

Ah, got it. I went back to your Jan 11 message where you were asking if
anyone had done blah... I thought this was continuing in that spirit.
Sounds like you got a handle on things now.