On Mon, Jun 4, 2018 at 5:55 PM, Iznogoud <iznogoud at nobelware.com> wrote:
>> If it's a power supply failing or some component overheating, you are
>> unlikely to be able to find that in any standard logfile.
>> Good luck with your forensics.
>
> Randy nailed it. This is most likely what is going on, especially if it just
> started, and after no significant configuration change or software upgrade.
>
> We have servers that drop out of the compute cluster and come back up. We do
> not know why...
>
> I will take this opportunity to say that if you are running a server, you want:
>
> 1. a UPS that is monitored by the system itself
> 2. a solid and robust backup scheme
> 3. some level of driver RAID
> 4. some external notification that the server is having issues
> 5. (I recommend) no automatic reboot on failure
>

Greetings

The issue is, imo anyway, misbehaving software.

Have a UPS and working on connecting the much larger one that I also have.

What are you recommending for backup? I have an early model blu-ray writer
and even 25GB of storage doesn't go very far!

Drives are on Raid-10.

External notification would be nice but I don't want even more 'noises in
the night' so that won't likely be happening.

May just reset this some time but I don't reboot the server often so it may
have to wait for a bit.

Thanks for the ideas all.

Dee