[tclug-list] Server shutdown questions

Mon Jun 4 17:55:28 CDT 2018

> If it's a power supply failing or some component overheating, you are
> unlikely to be able to find that in any standard logfile.
> Good luck with your forensics.

Randy nailed it. This is most likely what is going on, especially if the
shutdowns started only recently and were not preceded by any significant
configuration change or software upgrade.

We have servers that drop out of the compute cluster and come back up. We do
not know why...

I will take this opportunity to say that if you are running a server, you want:

1. a UPS that is monitored by the system itself
2. a solid and robust backup scheme
3. some level of drive RAID
4. some external notification that the server is having issues
5. (I recommend) no automatic reboot on failure
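
As a rough illustration of item 4, here is a minimal sketch of an external
check, meant to run on a different machine from the server being watched. It
only tests whether a TCP port still accepts connections. The function names,
the host names, and the choice of port 22 are all assumptions made for the
example, and a real setup would hook the result into mail or paging rather
than just printing it.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_servers(servers, probe=is_reachable):
    """Return the (host, port) pairs that did not answer the probe."""
    return [(host, port) for host, port in servers if not probe(host, port)]


# Example use (hypothetical host names):
#   down = check_servers([("node01.example.org", 22), ("node02.example.org", 22)])
#   for host, port in down:
#       print(f"ALERT: {host}:{port} is not answering")
```

Run from cron on a separate box, something like this will not tell you why a
server dropped out, but it will at least tell you when it happened.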