On Wed, Sep 01, 2010 at 09:21:59AM -0500, Scott Downing wrote:
> This morning one of our Apache servers (running Ubuntu) went haywire;
> something on it used all the memory, making it completely
> inaccessible. It was given a hard reboot and everything on it is back to
> normal. I'm not really sure how I can track down which process is
> responsible for this. In the syslog I see messages like:
> Out of memory: kill process 30525 (apache2) score 2525474 or a child
> But I'm not sure what process that was nor what to do to figure that out.
> Maybe I can't find the real reason this time but for future crashes of
> this type is there anything I can be doing to collect better information?

It's probably not possible to track down what was using the memory unless
you have some sort of process accounting running.  The Linux OOM killer is
quite unsophisticated: it picks its victim by a heuristic "badness" score
(the score in your syslog line) and may kill one of that process's
children instead, so the process it kills isn't necessarily the one that
exhausted the memory.
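
As far as I know, the kernel exposes the same kind of score for every
running process in /proc/<pid>/oom_score, so on a live system you can see
which processes it would currently prefer to kill.  Below is a minimal
Python sketch of that idea; the function name rank_by_oom_score and its
parameters are my own invention for illustration, and the proc_root
argument only exists so the function can be pointed at a test directory.

```python
import os


def rank_by_oom_score(proc_root="/proc", limit=10):
    """Return up to `limit` (score, pid, name) tuples, highest score first."""
    ranked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return ranked  # no procfs here (for example, not a Linux host)
    for entry in entries:
        if not entry.isdigit():
            continue  # only the numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            name = "?"
            with open(os.path.join(base, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(":", 1)[1].strip()
                        break
        except (OSError, ValueError):
            continue  # the process exited or the file was unreadable
        ranked.append((score, int(entry), name))
    ranked.sort(reverse=True)
    return ranked[:limit]
```

Calling rank_by_oom_score() with no arguments on the server and printing
the result shows the current candidates.  It only describes the moment
you run it, so it will not tell you what happened before the reboot.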

I recommend running something like collectl so that you will have an
accounting log to play back after a crash.  Of course, in the event of an
OOM occurrence even collectl may not be able to get enough memory to log,
but you will at least have a picture of the minutes before the OOM event
and can see what is likely to have used up most of the memory.

http://collectl.sourceforge.net/
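
If you want something much cruder in the meantime, the same "picture of
the minutes before" can be approximated by logging the largest processes
at a regular interval.  This is only a rough Python sketch and not how
collectl itself works: it reads the VmRSS field from /proc/<pid>/status,
and the names read_rss_kb, top_rss and format_snapshot are hypothetical
helpers I made up for the example.

```python
import os
import time


def read_rss_kb(status_text):
    """Extract (name, rss_kb) from the text of a /proc/<pid>/status file."""
    name, rss_kb = "?", 0
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])  # the kernel reports this in kB
    return name, rss_kb  # kernel threads have no VmRSS line, so they get 0


def top_rss(proc_root="/proc", limit=5):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest first."""
    rows = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return rows
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss_kb = read_rss_kb(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process went away while we were reading
        rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


def format_snapshot(rows, now=None):
    """Format one timestamped log line from the rows returned by top_rss."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    procs = " ".join("%s[%d]=%dkB" % (name, pid, rss) for rss, pid, name in rows)
    return "%s %s" % (stamp, procs)
```

Appending format_snapshot(top_rss()) to a file once a minute (from cron,
say) would leave a trail to read after a reboot, with the same caveat as
above: once the box is truly out of memory the logger may not get to run.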

-- 
Gabe Turner                                             gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota
Supercomputing Institute                          http://www.msi.umn.edu