Debugging load spikes. - Tina Marie's Ramblings
Debugging load spikes.
(alioth1, I'm particularly looking at you)

Over the last few weeks, I've been seeing weird lag spikes on my RHEL4 box. My load average spends most of its time around .1-.2, but a few times lately I've seen sendmail shut down mail processing because it spiked as high as 14 for a minute or two. The times are mostly random - 7:40am, 12:02am - and don't seem to correlate with anything else that I can find in my logfiles.

I did a ton of reading last night that basically just said to bump up the sendmail limit, because Linux calculates load average differently, and sendmail can happily survive at higher load averages. So I bumped it up to 18, and added some other sendmail-efficiency configuration options I was missing.

So sendmail's happy now. But I'm still curious what's spiking my LA. top -b -n 10 can be routed to a file to capture what is going on at a given time. I suppose I need a daemon that just watches the load average and fires off top to a logfile? Is there an easier way?

From: ptomblin_lj Date: September 30th, 2007 11:19 pm (UTC)
Usually a load average spike like that indicates something that's thrashing your disks very hard so all your processes go I/O bound. Sometimes it's something that's chewing up all the memory and causing swapping.
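One way to check the disk-bound theory while a spike is in progress: on Linux, the load average counts processes stuck in uninterruptible sleep (state "D", usually waiting on I/O) as well as runnable ones. The following is a minimal sketch, assuming Python 3 is available on the box; the function names are made up for illustration. It lists whatever is in state D by reading /proc/<pid>/stat.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the first "(" and the last ")".
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible(proc="/proc"):
    """List (pid, command) for every process currently in state D."""
    found = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were looking
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible() during a spike and seeing a long list would point toward disk or swap trouble. An empty list would suggest the load is coming from runnable, CPU-bound processes instead.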

From: skywhisperer Date: October 1st, 2007 12:35 am (UTC)
That's what I was thinking. Some of the reading I did suggested it might be an early sign of a failing hard drive, and I'd like to know about that sooner rather than later.

Backups or not, I'd be very unhappy if I had to bring that box back up from scratch.
From: ptomblin_lj Date: October 1st, 2007 12:50 am (UTC)
Are you running smartd on it?
From: skywhisperer Date: October 1st, 2007 04:15 am (UTC)
Oh, yes!

I used to keep my home Linux box (not the one having this problem - it's in a datacenter) in a closet. A closet with airflow, but a closet. One day it got a little warm, and one of the older hard drives died. It sent me more email than you could imagine. Trouble was, I was at work, and by the time I got home it was dead.

But I'm a firm believer in smartd now.
From: alioth1 Date: October 1st, 2007 06:43 pm (UTC)
I've had some humungous spam storms recently - and when exim and spamd do the tango, all while a dozen users (who don't like clearing out their inboxes often) hit the IMAP server, it can send the load average rocketing.

If it's not a failing hard disc, you can always write a perl script or shell script that periodically reads /proc/loadavg and when the first value hits a certain threshold, run 'ps auxww' and record the output to a file.
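A rough sketch of that kind of watcher, written in Python rather than perl or shell and assuming Python 3.7 or later is installed. The threshold, polling interval, and log path are hypothetical values picked for illustration, not anything recommended in this thread.

```python
import subprocess
import time
from datetime import datetime

LOADAVG_PATH = "/proc/loadavg"
THRESHOLD = 5.0                       # hypothetical: snapshot at or above this
INTERVAL = 15                         # hypothetical: seconds between checks
LOGFILE = "/var/tmp/load-spikes.log"  # hypothetical log location


def parse_loadavg(text):
    """Return the 1-minute load average (first field of /proc/loadavg)."""
    return float(text.split()[0])


def should_snapshot(load, threshold=THRESHOLD):
    """True when the load has reached the threshold."""
    return load >= threshold


def snapshot(logfile=LOGFILE):
    """Append a timestamped 'ps auxww' listing to the log file."""
    result = subprocess.run(["ps", "auxww"], capture_output=True, text=True)
    with open(logfile, "a") as f:
        f.write("=== %s ===\n%s\n" % (datetime.now().isoformat(), result.stdout))


def watch():
    """Poll the load average forever, logging a process list on each spike."""
    while True:
        with open(LOADAVG_PATH) as f:
            load = parse_loadavg(f.read())
        if should_snapshot(load):
            snapshot()
        time.sleep(INTERVAL)
```

Adding a final watch() call and starting the script under nohup (or from an init script) would leave it running in the background. Swapping the ps command for top -b -n 1 would capture CPU usage per process instead.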

Anything interesting in the dmesg or /var/log/messages?
From: skywhisperer Date: October 2nd, 2007 06:44 pm (UTC)
Nothing interesting in the messages. But yesterday I found "monit", and after a few bumps getting it set up and configured, it now sends email to my cell phone when things start getting busy. So far, I've not managed to be somewhere where I could look at things when it happens, but I have tracked down a few processes that don't need to be around (webstats do not need to be updated once an hour, for example).
