How Twitter crashed our server (Mongrel/Rails)
Posted in Bryan's Blog at 07:53PM on 03/21/2008

This was entirely my fault, just a dumb programming mistake on my part, BUT it opened my eyes to a few weird things to look for while debugging. Short version: (1) look for stuck or dead mongrels, (2) check log sizes, and (3) check for stuck processes that are interfering with the mongrels (lsof -i -n -P | less).

At 5:20, I got an email from one of our web girls saying that the sites were starting to act flaky. After browsing through the sites a bit, I saw what she was talking about. Some pages loaded instantly, some took a lot longer... it seemed entirely random, which matched the symptoms we get when our log files grow way too big. Last time this happened, I was able to quickly rotate the logs, bounce the mongrels, and restart Apache. This time, I did the same thing and noticed some initial improvement, but not a complete resolution.

So we limped along while I got hold of our system admin and had him take a look. He noticed a pair of curl processes that had been running for the previous two hours. While he was looking there, I noticed that one of our mongrel clusters that is supposed to have 9 mongrels running was down to 7, and when I tried bringing one back, we got an "address in use" error. ps aux showed that no mongrel was running on that port, so something else had to be clogging it up. It turns out a curl process was holding that port, blocking the mongrel from starting correctly. My best guess at the "for some reason": a child process launched with system() inherits the parent's open sockets, so the stuck curl probably kept the mongrel's listening socket open after the mongrel itself was gone. Pretty freakin' weird, eh?

Here's the line that caused the problem:

system("curl -u krvn:supersecretpassword -d status=\"#{title} #{tinyurl}\" http://twitter.com/statuses/update.xml")

Yep... I had been testing that line to post our headlines to Twitter (http://twitter.com/krvn). Apparently, the Twitter server got flaky and my curl processes never timed out, since curl puts no limit on the total time of a request unless you ask for one. It never occurred to me that this could be a problem. So, I modified the line for now:

system("curl -u krvn:supersecretpassword -d status=\"#{title} #{tinyurl}\" http://twitter.com/statuses/update.xml --connect-timeout 30 --max-time 30")
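An alternative to shelling out to curl is to make the request from Ruby itself with Net::HTTP and explicit timeouts, which also avoids leaving a separate process behind. This is only a minimal sketch, assuming the same endpoint as the line above; the helper names (build_client, build_status_request, post_status) and the timeout values are illustrative choices of mine, not part of the original code.

```ruby
require "net/http"
require "uri"

# Endpoint taken from the curl line above.
UPDATE_URL = "http://twitter.com/statuses/update.xml"

# Build an HTTP client that gives up instead of hanging forever.
# The default timeout values are assumptions; tune them to taste.
def build_client(url, open_timeout = 10, read_timeout = 20)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = open_timeout  # seconds allowed to establish the connection
  http.read_timeout = read_timeout  # seconds allowed for each individual read
  http
end

# Build the POST request; set_form_data URL-encodes the status text,
# so quotes or ampersands in a headline cannot break the request.
def build_status_request(url, user, password, title, tinyurl)
  request = Net::HTTP::Post.new(URI.parse(url).path)
  request.basic_auth(user, password)
  request.set_form_data("status" => "#{title} #{tinyurl}")
  request
end

# Post the status; return true on a 2xx response and false on any
# timeout or connection failure, so the caller is never blocked.
def post_status(user, password, title, tinyurl)
  http = build_client(UPDATE_URL)
  request = build_status_request(UPDATE_URL, user, password, title, tinyurl)
  http.request(request).is_a?(Net::HTTPSuccess)
rescue Timeout::Error, SystemCallError, SocketError => e
  warn "Twitter update failed: #{e.class}: #{e.message}"
  false
end
```

One caveat: read_timeout applies to each read from the socket, not to the request as a whole, so it is not an exact match for curl's --max-time.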

I'm looking for a better way to monitor the processes and check for this kind of thing. Normally, we are only down for a minute or two while software updates are installed, and we try to do that in the dead of night when no one is online.
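One possible starting point for that kind of monitoring is a small script that asks each mongrel for a page and flags the ones that do not answer in time. This is a rough sketch under my own assumptions: the helper names (mongrel_responding?, unresponsive_ports) are made up for illustration, and the port range in the usage note is hypothetical, since the real cluster ports are not listed here. It makes a full HTTP request rather than just opening a connection, because a stray process holding the listening socket would still let a connection through without ever answering it.

```ruby
require "net/http"

# True only if some HTTP response comes back from host:port before the
# timeouts expire. Any status code counts, since even a 404 or 500
# proves that a process is actually serving requests on that port.
def mongrel_responding?(host, port, timeout = 5)
  # Passing nil as the proxy address keeps the check from being routed
  # through a proxy configured in the environment.
  http = Net::HTTP.new(host, port, nil)
  http.open_timeout = timeout  # seconds allowed to connect
  http.read_timeout = timeout  # seconds allowed for each read
  http.get("/")
  true
rescue StandardError
  # Refused connections, timeouts, and garbled replies all count as "down".
  false
end

# Return the ports from the given list or range that failed the check.
def unresponsive_ports(host, ports)
  ports.reject { |port| mongrel_responding?(host, port) }
end
```

Run from cron with something like unresponsive_ports("127.0.0.1", 8000..8008), it returns the ports that need attention, and an empty array when every mongrel answers.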

What is your mongrel debugging process? How do you guys rotate your logs?
I'm looking for better ways to go about this. Please post your thoughts!
