[TCLUG] High Volume Mail Relay

Fri Jan 11 11:50:37 CST 2002

[Sorry if this is a duplicate ... My reply early this morning was sent from
 the "wrong" address.... -Scott]

Heh.  As a former employee of Sendmail and someone who's done a lot of
research and implementation on high-volume mail servers, I'd have a lot
to say/type on the matter ... except for that darned tendinitis
recovery problem.

>>>>> "mb" == Michael Burns <sextus at visi.com> writes:

mb> Anything, in my experience, is better than sendmail. qmail and
mb> postfix are two.

Er, Sendmail 8.11 and 8.12 have a "multiple queue directory" feature
that hasn't received much publicity but can rival Postfix's speed
despite still being a fork()-happy resource pig.  :-) Back in May
2001, Sendmail's message delivery record on a *single* machine in
*real* Internet conditions (i.e. not in a perfect lab environment) was
2.98 million messages/hour.  That's about 825 messages/second.
Average message size was 4.9KB.
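
[A quick sanity check of those figures, as a minimal sketch. The only
assumption not in the post is that 1 KB is taken as 1000 bytes; the function
names are made up for illustration.]

```python
# Back-of-the-envelope check of the delivery figures quoted above.
MESSAGES_PER_HOUR = 2_980_000  # from the post
AVG_MESSAGE_KB = 4.9           # from the post; 1 KB = 1000 bytes is an assumption


def per_second(messages_per_hour: float) -> float:
    """Convert an hourly message count to messages per second."""
    return messages_per_hour / 3600


def payload_mb_per_second(messages_per_hour: float, avg_kb: float) -> float:
    """Message payload moved per second, in MB (headers/SMTP overhead ignored)."""
    return per_second(messages_per_hour) * avg_kb / 1000


if __name__ == "__main__":
    print(f"{per_second(MESSAGES_PER_HOUR):.1f} messages/second")
    print(f"{payload_mb_per_second(MESSAGES_PER_HOUR, AVG_MESSAGE_KB):.2f} MB/second")
```

That works out to roughly 828 messages/second (in line with the "about 825"
above) and only about 4 MB/second of payload. The raw byte rate is modest;
it's the per-message work that hurts.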

Rob Kolstad had a USENIX paper a few years ago about tuning Sendmail.
Much of what it talks about is still valid, despite changing times and
MTAs.  I can't think of other public email tuning resources off the
top of my head, though they do exist if you dig.

For high volume delivery, touching disk == performance death.  Period.
So, avoid touching disk if you can.  Most machines don't have
non-volatile memory, so pure solid state disk or hybrid SSDs with hard
disk backing store (and small batteries to keep the drive alive long
enough to flush cache memory contents to that disk) are the way to go.

But wait!  Those things are expensive!  Yup.  But what if I need to
queue more stuff than that?  Then don't store it on SSD: move it
elsewhere.  Sendmail has a "fallback host" feature: if delivery fails
on the first attempt, forward the message to the fallback.  The
fallback has a good disk system for delivery queue storage, but
doesn't need SSD.  You *know* (or **hope**) most of your messages are
delivered on the first attempt, and you *know* this small fraction
remaining are going to have to wait, so you tune the fallback hardware
much differently than your first-attempt server(s).

I don't know if other MTAs have the fallback feature, but for high
volume outgoing delivery, it's wonderful.
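
[The two-tier idea can be sketched in a few lines. This is only an
illustration of the flow, not how Sendmail implements it: the class, the
`deliver` callable, and the `hand_to_fallback` callable are all hypothetical
stand-ins.]

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    recipient: str
    body: str


@dataclass
class FirstAttemptServer:
    """Tries each message exactly once; failures go straight to the fallback."""

    deliver: Callable[[Message], bool]           # hypothetical: True on success
    hand_to_fallback: Callable[[Message], None]  # hypothetical relay to fallback host
    delivered: int = 0
    deferred: int = 0

    def submit(self, msg: Message) -> None:
        if self.deliver(msg):
            self.delivered += 1
        else:
            # No local retry queue: the slow cases leave this machine at
            # once, so its fast (expensive) storage stays nearly empty.
            self.hand_to_fallback(msg)
            self.deferred += 1


if __name__ == "__main__":
    fallback_queue = []  # stands in for the fallback host's on-disk queue
    server = FirstAttemptServer(
        deliver=lambda m: not m.recipient.endswith("@slow.example"),
        hand_to_fallback=fallback_queue.append,
    )
    for rcpt in ["a@fast.example", "b@slow.example", "c@fast.example"]:
        server.submit(Message(rcpt, "hello"))
    print(server.delivered, "delivered,", server.deferred, "sent to fallback")
```

The design point is that the first-attempt box never retries anything
itself, which is why the two machines can be tuned so differently.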

Gotta run, but, er, my $0.02 on other advice I've read on this list:

* File system choice makes a huge difference.  You must decide how
important it is to recover queued messages in event of an OS crash or
hardware failure.  Friends don't let friends who care about data
integrity use ext2fs.  (See Dug Song's comments on this in a recent
/.-publicized interview.)  GFS, spiffy as it is, won't give you the
file ops you need for 825 msgs/second.  Softupdates (which *are* quite
data safe when using fsync(), as paranoid app writers have to be) or
Veritas's VxFS + SSD or random-disk-I/O-designed-and-tuned RAID is
quite marvelous.

* RAID: You're I/O bound due to random disk activity, reads & writes,
not bulk data throughput.  If you have a 500GB RAID array for
queueing, you have too much disk space, but you can't help that
because you can't buy 2GB disks anymore.  Striping spreads the
seek activity the most.  Mirror on top of that if you care about crash
recovery.  If you use parity, you deserve what you get.  If you take
that array and stripe it over 8 60GB drives, what a deal.  But (I'm
exaggerating here to make a point!) you're better off striping it
across 60 8GB drives.

* 15K RPM drives won't help nearly as much as lots of slower spinning
drives will, and you won't have to worry about your machine room
catching fire.

* Your drives shouldn't be IDE drives unless you want to deserve what
you get.

* Most people don't pay attention to their SMTP server's DNS servers.
Silly people!  How on earth do you expect to deliver huge volumes of
email without being able to resolve high volumes of DNS records?  And
cache that info damn well & quickly, despite whatever efforts your MTA
makes?  Silly, silly....

* Don't bother configuring your MTA (or the server it runs on) to
use or provide IDENT protocol services (RFC 1413).

* Mosix's process migration probably won't help because most MTAs fork
processes often, and they're short-lived.  The resources & time you
spend migrating the process can be much more than it's worth.

* Most file systems have some sort of structure similar to FFS's
"cylinder group".  Configure the stripe width on your RAID subsystem
such that all cylinder groups fall across *all* drives.  It's amazing
how often sysadmins screw this up.  "Hey, Dave, why does the
blinkenlight on that one RAID member blink so much more often than any
of the others?"

* Your boxes have room for more RAM?  And you haven't bought more yet?
What are you thinking?  Each forked process, open file descriptor,
buffered disk page, socket, pipe, DNS cache entry, ad infinitum takes
space, right?
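
[The RAID and 15K RPM points above can be put in rough numbers. A minimal
sketch: the per-drive random I/O rates are assumed round figures for
illustration, not measurements, and the model assumes seeks spread evenly
across the stripe.]

```python
def capacity_gb(drive_count: int, gb_per_drive: int) -> int:
    """Total raw capacity of a plain stripe."""
    return drive_count * gb_per_drive


def random_iops(drive_count: int, iops_per_drive: float) -> float:
    """Aggregate random I/O rate, assuming seeks spread evenly over all drives."""
    return drive_count * iops_per_drive


# Assumed figures, for illustration only.
IOPS_SLOW_DRIVE = 100  # an ordinary spinning disk
IOPS_15K_DRIVE = 180   # a 15K RPM disk

if __name__ == "__main__":
    layouts = [
        ("8 x 60GB", 8, 60, IOPS_SLOW_DRIVE),
        ("8 x 60GB, 15K RPM", 8, 60, IOPS_15K_DRIVE),
        ("60 x 8GB", 60, 8, IOPS_SLOW_DRIVE),
    ]
    for name, count, size_gb, iops in layouts:
        print(f"{name}: {capacity_gb(count, size_gb)} GB, "
              f"~{random_iops(count, iops):.0f} random I/Os per second")
```

Same capacity either way, but under these assumptions the 60-drive stripe
has 7.5 times as many heads seeking at once, which is more than the faster
spindles buy you.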

-Scott

------- End of Forwarded Message