Microsoft memories… Bedlam DL3

Oct 02 2008

I started working at Microsoft in 1996. Being there in the ’90s was exciting! It was especially interesting doing QA work on so many internet products between 1996 and 2002. I guess I feel a little bit ‘superior’ having been there during the dot-com boom, while the internet (for the public) was still so ‘young’.

To start off this memory, you have to know the difference between “Reply” (‘little r’) and “Reply All” (‘big R’). ‘Little r’ means replying only to the person who sent the email, even if there were 20 other people on the thread. ‘Big R’ means replying to EVERYONE on the thread. You can imagine what would happen if EVERYONE hit “Reply All” on every message sent to a large distribution list.

In the ’90s at Microsoft, if someone did a ‘big R’ when they didn’t need to, someone usually made a reference to “Bedlam”. You could usually tell who’d been around for a while based on who laughed or smiled.

“Bedlam” was one specific day at Microsoft when (accidentally) 20,000+ email accounts and aliases got the same email message at the same time. That wouldn’t have been so bad, except that many hundreds (if not thousands) of morons replied to all and said something like “take me off this mailing list.” Which of course went to… 20,000+ email accounts and aliases. To which many hundreds (if not thousands) of morons replied to all… and so on. I recall the weekly company newsletter saying that on a normal business day MS handled around 4 million email messages, and that on the day of “Bedlam” over 15 million messages were handled in one HOUR.

 

Here’s a version right from the Microsoft TechNet Exchange blog:

==================
Well, Microsoft’s a pretty big organization.  We’ve got well over 100,000 mailboxes in our email infrastructure, and at times it can become rather cumbersome to manage all these.  One of the developers in our Internal Technologies Group (also known as ITG, basically the MIS department at Microsoft) was working on a new tool to manage communications with the various employees at Microsoft, and as a part of this tool, he created several distribution lists.  Each distribution list had about a quarter of the mailboxes in the company on it (so there were about 13,000 mailboxes on each list).  For whatever reason, the distribution lists were named “Bedlam DL<n>” (maybe the tool was named Bedlam?  I’m not totally sure).

Well the name of the lists certainly proved prophetic.

It all started one morning when someone looked at the list of DL’s they were on, and discovered that they were on this mysterious distribution list called “Bedlam DL3”.  So they did what every person should do in that circumstance (not!).

They sent the following email:

To:   Bedlam DL3
From: <User>
Subject: Why am I on this mailing list?  Please remove me from it.

Remember, there are 13,000 people on this mailing list.  So all of a sudden, all 13,000 people received the message.  And almost to a person, they used the “reply-all” command and sent:

To:   Bedlam DL3
From: <User>
Subject: RE: Why am I on this mailing list?  Please remove me from it.
Me too!

In addition, there were some really helpful people on the mailing list too:  They didn’t respond with just “Me Too!”  They responded with:

To:   Bedlam DL3
From: <User>
Subject: RE: Why am I on this mailing list?  Please remove me from it.
Stop using reply-all – it bogs down the email system.

You know what?  They were right – the company’s email system did NOT deal with this gracefully.

Why?  Well, you’ve got to know a bit more about how Exchange works internally.

First off, the original mail went to 13,000 users.  Assuming that 1,000 of those 13,000 users replied, that means that there are 1,000 replies being sent to those 13,000 users.  And it turns out that a number of these people had their email client set to request read receipts and delivery receipts.  Each read and delivery receipt causes ANOTHER email to be sent from the recipient back to the sender (all 13,000 recipients).  Assuming that 20% of the 1,000 users replying had read receipts or delivery receipts set, that meant that every one of the messages that they sent caused another message to be sent for every one of the 13,000 recipients. So how many messages were sent?

First there were the basic messages – that’s 13,000,000 messages.
Next there were the receipts – 200 users, 13,000 receipts each – that’s an additional 2,600,000 messages.
So about 15.6 MILLION messages were sent through the system.  In about an hour.
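
As a rough check on that arithmetic, here is a small Python sketch. The function and parameter names are made up for illustration; the inputs (13,000 recipients, 1,000 repliers, 20% with receipts turned on) are the assumptions stated above, not measured figures.

```python
def reply_storm_message_count(list_size, repliers, receipt_percent):
    """Count messages in a reply-all storm under the assumptions in the text."""
    # Every reply-all goes to everyone on the distribution list.
    replies = repliers * list_size
    # Some repliers request receipts; each of their messages triggers one
    # receipt from every recipient on the list.
    receipt_senders = repliers * receipt_percent // 100
    receipts = receipt_senders * list_size
    return replies, receipts, replies + receipts


replies, receipts, total = reply_storm_message_count(
    list_size=13_000, repliers=1_000, receipt_percent=20
)
print(f"replies:  {replies:,}")
print(f"receipts: {receipts:,}")
print(f"total:    {total:,}")
```

With those inputs the total works out to 15,600,000 messages.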

So at a minimum, 15,600,000 email messages will be delivered into people’s mailboxes.  But Exchange can handle 15,600,000 email messages EASILY.  There’s another problem that’s somewhat deeper.

An Exchange email message actually has TWO recipient lists – there’s the recipient list that the user sees in the To: line on their email message. This is called the P2 recipient list. This is the recipient list that the user typed in. There’s also a SECOND recipient list, called the P1 recipient list, that contains the list of ACTUAL recipients of the message. The P1 recipient list is totally hidden from the user; it’s used by the MTA to route email messages to the correct destination server.

Internally, the P1 list is kept as the original recipient list, plus all of the users on the destination servers.  As a result, the P1 list is significantly larger than the P2 list.
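
To make the two lists concrete, here is a toy Python model. It is only a sketch: the `Message` class, the `expand_p1` helper, and the tiny `DIRECTORY` are hypothetical names invented for this example, and real Exchange routing (per-server grouping, nested lists) is more involved than this.

```python
from dataclasses import dataclass, field

# Hypothetical directory: maps a distribution list name to its members.
DIRECTORY = {"Bedlam DL3": ["alice", "bob", "carol"]}


@dataclass
class Message:
    # P2: what the sender typed and what users see on the To: line.
    p2_recipients: list
    # P1: the hidden routing list of actual recipients.
    p1_recipients: list = field(default_factory=list)


def expand_p1(message, directory):
    """Build the P1 list: the original recipients plus expanded DL members."""
    p1 = list(message.p2_recipients)  # the original recipient list is kept
    for name in message.p2_recipients:
        p1.extend(directory.get(name, []))  # add members if name is a DL
    message.p1_recipients = p1
    return message


msg = expand_p1(Message(p2_recipients=["Bedlam DL3"]), DIRECTORY)
print(len(msg.p2_recipients), "P2 recipient;", len(msg.p1_recipients), "P1 recipients")
```

Even in this toy version, one visible recipient on the To: line turns into a longer hidden list, which is the growth described above.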

For the sake of argument, let’s assume that 1% of the recipients on each message (130) are on each server. So each message had 130 recipients in the P1 header, plus the original DL. Assuming 100 bytes per recipient email address, this bloats each email message by 13K. And this assumes that there are 0 bytes in the message body – just the headers involve 13K.

So those 15,000,000 email messages collectively consumed 195,000,000,000 bytes of bandwidth. Yes, 195 gigabytes of bandwidth bouncing around between the email servers.
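
The bandwidth estimate can be reproduced the same way. This is a back-of-the-envelope sketch using the figures assumed above (130 P1 recipients per message, 100 bytes per address, roughly 15,000,000 messages); the names are illustrative only.

```python
def header_bandwidth_bytes(p1_recipients, bytes_per_address, message_count):
    """Estimate bytes spent on P1 headers alone, ignoring message bodies."""
    header_bytes_per_message = p1_recipients * bytes_per_address
    return header_bytes_per_message, header_bytes_per_message * message_count


per_message, total_bytes = header_bandwidth_bytes(
    p1_recipients=130, bytes_per_address=100, message_count=15_000_000
)
print(f"per message: {per_message:,} bytes")
print(f"total:       {total_bytes:,} bytes (~{total_bytes / 1e9:.0f} GB)")
```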

Compounding this problem was a bug in the MTA that made it crash whenever it received a message with more than 8,000 recipients. But it crashed only AFTER processing up to 8,000 recipients. So 8,000 of the 13,000 recipients of the message would get it and 5,000 wouldn’t. When the MTA was restarted, it would immediately start processing the messages in its queue – and since the messages hadn’t been delivered yet, it would retry to deliver the message, sending to the SAME 8,000 recipients and crashing. And because of the way the Exchange store interacts with the MTA, even if we shut down the MTA, the messages would still queue up waiting on delivery to the MTA – shutting down the MTA wouldn’t fix the problem, it would only defer the problem (since the message store would immediately start delivering the queued messages into the MTA the second the MTA came back up).
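
The crash-and-retry loop is easier to see in a simulation. The Python below is a toy model, not the actual MTA logic: `MtaCrashed`, `process_message`, and `run_with_restarts` are hypothetical names, and the only behaviour modelled is what the paragraph above describes (crash after 8,000 recipients, then retry the same message from the start after a restart).

```python
from collections import Counter

CRASH_AFTER = 8_000  # recipients processed before the buggy MTA dies (from the text)


class MtaCrashed(Exception):
    """Stands in for the MTA process crashing mid-message."""


def process_message(recipients, copies_delivered):
    """Deliver to recipients in order, crashing once CRASH_AFTER are done."""
    for index, recipient in enumerate(recipients):
        if index == CRASH_AFTER:
            raise MtaCrashed("too many recipients")
        copies_delivered[recipient] += 1


def run_with_restarts(recipients, restarts):
    """Restart the MTA a number of times; the message stays queued on a crash."""
    copies_delivered = Counter()
    crashes = 0
    for _ in range(restarts):
        try:
            process_message(recipients, copies_delivered)
            break  # fully delivered, so the message leaves the queue
        except MtaCrashed:
            crashes += 1  # message is still queued and is retried from the top
    return copies_delivered, crashes


recipients = [f"user{i}" for i in range(13_000)]
copies, crashes = run_with_restarts(recipients, restarts=3)
print("crashes:", crashes)
print("copies to first recipient:", copies["user0"])
print("copies to last recipient:", copies["user12999"])
```

In this model every restart hands another copy to the same first 8,000 recipients while the remaining 5,000 never see the message, and the queue never drains.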

So what did we do to fix it? Well, the first thing that we did was to fix the MTA. And we tried to scrub the MTA’s message queues. This helped a lot, but there were still millions of copies of this message floating around the system.

It took about 2 days of constant work before the email system recovered from this one. When it was over, the team firefighting the crisis had t-shirts made with “I survived Bedlam DL3” on the front and “Me Too! (followed by the email addresses of everyone who had replied)” on the back.

To prevent anything like this happening in the future, we added a message recipient limit to Exchange – the server now has the ability to enforce a site-wide limit on the number of recipients in a single email message, which neatly prevents a repeat of this problem.
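
A recipient limit of that kind might look something like the sketch below. This is an illustration of the idea only, not Exchange's implementation or configuration: the function name, the exception, and the limit of 5,000 are all assumptions made up for the example.

```python
class TooManyRecipients(Exception):
    """Raised when a message exceeds the configured recipient limit."""


def enforce_recipient_limit(recipients, limit):
    """Reject a message up front if it addresses more recipients than allowed."""
    if len(recipients) > limit:
        raise TooManyRecipients(
            f"{len(recipients)} recipients exceeds the limit of {limit}"
        )
    return recipients


SITE_WIDE_LIMIT = 5_000  # hypothetical value, chosen only for this example

accepted = enforce_recipient_limit(["alice", "bob", "carol"], SITE_WIDE_LIMIT)

try:
    enforce_recipient_limit([f"user{i}" for i in range(13_000)], SITE_WIDE_LIMIT)
    rejected = False
except TooManyRecipients as error:
    rejected = True
    print("rejected:", error)
```

Rejecting the message before any delivery starts means an oversized send fails once, visibly, instead of fanning out to thousands of mailboxes.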

Larry Osterman

- from Me Too!
==================

 
 