Service Disruption: Latency Increase, Mail Delays or Event Delays

Incident Report for SendGrid

Resolved

We have successfully mitigated last week’s service disruption, and end-to-end delivery times are back within normal ranges.

Following are details on the root cause:
-Last week we implemented a series of planned networking changes in our data centers ahead of a scheduled network upgrade.
-As part of the planned changes, we shifted our traffic away from the data center where the changes were to take place.
-We have shifted traffic in this manner many times this year without issue.
-This time, the added load on one of our data centers was high enough to expose a previously unknown scaling bottleneck in the mail sending system.
-This scaling bottleneck has been isolated to a single component within our software system.

Our engineering team rolled out a series of changes on September 1 to mitigate the root cause of the disruption.

Here are the changes we implemented:
-We rolled out load balancing improvements across our data centers.
-These changes led to a substantial improvement in the processing of our traffic.
-All new traffic is now being processed at expected levels of performance.
-We continue to be in a steady state of performance as email volumes have increased throughout the day.
-Previously delayed emails have been sent.

We deeply apologize for the inconvenience this has caused. Please contact our support team with any questions: https://support.sendgrid.com/hc/en-us.

Posted Sep 06, 2016 - 07:51 PDT

Update

We’ve continued to closely monitor our load balancing improvements throughout the day. Mail has continued to process at the performance level you expect from our service.

We’ll continue to monitor our systems over the weekend. Expect our next update next week, when we plan to declare this outage resolved.

Posted Sep 02, 2016 - 15:31 PDT

Update

We are pleased that end-to-end delivery times are back within normal ranges.

Here are the details on the changes we implemented:
- We have rolled out load balancing improvements across our data centers.
- These changes have led to a substantial improvement in the processing of our traffic, and all new traffic is being processed at expected levels of performance.
- We continue to be in a steady state of performance as email volumes have increased throughout the day.
- Previously delayed emails have been sent.

We know that email delivery is critical to your business and sincerely apologize for the inconvenience this has caused.

Posted Sep 02, 2016 - 12:36 PDT

Update

The load balancer changes we've implemented have worked, and we are seeing even distribution of load across our data centers. Mail is continuing to dequeue, and we expect that backlogged mail will be dequeued in the next few hours.

Posted Sep 02, 2016 - 09:30 PDT

Update

We’d like to provide an update on the changes we rolled out overnight to address the root cause of the service disruption:

- We rolled out new load balancing improvements across our data centers overnight.
- These changes have led to a substantial improvement in the processing of our traffic, and all new traffic is being processed at expected levels of performance.
- Previously delayed emails are dequeuing.

The changes we applied look promising, and we will continue to update the Status Page throughout the day as sending volume rises to peak levels.

Posted Sep 02, 2016 - 07:31 PDT

Update

Status update from Craig Kaes, SVP of Engineering.

I wanted you to know directly from me that all of us at SendGrid are deeply sorry for the email delivery issues we've been having and for the resulting adverse impact to your business.

I am reaching out to provide a more complete update on the service disruptions you’re experiencing and include more details on the root cause and our next steps. I want to keep you as up to date as possible on the latest details at a more frequent cadence. You can expect more updates as we work around the clock to resolve these issues.

Here are the details on why you’re experiencing these disruptions with our service:
-Last week we implemented a series of planned networking changes in our data centers ahead of a scheduled network upgrade.
-As part of the planned changes, we shifted our traffic away from the data center where the changes were to take place.
-We have shifted traffic in this manner many times this year without issue.
-This time, the added load on one of our data centers was high enough to expose a previously unknown scaling bottleneck in the mail sending system.
-This scaling bottleneck has been isolated to a single component within our software system.

Next Steps:
-We’re working on a number of solutions to address the scaling bottleneck and we believe we have isolated the problem.
-The solutions are related to the way we balance the load of the email flow across multiple servers and to the way we cache information. We are putting these changes into production over the next 12 hours.
-Independent of these fixes, because of our email volume patterns, you should see improvements starting this evening as the load subsides and the sending system "catches up" or dequeues.
-The next key milestone will be tomorrow morning when volume picks back up again and we are able to assess if the fixes we put into place have worked.
-You can expect an update tomorrow with additional details once we’ve made that assessment.

We have our best engineers working on the problem overnight, and if for some reason the fix we put into place doesn’t work as we expect, we have three additional, specific optimizations ready to test right behind it.

Again, on behalf of SendGrid, I want to apologize for the inconvenience this is causing.

Sincerely,
Craig Kaes
SVP of Engineering

Posted Sep 01, 2016 - 18:35 PDT

Update

We are currently working through outstanding email queues across some of our data centers. Some customers may experience delays as we balance the load of incoming email requests.

We apologize for the inconvenience this may cause while we continue to stabilize mail flow. <3 IG

Posted Sep 01, 2016 - 08:04 PDT

Update

Our US West Coast Datacenter is stable at this time. Our Ops team will continue to monitor mail traffic accordingly.

Ensuring your mail is delivered quickly remains our top priority - should there be any changes, additional status updates will be posted here. -SE

Posted Aug 31, 2016 - 20:58 PDT

Update

Our US West Coast Datacenter has been accepting incoming mail successfully since our last update. After additional review, we have taken it offline to ensure stability. Users within this region may see increased latency until it's back online.

Users may still experience mail delays; our Ops team is continuing to closely monitor mail traffic. Ensuring your mail is delivered quickly remains our top priority. We'll post again as soon as we have more information. -RC

Posted Aug 31, 2016 - 18:11 PDT

Update

After closely monitoring traffic, we’ve re-enabled our US West Coast Datacenter. Users within this region should begin to see normal latency.

Users may still experience mail delays; our Ops team is continuing to closely monitor mail traffic. Ensuring your mail is delivered quickly remains our top priority. Updates to follow. -RC

Posted Aug 31, 2016 - 16:08 PDT

Update

US West Coast users will continue to see increased latency.

Due to an increase in traffic after our last update, users may experience mail delays for a small amount of traffic while our mail queue normalizes. Our Ops team is continuing to monitor mail traffic, and we’ll have additional updates to follow. -RC

Posted Aug 31, 2016 - 09:26 PDT

Update

Our Ops team has brought the email queue size back to normal. Please note that US West Coast users might still see delays in connecting to our endpoints. -AC

Posted Aug 31, 2016 - 02:57 PDT

Update

Our queues are still decreasing. US West Coast customers may still see increased connection times, and some customers may experience a delay between the processed and delivered events. Updates to follow. -AC

Posted Aug 30, 2016 - 22:51 PDT

Update

As of 3:21PM MT, all new mail is being delivered with no delay. US West Coast customers may still see increased latency between them and SendGrid when sending mail, but once it arrives at the data center it will be processed as normal.

Mail sent from 9:42AM to 3:21PM MT is still dequeuing from our Midwest Datacenter. -RC

Posted Aug 30, 2016 - 14:32 PDT

Monitoring

As of 2:50PM MT, new mail in the US region should not be affected by delays, but US West Coast latency may still be higher than normal. Delays within our Midwest Datacenter are improving, and we’ll continue to monitor closely until they return to normal.

We’ll continue to provide you status updates as we hear from our Ops team. Thanks for everyone’s patience. -RC

Posted Aug 30, 2016 - 14:13 PDT

Update

Our Ops team is continuing to investigate the root cause of the alerts we received this morning regarding our US West Coast Data Center.

Due to the increased traffic coming to our Midwest Datacenter, mail has begun to queue, causing delays longer than 30 minutes. We are working as quickly as possible to dequeue this mail. We deeply apologize for the inconvenience. - RC

Posted Aug 30, 2016 - 11:31 PDT

Investigating

Our Ops team began receiving alerts a few moments ago regarding our US West Coast Data Center. In order to mitigate issues, traffic has been moved to our Midwest Data Center. Users in this region may experience increased connection times. Mail that was sent through this Data Center from around 9:45AM MT to 9:52AM MT may experience a delay. We’re working on this with the highest priority and will provide updates as soon as possible. We apologize for the inconvenience. - RC

Posted Aug 30, 2016 - 09:13 PDT