Uplink Provider Issues in London
Incident Report for MaxCDN
Postmortem

Incident

Today, November 3rd, 2014, the MaxCDN London Data Center experienced partial outages at approximately 7:12 AM PST and 8:12 AM PST. Our Network Operations team (NOC) thoroughly investigated the incidents and attributed them to a Core Router failure at an upstream provider.

Timeline of Events

At 7:19 AM PST, MaxCDN’s monitoring system began alerting the Support and NOC teams to intermittent server response times at our London Data Center (LHR). The issue was immediately escalated to the NOC team. By 7:25 AM PST, traffic had been re-routed to the nearby MaxCDN Points of Presence (PoPs) in Frankfurt and Amsterdam. After 20 minutes of stability, the NOC team decided to bring LHR back online to prevent traffic overflow at those two PoPs. At 7:45 AM PST, LHR was officially back online and serving traffic.

At 8:13 AM PST, the upstream provider’s Core Router (CR.LHR1) failed a second time, and failover between the provider’s primary and secondary routing devices did not take place. By 8:20 AM PST, the NOC team had responded by taking LHR out of routing and provisioning. Although we followed standard security protocol in removing our BGP announcement, the upstream provider’s devices continued to announce our IPs in London. NOC immediately contacted the upstream provider to have the BGP announcement of MaxCDN IPs removed. At 8:37 AM PST, the upstream provider removed our announcement and traffic was able to fail over properly to the Frankfurt and Amsterdam PoPs. The Operations team then began balancing European traffic between Amsterdam and Frankfurt.

At 11:00 AM PST, we received official confirmation from the upstream provider that their faulty switch had been replaced. Subsequently, all traffic was routed back to the London PoP.

At 3:36 PM PST, the MaxCDN monitoring system alerted the Support and NOC teams that the prior issue with our upstream provider had returned. The NOC team then successfully re-routed traffic around the affected PoP (LHR) to keep traffic stable until the upstream provider has fully addressed the issue.
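The timeline is given in PST, while the status updates on this page are stamped in UTC. As a reading aid, the following minimal Python sketch converts the timeline entries to UTC and measures the gap between any two of them. It assumes every event occurred on November 3rd, 2014 and treats PST as a fixed UTC-8 offset. The names TIMELINE, to_utc and minutes_between are illustrative and are not part of any MaxCDN tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is modelled as a fixed UTC-8 offset.
PST = timezone(timedelta(hours=-8), "PST")

# (hour, minute) in 24-hour PST, with event labels paraphrased from the timeline.
TIMELINE = [
    ((7, 19), "monitoring alerts on LHR response times"),
    ((7, 25), "traffic re-routed to Frankfurt and Amsterdam"),
    ((7, 45), "LHR back online"),
    ((8, 13), "CR.LHR1 fails a second time"),
    ((8, 20), "LHR taken out of routing and provisioning"),
    ((8, 37), "upstream provider removes our BGP announcement"),
    ((11, 0), "upstream provider confirms faulty switch replaced"),
    ((15, 36), "issue returns, traffic re-routed around LHR"),
]


def to_utc(hour: int, minute: int) -> datetime:
    """Convert a PST clock time on 2014-11-03 to an aware UTC datetime."""
    local = datetime(2014, 11, 3, hour, minute, tzinfo=PST)
    return local.astimezone(timezone.utc)


def minutes_between(start: tuple, end: tuple) -> int:
    """Whole minutes elapsed between two (hour, minute) PST clock times."""
    delta = to_utc(*end) - to_utc(*start)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for (hour, minute), event in TIMELINE:
        print(f"{to_utc(hour, minute):%H:%M} UTC  {event}")
```

For example, minutes_between((8, 20), (8, 37)) returns 17, the interval during which the upstream provider kept announcing our IPs after LHR had been taken out of routing.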

What we are doing to prevent this in the future

We take great pride in the reliability and performance MaxCDN offers, and we are committed to constantly improving our services. To prevent outages like this in the future, we have been finalizing an upstream provider transition road map:

Recurring inadequate performance from one of our upstream providers has led us to revisit our available options. To increase our redundancy and thereby minimize the impact of such incidents, we are deploying two more European PoPs in the near future. We are also continuing to reinforce our failover protocol, which enabled our NOC and Support teams to respond quickly throughout today’s incidents.
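One check a failover protocol could include, given that our withdrawn announcement stayed visible upstream during the second outage, is a comparison between the prefixes we have withdrawn and the prefixes still observed from outside. The Python sketch below is purely illustrative and does not describe MaxCDN's actual tooling. The function name stale_announcements is hypothetical, the observed list is assumed to come from some external vantage point such as a looking glass, and the prefixes are reserved documentation ranges, not MaxCDN addresses.

```python
from ipaddress import ip_network


def stale_announcements(withdrawn, observed):
    """Return the withdrawn prefixes that are still being observed.

    Both arguments are iterables of prefix strings such as "192.0.2.0/24".
    Only exact prefix matches are reported; more-specific or covering
    routes are deliberately ignored to keep the sketch small.
    """
    withdrawn_nets = {ip_network(prefix) for prefix in withdrawn}
    observed_nets = {ip_network(prefix) for prefix in observed}
    return sorted(str(net) for net in withdrawn_nets & observed_nets)


if __name__ == "__main__":
    # Documentation prefixes (RFC 5737), used here as placeholders.
    withdrawn = ["192.0.2.0/24", "198.51.100.0/24"]
    observed = ["198.51.100.0/24", "203.0.113.0/24"]
    for prefix in stale_announcements(withdrawn, observed):
        print(f"still announced upstream: {prefix}")
```

A non-empty result would be the signal to escalate to the upstream provider, which is the step the NOC team took manually during this incident.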

For additional updates regarding this incident, please visit status.maxcdn.com.

Should you have any further questions, please do not hesitate to contact the support team directly at support@maxcdn.com.

Posted Nov 04, 2014 - 05:14 UTC

Resolved
Service is up and has been stable for several hours. We are continuing to work with our provider and to monitor service for any further problems. If you are still experiencing issues, please contact support at support@maxcdn.com. We will provide a post-mortem for this incident within the next 24 hours. We greatly appreciate your patience and apologize for any issues this may have caused.
Posted Nov 03, 2014 - 19:43 UTC
Identified
Our upstream provider has identified a failure with their Core Router CR.LHR1. Failover between the primary and secondary routing devices did not take place. After a manual failover by our upstream provider, we were able to pull our announcements and divert traffic to nearby locations. We are continuing to investigate this issue further. If you are experiencing any issues, please contact our support team at support@maxcdn.com. We apologize for any inconvenience we have caused you and greatly appreciate your patience regarding this matter.
Posted Nov 03, 2014 - 17:13 UTC
Investigating
We are currently experiencing networking issues with our uplink provider in London. This is related to the previous outage which occurred earlier this morning.

This is currently affecting our edges at this location. All traffic has been re-routed to nearby locations.

We will provide more information about network health as soon as it is available.
Posted Nov 03, 2014 - 16:26 UTC