Today, November 3rd, 2014, the MaxCDN London Data Center experienced partial outages at approximately 7:12 AM PST and 8:12 AM PST. Our Network Operations team (NOC) investigated the incidents thoroughly and attributed them to a Core Router failure at an upstream provider.
Timeline of Events
At 7:19 AM PST, MaxCDN’s monitoring system began alerting the Support and NOC teams to intermittent server response times at our London Data Center (LHR). The issue was immediately escalated to the NOC team.
By 7:25 AM PST, traffic had been re-routed to the nearby MaxCDN Points of Presence (PoPs) in Frankfurt and Amsterdam. After 20 minutes of stability, the NOC team decided to bring LHR back online to prevent traffic overflow at those two PoPs.
At 7:45 AM PST, LHR was officially back online and serving traffic.
At 8:13 AM PST, the upstream provider’s Core Router (CR.LHR1) failed a second time. Moreover, failover between the upstream provider’s primary and secondary routing devices did not take place.
By 8:20 AM PST, the NOC team had responded by taking LHR out of routing and provisioning. Although we followed standard security protocol in removing our BGP announcement, the upstream provider’s devices continued to announce our IPs in London. NOC immediately contacted the upstream provider to have the BGP announcement of MaxCDN IPs removed.
At 8:37 AM PST, the upstream provider removed our announcement, and traffic was able to fail over properly to the Frankfurt and Amsterdam PoPs. The Operations team then began balancing European traffic between Amsterdam and Frankfurt.
At 11:00 AM PST, we received official confirmation from the upstream provider that their faulty switch had been replaced. Subsequently, all traffic was routed back to the London PoP.
At 3:36 PM PST, the MaxCDN monitoring system alerted the Support and NOC teams that the prior issue with our upstream provider had returned. The NOC team then successfully re-routed traffic around the affected PoP (LHR) to ensure traffic stability until the issue is fully addressed by the upstream provider.
What we are doing to prevent this in the future
We take great pride in the reliability and performance MaxCDN offers, and we are committed to continually improving our services. To prevent outages like this in the future, we have been finalizing an upstream provider transition road map:
Recurrent patterns of inadequate performance from one of our upstream providers have led us to revisit our available options. To increase our redundancy and thereby minimize the impact of such incidents, we are deploying two more European PoPs in the near future. We are also continuing to reinforce our failover protocol, which enabled our NOC and Support teams to respond quickly throughout today’s incidents.
For additional updates regarding this incident, please visit status.maxcdn.com.
Should you have any further questions, please do not hesitate to contact the support team directly at: email@example.com