Anatomy of a Certificate Outage [Epic Games]

By Scott Carter

Posted on June 4, 2021

Everyone knows that certificate outages are painful. Just ask anyone who has had to deal with the tangled aftermath of an expired certificate. There are so many unknowns. And so many unanticipated consequences. And that’s perhaps why, when it comes to measuring just how bad a given outage was, the details often get blurred by the post-traumatic stress. So it’s hard to get answers that quantify the impact. How long was the outage? Too long. How many systems were impacted? Too many. How much revenue was lost? Too much. But that particular type of denial won’t help anyone prevent a similar outage from happening again.

That’s why it’s so amazing that Epic Games was entirely transparent about a certificate outage that impacted the company on April 6. In the spirit of openness and goodwill, the company shared their outage story with the world. In their own words, “It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems.”

The company goes on to reveal in-depth details about why the outage happened, how big the impact was, and how long it took to fix. This is incredibly valuable information to help organizations everywhere understand why they need to take certificate management seriously. This level of sharing is downright…well…epic! And I applaud Epic Games for this heroic level of candor and altruism.

It’s bad enough when one system goes down. But what you will see in the story that Epic Games shares is that certificate outages often have unanticipated, critical impact on systems beyond those directly involved in the original outage. Epic Games outlines three areas of substantial impact, the initial outage triggered by the expired certificate and two more that followed from it:

  1. An expired certificate caused an outage across a large portion of internal back-end service-to-service calls and internal management tools
  2. Unexpected, significant increases of traffic to the Epic Games Launcher disrupted service for the Epic Games Launcher and content distribution features
  3. An incorrect version of the Epic Games Store website referencing invalid artifacts and assets was deployed as part of automatic scaling, degrading the Epic Games Store experience

It’s hard to imagine a more careful, complete summary of the impacts of a certificate outage. Many companies choose to overlook the peripheral impacts. In this case, over 25 critical staff members were pulled away from other pressing duties to repair the damage. Millions of connections were disrupted. And thousands of frustrated customers (the exact number was not quantified) were offered invalid content from the company’s online store. This brings concrete meaning to otherwise vague terms like lost revenue, diverted productivity, customer dissatisfaction and brand damage.

But the user irritation caused by the initial outage did not dissipate once the expired certificate was replaced. As I suspect is often the case, the impact lasted much longer than anyone could have predicted. While the expired certificate was detected and reissued in near-record time (approximately 37 minutes), the aftermath lingered on for nearly 5 hours afterwards. Here’s the exact timeline that Epic Games shared:

  • 12:00PM UTC – Internal certificate expired
  • 12:06PM UTC – Incident reported and incident management started
  • 12:15PM UTC – First customer messaging prepared
  • 12:21PM UTC – Confirmation of multiple large service failures by multiple teams
  • 12:25PM UTC – Confirmation that the certificate reissue process has started
  • 12:37PM UTC – Certificate is confirmed to be reissued
  • 12:46PM UTC – Confirmed recovery of some services
  • 12:54PM UTC – Connection Tracking discovered as an issue for Epic Games Launcher service
  • 1:41PM UTC – Epic Games Launcher service nodes restarted
  • 3:05PM UTC – Connection Tracking limits increased for Epic Games Launcher service
  • 3:12PM UTC – First signs of recovery of Epic Games Launcher service
  • 3:34PM UTC – Epic Games Store web service scales up
  • 3:59PM UTC – First reports of missing assets on Epic Games Store
  • 4:57PM UTC – Issue with mismatched versions of Epic Games Store web service discovered
  • 5:22PM UTC – Epic Games Store web service version corrected
  • 5:35PM UTC – Full recovery
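
The two figures quoted before the timeline (about 37 minutes to reissue the certificate, and nearly 5 hours of aftermath) can be checked directly against the timestamps. The short Python sketch below does that arithmetic. The helper name `parse_clock` is my own, and the sketch assumes all entries fall on the same day, as they do in the timeline above.

```python
from datetime import datetime


def parse_clock(clock: str) -> datetime:
    """Parse a timeline entry such as '12:37PM' (hypothetical helper).

    Only the time of day matters here, so the default date that
    strptime fills in is the same for every entry.
    """
    return datetime.strptime(clock, "%I:%M%p")


# Three timestamps taken from the timeline shared by Epic Games (all UTC).
expired = parse_clock("12:00PM")    # internal certificate expired
reissued = parse_clock("12:37PM")   # certificate confirmed to be reissued
recovered = parse_clock("5:35PM")   # full recovery

time_to_reissue = reissued - expired
aftermath = recovered - reissued
total = recovered - expired

print("Time to reissue:", time_to_reissue)  # 0:37:00
print("Aftermath:      ", aftermath)        # 4:58:00
print("Total incident: ", total)            # 5:35:00
```

In other words, the certificate itself accounted for roughly a tenth of the incident; the rest was knock-on damage.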

Now that is an afternoon that I would not wish on anyone. But congratulations on a successful resolution. So, how can you be sure that this won’t happen to your organization? First, as Epic Games now does, you need to recognize the critical importance of each and every digital certificate that acts as a machine identity anywhere in your network. You need to know how many you have, where they are being used, and…yes…when they will expire. Once you are armed with that information, you can safely automate the entire certificate lifecycle so that there will be no nasty surprises.
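
To make the "when will they expire" step concrete, here is a minimal Python sketch that reads a certificate's expiry date and flags it when renewal is due. It uses only the standard library. The function names, the 30-day threshold and the idea of polling a host directly are my own illustrative assumptions; this is not how Epic Games or any particular product does it, and a real inventory would also need to cover internal certificates that are not reachable this way.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' string of the certificate served by host:port.

    With the default context the handshake verifies the certificate, so a
    certificate that has already expired raises an SSL error here instead
    of returning a date. That is itself a useful signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Days between `now` (default: current UTC time) and the expiry date."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(
    not_after: str, now: Optional[datetime] = None, threshold_days: float = 30
) -> bool:
    """True when the certificate expires within `threshold_days`.

    The 30-day default is an arbitrary illustrative value.
    """
    return days_until_expiry(not_after, now) <= threshold_days
```

A scheduled job could call `needs_renewal(fetch_not_after(host))` for each host in an inventory and raise an alert on `True`, well before anyone is paged at noon UTC.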

Tired of worrying when your next certificate outage will hit? Find your certificates in minutes with Venafi as a Service.
