October 17th game updates outage postmortem

Last Wednesday we had an outage with downloading game updates. This post is a postmortem about what happened and also provides background on how game distribution works currently in BAR.

Author: Marek

Last updated: October 31, 2023

Summary

On Wednesday 17th October, for a total duration of ~2h, we had an outage of the infrastructure that serves game updates, which prevented players from downloading the versions of the game needed to play in multiplayer matches. For an additional 3 days, changes from the source code repository were not propagated to players.

The root cause of the issue was a bug introduced by our content delivery network (CDN) provider, BunnyCDN, in the evaluation of the Edge Rules that we use for redirecting traffic between URLs. The broken URL redirect caused a stale version of the game versions index file to be served, dating as far back as April 2023, which left the lobby unable to find and download the newer versions of the game requested by the game servers.

Detailed explanation

Background

Rapid

Each player in a match must have exactly the same version of the game. Developers submit changes multiple times a day, and they are automatically pushed to all players. It would be very inefficient if the whole game had to be redownloaded after every single change. In addition, different lobbies might be using different versions of the game, so multiple versions of the game need to be installed at the same time.

This problem is solved by using a content-addressed storage format. Each file in the game repository is hashed and stored compressed in the pool/ directory under its hash. Then a list of all the files in a given game version, together with their hashes, is put into a file, hashed, and written compressed into packages/{hash}.sdp. To figure out which sdp file corresponds to which game version, a versions index file, versions.gz, is created that maps each game version to its sdp hash. It’s the freshness of this versions index file that caused the issue during this outage.
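The scheme above can be illustrated with a minimal in-memory sketch. It is not the actual Rapid on-disk format: the use of SHA-256, the JSON manifest, the dictionaries standing in for pool/, packages/ and versions.gz, and all function and version names are assumptions made for illustration only.

```python
import gzip
import hashlib
import json


def store_blob(pool: dict, data: bytes) -> str:
    """Store data compressed under its content hash; identical content is stored once."""
    digest = hashlib.sha256(data).hexdigest()  # hash algorithm is an assumption
    pool.setdefault(digest, gzip.compress(data))
    return digest


def publish_version(pool: dict, packages: dict, versions: dict, tag: str, files: dict) -> str:
    """Add one game version: pool entries, a package manifest, and an index entry."""
    # Manifest: every file path in this version mapped to the hash of its content.
    manifest = {path: store_blob(pool, data) for path, data in sorted(files.items())}
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
    # The manifest itself is hashed; that hash plays the role of the sdp name.
    package_hash = hashlib.sha256(manifest_bytes).hexdigest()
    packages[package_hash] = gzip.compress(manifest_bytes)
    # The versions index maps a game version to its package hash.
    versions[tag] = package_hash
    return package_hash


pool, packages, versions = {}, {}, {}
# Hypothetical version tags and file contents.
v1 = publish_version(pool, packages, versions, "test-1", {"a.lua": b"x = 1", "b.lua": b"y = 2"})
v2 = publish_version(pool, packages, versions, "test-2", {"a.lua": b"x = 1", "b.lua": b"y = 3"})
```

The two versions share the unchanged a.lua, so the pool holds three blobs rather than four, and a client that already has the first version only needs to fetch one new blob, one new package file and the updated index.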

CDN

Propagating a new version of the game from the source repository is a multi-step process:

1. The build server periodically pulls code from the main game repo. It uses RapidTools to generate new pool files for changed files and a new sdp file for the game version, and it updates the versions index file.

   It is directly from this server that players fetch updates when they select the “no-CDN” option in the launcher.
2. Periodically, we synchronize the versions from the build server to the BunnyCDN SSD Edge Storage main region in Germany using a bespoke syncer.
3. BunnyCDN asynchronously replicates the data in the background to the other edge storage regions in New York, Los Angeles and Singapore.
4. Periodically, we use a script to copy versions.gz to fresh/versions_YYYYMMDDTHHMMSS.gz in the main edge storage region and update the Edge Rule to redirect from https://repos-cdn.beyondallreason.dev/byar/versions.gz to the newly created https://repos-cdn.beyondallreason.dev/byar/fresh/versions_YYYYMMDDTHHMMSS.gz.

   The edge rule forwards traffic to the fresh file only when the User-Agent header doesn’t contain the latestreplicated substring. The script uses that to fetch the real versions.gz, and it runs in Germany to make sure it contacts the main storage region.
5. BunnyCDN quickly (within seconds) propagates the Edge Rule update to all the CDN edge servers.
6. Users and BAR game servers contact the network of tens of CDN edge servers to fetch files. Edge servers apply edge rules and fetch files from cache or from the closest edge storage region. If the file is not present in the closest edge storage region, they do a second lookup in the main storage region.

   The cache in the edge servers is configured to expire only after 1 year because almost all the files are immutable. An additional edge rule overrides the cache expiration for versions.gz and a few other files to 0.
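The intended behavior of the two edge rules can be written down as a few pure functions. This is only a model of the rules as described in this post, not BunnyCDN's configuration format or API; the function names, the constants and the example User-Agent strings are hypothetical.

```python
from datetime import datetime, timezone

INDEX_PATH = "/byar/versions.gz"
BYPASS_MARKER = "latestreplicated"  # substring the refresh script puts in its User-Agent
ONE_YEAR_SECONDS = 365 * 24 * 3600


def fresh_path(now: datetime) -> str:
    """Build a fresh/versions_YYYYMMDDTHHMMSS.gz path for the given time."""
    return now.strftime("/byar/fresh/versions_%Y%m%dT%H%M%S.gz")


def apply_redirect(path: str, user_agent: str, target: str) -> str:
    """Send index requests to the fresh copy unless the caller opts out via User-Agent."""
    if path == INDEX_PATH and BYPASS_MARKER not in user_agent:
        return target
    return path


def cache_ttl_seconds(path: str) -> int:
    """Immutable files are cached for a year; the mutable index is never cached."""
    return 0 if path == INDEX_PATH else ONE_YEAR_SECONDS


target = fresh_path(datetime(2023, 10, 17, 18, 50, 0, tzinfo=timezone.utc))
player_path = apply_redirect(INDEX_PATH, "some-launcher", target)
script_path = apply_redirect(INDEX_PATH, "refresher latestreplicated", target)
```

In this model a player request for the index resolves to the timestamped copy, while the script's request resolves to the real versions.gz. During the incident, both of these properties were violated at different times.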

Why Edge Rule at all?

The issue is that the replication of data in step 3 is asynchronous and can lag for an unknown amount of time. For a few months now I have been running a dedicated prober that measures the replication lag: within just the last 14 days, the Singapore region’s lag once reached 39h, and just today Los Angeles lagged by 1h. That’s bad enough that we can’t ignore it for our needs.

The mitigation is to use the edge rule, whose replication is very quick, and to rely on the behavior that CDN edge servers fall back to the main storage region when a file is not found in the closest one.
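A small simulation shows why the redirect helps. Storage regions are modeled as plain dictionaries and the file contents as labels; the fetch function and the file names are hypothetical, and real edge servers also involve caching, which is left out here.

```python
def fetch(path: str, closest_region: dict, main_region: dict):
    """Read from the closest region, then fall back to the main region on a miss."""
    if path in closest_region:
        return closest_region[path]
    return main_region.get(path)


# The main region already has the new index and its freshly named copy.
main = {"versions.gz": "index-v2", "fresh/versions_B.gz": "index-v2"}
# A lagging region still holds the old index and has not received the copy yet.
lagging = {"versions.gz": "index-v1"}

direct = fetch("versions.gz", lagging, main)            # hit in the lagging region
via_rule = fetch("fresh/versions_B.gz", lagging, main)  # miss, falls back to main
```

Requesting versions.gz directly returns the stale index because the lagging region has a copy of it. The freshly named file does not exist there yet, so the lookup misses and the main region answers with the current index.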

Timeline

Oct 17th (times in UTC)

18:50 - Outage begins when the edge rule redirect stops working
18:57 - First report in #main about issue
19:02 - I confirm from the prober that there is some issue with the CDN
19:14 - Workaround published in #main: players can temporarily use the no-CDN launcher option to mitigate
19:40 - I finish investigating the root cause of the issue
19:48 - Opened support ticket with CDN support documenting my findings
19:51 - Reply from CDN that I can try purging the affected file path from CDN edge cache
19:54 - I purge the file cache and CDN starts serving a fresher file version even without a redirect present. That mitigates the issue.
A bunch of commits land in git repository and are propagated to CDN
21:18 - Edge rule stops working in the opposite direction, outage begins again. 100% of traffic is redirected, but because it’s 100%, automation that updates the redirect doesn’t work correctly and users are pointed at the ~1h old version of the index file.
21:27 - First mention of issue again in #main (I’m under the shower ¯\_(ツ)_/¯)
21:54 - I start working on the issue again
22:09 - I inform support that now it’s broken in the opposite direction
22:13 - I manually update the edge rule to point at a newer version, which mitigates the issue
22:55 - Support team passes the issue to the development team

Oct 20

The permanent fix on the CDN side rolls out and I’m able to restart propagation of updates from game repo to CDN.

Root cause

With all that background, we can summarize the root cause of the incident.

For the first portion of the incident, the redirect edge rule from step 4 completely stopped working and users were downloading the original versions.gz file. For some reason, the edge rule described in step 6 that overrides the cache expiration to 0 also didn’t work, so users were getting a version that was multiple months old. This was mitigated manually by purging versions.gz from the cache via the BunnyCDN control panel.

After some configuration change by support, the second portion of the incident started: the redirect began applying to 100% of the traffic, ignoring the User-Agent header used by the script from step 4. This broke the script, as it was no longer able to determine that something needed to be copied. As a result, the redirect kept pointing at a versions.gz file that was a few commits old, causing the issue on some game servers.
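The second failure mode can be reproduced in a toy model. How the real script decides that a copy is needed is not described in this post; the comparison in refresh below (the real index versus what players are currently served) is an assumption, as are the Cdn class, the honor_user_agent flag and the file contents.

```python
class Cdn:
    """Toy CDN: one set of files plus a single redirect rule for versions.gz."""

    def __init__(self, honor_user_agent: bool):
        self.files = {}
        self.redirect_target = None
        self.honor_user_agent = honor_user_agent  # False models the buggy rule

    def get(self, path: str, user_agent: str) -> str:
        bypass = self.honor_user_agent and "latestreplicated" in user_agent
        if path == "versions.gz" and self.redirect_target and not bypass:
            path = self.redirect_target
        return self.files[path]


def refresh(cdn: Cdn, stamp: str) -> bool:
    """Copy versions.gz to a fresh name and repoint the rule if the index changed."""
    real = cdn.get("versions.gz", "refresher latestreplicated")
    served = cdn.get("versions.gz", "player")
    if real == served:
        return False  # nothing new to publish, as far as the script can tell
    target = f"fresh/versions_{stamp}.gz"
    cdn.files[target] = real
    cdn.redirect_target = target
    return True


def simulate(honor_user_agent: bool) -> str:
    cdn = Cdn(honor_user_agent)
    cdn.files = {"versions.gz": "index-v1", "fresh/versions_T1.gz": "index-v1"}
    cdn.redirect_target = "fresh/versions_T1.gz"
    cdn.files["versions.gz"] = "index-v2"  # new commits reach the main region
    refresh(cdn, "T2")
    return cdn.get("versions.gz", "player")
```

With the rule honoring the User-Agent, simulate(True) ends with players receiving index-v2. With the rule redirecting every request, simulate(False) leaves them on index-v1: the script's own request is redirected to the old copy, so it never sees a difference and never updates the rule.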

Resolution

The final manual mitigation was to:

Purge cache once more
Manually specify the version of file to point at in the edge rule
Disable the CDN replication automation

We then waited 3 days for BunnyCDN to release a software update that fixed the bug, after which we re-enabled automated pushes.

Lessons learned

Things that went well

BunnyCDN support proves yet again to be dependable with a very short response time.
I finished and deployed the prober and monitoring at the beginning of September, and it was very useful for quickly estimating the impact and helping the investigation.

Things that went badly

The outage was detected by players; we don’t have any automated alerting at the moment.
BunnyCDN rolled out a change that completely broke the edge rule used by us, and the change was applied globally instantly.

Where we got lucky

I was around and able to investigate and work on the incident almost immediately.
Some players on their own quickly came up with the workaround to use “no-CDN” option in the launcher which reduced the impact. I will make sure that they are not on that option permanently, because it has negative consequences for our infrastructure, but it was a good workaround.

What’s next?

It would be great to have more contributors interested in this area so that when things go wrong we have better coverage timezone-wise and don’t need to depend on luck. Having a better monitoring and alerting setup for things in BAR infra would also allow us to make the players’ experience better.

We currently don’t plan to stop using BunnyCDN, as it is still the best fit for what we need. Outages happen, and overall they are dependable. In the future we might want to replace the current redirect edge rule with a more advanced solution, based e.g. on Cloudflare Workers, that would offer more flexibility and control over the freshness guarantees and allow us to implement more exotic features.

It’s not exactly related to this issue, but the overall CDN propagation can be streamlined, especially from the latency perspective. It’s currently a set of crons whose delays can add up to 40m; a more event-driven solution would allow us to propagate updates from the game repo in under 1m, making any hotfixes land much quicker.