Last Wednesday we had an outage with downloading game updates. This post is a postmortem about what happened and also provides background on how game distribution works currently in BAR.
On Wednesday 17th October, for a total of ~2h, we had an outage of the infrastructure serving game updates, which prevented players from downloading the versions of the game needed to play in multiplayer matches. For an additional 3 days, changes from the source code repository were not being propagated to players.
The root cause of the issue was a bug introduced by our content delivery network (CDN) provider BunnyCDN in the evaluation of the Edge Rules that we use for redirecting traffic between URLs. The broken URL redirect caused a stale copy of the game versions index file to be served, dating back as far as April 2023, which left the lobby unable to find and download the newer versions of the game requested by the game servers.
Each player in a match must have exactly the same version of the game. Developers submit changes multiple times a day, and they are automatically pushed to all players. It would be very inefficient if the whole game had to be redownloaded after every single change. In addition, different lobbies might be using different versions of the game, so multiple versions of the game need to be installed at the same time.
This problem is solved by using a content-addressed storage format. Each file in the game repository is hashed and stored compressed in the pool/ directory by its hash. Then a list of all the files with their hashes in a given game version is put into a file, hashed, and written compressed into packages/{hash}.sdp. To figure out which sdp file corresponds to which game version, a versions index file versions.gz is created that maps from game version to the sdp hash. It’s the freshness of this versions index file that caused the issue during this outage.
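The layout described above can be sketched in a few lines of Python. This is a simplified illustration and not the actual RapidTools on-disk format: the use of SHA-256, the JSON manifest, the in-memory dictionaries standing in for directories, and the name `publish_version` are all assumptions made for the example.

```python
import gzip
import hashlib
import json


def publish_version(files, version, pool, packages, versions):
    """Store one game version in a content-addressed layout (illustrative only).

    files:    {relative path: bytes} contents of the game repository
    pool:     {content hash: compressed bytes}, stands in for pool/
    packages: {package hash: compressed manifest}, stands in for packages/{hash}.sdp
    versions: {version name: package hash}, stands in for versions.gz
    """
    manifest = {}
    for path, data in sorted(files.items()):
        digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        # Identical content is stored once, no matter how many versions use it.
        pool.setdefault(digest, gzip.compress(data))
        manifest[path] = digest

    # The list of files and hashes is itself hashed and stored by that hash.
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
    package_hash = hashlib.sha256(manifest_bytes).hexdigest()
    packages[package_hash] = gzip.compress(manifest_bytes)
    versions[version] = package_hash
    return package_hash


pool, packages, versions = {}, {}, {}
v1 = {"units/tank.lua": b"hp = 100", "maps/info.lua": b"size = 16"}
v2 = {"units/tank.lua": b"hp = 120", "maps/info.lua": b"size = 16"}
publish_version(v1, "test-1", pool, packages, versions)
publish_version(v2, "test-2", pool, packages, versions)

# Two versions with two files each, but only three pool entries:
# the unchanged maps/info.lua is shared between both versions.
print(len(pool), len(packages), len(versions))  # 3 2 2
```

A client that already has "test-1" installed only needs to download the one pool entry it is missing to get "test-2", and both versions can stay installed side by side.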
Propagating a new version of the game from the source repository is a multi-step process:
1. The build server pulls code from the main game repo periodically. It uses RapidTools to generate new pool files for changed files, a new sdp file for the game version, and updates the versions index file.
   It is directly from this server that players fetch updates when selecting the “no-CDN” option in the launcher.
2. Periodically we synchronize the versions from the build server to the BunnyCDN SSD Edge Storage main region in Germany using a bespoke syncer.
3. BunnyCDN asynchronously replicates the data in the background to the other edge storage regions in New York, Los Angeles and Singapore.
4. Periodically we use a script to copy versions.gz to fresh/versions_YYYYMMDDTHHMMSS.gz in the main edge storage region and update the Edge Rule to redirect from https://repos-cdn.beyondallreason.dev/byar/versions.gz to the newly created https://repos-cdn.beyondallreason.dev/byar/fresh/versions_YYYYMMDDTHHMMSS.gz.
   The edge rule forwards the traffic to the fresh file only when the User-Agent header doesn’t contain the latestreplicated substring. The script uses that to fetch the real versions.gz, and it runs in Germany to make sure it’s contacting the main storage region.
5. BunnyCDN quickly (within seconds) propagates the Edge Rule update to all the CDN edge servers.
6. Users and BAR game servers contact the network of tens of CDN edge servers to fetch files. Edge servers apply edge rules and fetch files from cache or the closest edge storage region. If the file is not present in the closest edge storage region, they do a second lookup in the main storage region.
   Cache in the edge servers is configured to expire only after 1 year because almost all the files are immutable. There is an additional edge rule configured that overrides the cache expiration for versions.gz and a few other files to 0.
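The two edge rules involved (the redirect from step 4 and the cache override from step 6) can be modelled as two small functions. This is only a sketch of the intended behavior as described above, not BunnyCDN's actual rule engine or configuration syntax. The function names, the example User-Agent strings, the timestamp, and the pool path are hypothetical.

```python
VERSIONS_PATH = "/byar/versions.gz"
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def resolve_path(path, user_agent, fresh_path):
    """Step 4: send versions.gz requests to the fresh copy unless the caller opts out."""
    if path == VERSIONS_PATH and "latestreplicated" not in user_agent:
        return fresh_path
    return path


def cache_ttl_seconds(path):
    """Step 6: immutable files are cached for a year, the index is not cached."""
    # The real rule also covers "a few other files"; only versions.gz is modelled.
    return 0 if path == VERSIONS_PATH else ONE_YEAR_SECONDS


fresh = "/byar/fresh/versions_20000101T000000.gz"  # hypothetical timestamp

# A regular client is sent to the timestamped copy.
print(resolve_path(VERSIONS_PATH, "some-lobby/1.0", fresh))
# The script identifies itself and gets the real versions.gz.
print(resolve_path(VERSIONS_PATH, "script latestreplicated", fresh))
# The index is never cached; everything else is cached for a year.
print(cache_ttl_seconds(VERSIONS_PATH), cache_ttl_seconds("/byar/pool/example.gz"))
```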
The issue is that replication of data in step 3 is asynchronous and can lag for an unknown duration of time. For a few months I have had a dedicated prober that measures the replication lag, and within just the last 14 days, the Singapore region lag once reached 39h and just today Los Angeles lagged for 1h. That’s bad enough that we can’t ignore it for our needs.
The mitigation is to use the edge rule, whose replication is very quick, and depend on the behavior that CDN edge servers fall back to the main storage region when a file is not found in the closest one.
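To see why this works, here is a toy model of the storage lookup, with dictionaries standing in for storage regions. The `fetch` function, the `index@old` / `index@new` values, and the `T2` timestamp placeholder are all made up for the illustration.

```python
def fetch(path, closest_region, main_region):
    """Look in the closest edge storage region first, then fall back to main."""
    if path in closest_region:
        return closest_region[path]
    return main_region.get(path)  # None models a 404


# The main region (Germany) already has the new index and its timestamped copy.
main = {
    "versions.gz": "index@new",
    "fresh/versions_T2.gz": "index@new",
}
# A lagging replica still holds the old index and has not received the copy yet.
lagging = {"versions.gz": "index@old"}

# The mutable name exists in the replica, so the stale content is served.
print(fetch("versions.gz", lagging, main))           # index@old
# The new, unique name is missing in the replica, so main answers.
print(fetch("fresh/versions_T2.gz", lagging, main))  # index@new
```

Because every published index gets a new name, a lagging region can only ever be missing the file, never hold an outdated copy of it, and a missing file triggers the fallback.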
Oct 17th (times in UTC)
Oct 20
With all that background, we can summarize the root cause of the incident.
For the first portion of the incident, the redirect edge rule from step 4 just completely stopped working and users were downloading the original versions.gz file. For some reason the edge rule described in step 6 that overrides the cache expiration to 0 also didn’t work, and users were getting a version that was multiple months old. This was mitigated manually by purging versions.gz from the cache via the BunnyCDN control panel.
After some configuration change by support, the second portion of the incident started, where the redirect started working for 100% of the traffic, ignoring the User-Agent header used by the script from step 4. This broke the script, as it was no longer able to determine that something needed to be copied. As a result, the redirect kept pointing at a versions.gz file that was a few commits old, causing issues on some game servers.
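The second failure mode can be reproduced in a small simulation. The post does not say exactly how the script decides that a copy is needed, so the comparison in `needs_publish` is an assumption, as are the function names, the `index@old` / `index@new` values, and the `T1` timestamp placeholder.

```python
def edge_fetch(storage, path, user_agent, redirect_to, honor_opt_out=True):
    """Serve a path from storage after applying the redirect rule."""
    opted_out = honor_opt_out and "latestreplicated" in user_agent
    if path == "versions.gz" and not opted_out:
        path = redirect_to
    return storage[path]


def needs_publish(storage, redirect_to, honor_opt_out):
    """Hypothetical change check: compare the real index with the published copy."""
    real = edge_fetch(storage, "versions.gz", "script latestreplicated",
                      redirect_to, honor_opt_out)
    published = storage[redirect_to]
    return real != published


# The real index has moved on, the published copy is still the old one.
storage = {"versions.gz": "index@new", "fresh/versions_T1.gz": "index@old"}

# Rule behaves as designed: the script sees the real index and detects the change.
print(needs_publish(storage, "fresh/versions_T1.gz", honor_opt_out=True))   # True
# Rule ignores the User-Agent: the script is redirected to the published copy,
# compares it with itself, and never notices that a new copy is needed.
print(needs_publish(storage, "fresh/versions_T1.gz", honor_opt_out=False))  # False
```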
The final manual mitigation was to:
Then we waited 3 days for BunnyCDN to release a software update that fixed the bug, after which we re-enabled automated pushes.
It would be great to have more contributors interested in this area so that when things go wrong we have better coverage timezone-wise and don’t need to depend on luck. Having a better monitoring and alerting setup for things in BAR infra would also allow us to make the player experience better.
We currently don’t plan to stop using BunnyCDN, as it is still the best fit for what we need. Outages happen, and overall they are dependable. In the future we might want to replace the current redirect edge rule with a more advanced solution, based e.g. on Cloudflare Workers, that would offer more flexibility and control over the freshness guarantees and allow us to implement more exotic features.
It’s not exactly related to this issue, but the overall CDN propagation can be streamlined, especially from the latency perspective. It’s currently a set of crons that can add up to 40m, and a more event-driven solution would allow us to propagate updates from the game repo in <1m, making any hotfixes land much quicker.