Well, we had some downtime starting 11/24. Here's a little explainer on what happened and how we fixed it.
On 11/23, we had scheduled maintenance on our web infrastructure. We have been using Kubernetes for web hosting so we can quickly deploy web API changes, website changes, and other things related to the game and websites. Unbeknownst to us, our SSL certificate was about to expire the next day.
SSL certificates are used to secure websites with HTTPS so data is protected and encrypted. We use free certificates from Let's Encrypt, which is a great service, but its certificates expire every 90 days (about 3 months) and need automated renewal. At some point a few months ago, our certificates stopped renewing, and it was just a coincidence that they expired the day after we did infrastructure maintenance.
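A silent renewal failure like this is the kind of thing a small scheduled check can catch. Here is a minimal sketch in Python, using only the standard library, of reading a certificate's expiry date and counting the days left. It is not what we actually run: the function names and the 30-day warning threshold are assumptions made up for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical threshold: warn when fewer than this many days remain.
WARN_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Nov 24 12:00:00 2023 GMT". Negative means already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_attention(not_after: str, now: datetime) -> bool:
    """True when the certificate is inside the warning window or expired."""
    return days_until_expiry(not_after, now) < WARN_DAYS


def fetch_not_after(host: str, port: int = 443) -> str:
    """Connect to `host` and return its certificate's notAfter string.

    The default context verifies the certificate, so an already expired
    one raises ssl.SSLCertVerificationError here, which is a signal too.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run daily from cron or a similar scheduler, something like `needs_attention(fetch_not_after("example.com"), datetime.now(timezone.utc))` could send an alert weeks before players notice anything, where `example.com` stands in for the real domain.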
The expired SSL certificates caused 24 hours of downtime for the game, and website access was down for longer. The game no longer trusted the connection to the web API server, which it uses to connect to the game server, so the whole game was knocked offline. Login was not affected because that goes through PlayFab's websites under a different domain.
In the end, Kubernetes turned out to be too complicated for us to manage while also trying to make a game. It was set up 2 years ago to make the website quick to deploy, but managing it overall involves too many complications and unknowns. The setup is complicated, the management is complicated, and when something goes wrong it is often hard to tell what happened or why.
So it is time to step away from Kubernetes and go back to simpler Linux servers running Docker. Docker is a container management system that shares a lot of the same technologies as Kubernetes, but on a much smaller scale and with easier-to-understand configurations.
As of now, the remaining websites have been migrated over to their new homes, hopefully with less downtime in the future!