Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.¹ Fifty million players regularly use Roblox every day and, to create the experience our players expect, our scale involves hundreds of internal online services. As with any large-scale service, we have service interruptions from time to time, but the extended length of this outage makes it particularly noteworthy. We sincerely apologize to our community for the downtime.

We're sharing these technical details to give our community an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. We would like to reiterate that no user data was lost or accessed by unauthorized parties during the incident. Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.

The outage was unique in both duration and complexity. The team had to address a number of challenges in sequence to understand the root cause and bring the service back up. Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance.