Introduction to the Bluesky Outage Incident
On Monday, Bluesky suffered an outage lasting approximately eight hours. Jim, a Bluesky systems engineer, acknowledged the severity of the outage and apologized to users. Intermittent downtime on the preceding days had hinted at the underlying problem, which culminated in the prolonged failure. Tracing the technical cause of the event shows where the system was vulnerable.
Precursor Events and Initial Indicators
The issue was first flagged on Saturday, April 4, when a page alerted engineers to potential disruptions. Initial investigation pointed toward a possible network transit issue, but network monitoring showed no irregularities. Spikes of error messages in the logs, however, began to correlate with dips in user-facing traffic, indicating a deeper problem. The errors pointed to port exhaustion: a host running out of free ports for new outbound connections.
The Role of Memcached and Data Plane Design
Bluesky's data plane relies heavily on memcached to cache requests and reduce load on its Scylla database. The observed port exhaustion indicated that memcached was being overwhelmed, likely because connections were being opened and closed in rapid succession. This churn stressed the system, degrading performance and eventually causing downtime as memcached struggled to keep up with the volume.
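To give a sense of how quickly connection churn can exhaust ports, here is a rough back-of-the-envelope estimate in Go. It is a sketch under stated assumptions, not a description of Bluesky's hosts: the port range and TIME_WAIT duration below are common Linux defaults, the incident summary does not state the actual settings, and the function name is hypothetical.

```go
package main

// Assumed Linux defaults. These values are NOT from the incident summary.
const (
	ephemeralPortLow  = 32768 // default lower bound of net.ipv4.ip_local_port_range
	ephemeralPortHigh = 60999 // default upper bound of net.ipv4.ip_local_port_range
	timeWaitSeconds   = 60    // time a closed socket keeps its port in TIME_WAIT
)

// maxNewConnsPerSecond estimates the sustained rate at which one client IP can
// open short-lived connections to a single destination address and port before
// every ephemeral port is either in use or waiting in TIME_WAIT.
func maxNewConnsPerSecond(low, high, timeWaitSec int) int {
	ports := high - low + 1 // number of usable ephemeral ports
	return ports / timeWaitSec
}
```

With the assumed defaults, maxNewConnsPerSecond(ephemeralPortLow, ephemeralPortHigh, timeWaitSeconds) comes to a few hundred new connections per second per destination, which is why reusing pooled connections matters far more than raw request rate.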
Deployment of a New Internal Service
The root cause of the outage was traced to a new internal service deployed the preceding week. The service sent relatively few requests (fewer than three per second), but those requests often contained batches of 15,000 to 20,000 URIs, far exceeding the typical load of 150 post lookups per request. This unexpected volume overwhelmed the RPC endpoint and compounded the strain on memcached.
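One way a caller or handler could keep oversized batches closer to the typical request size is to split them into fixed-size chunks before processing. The following Go sketch illustrates the idea only; the summary does not say that Bluesky adopted this approach, and the chunk function is a hypothetical helper.

```go
package main

// chunk splits uris into consecutive slices of at most size elements.
// The returned slices share memory with uris; nothing is copied.
// A non-positive size yields nil rather than looping forever.
func chunk(uris []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(uris); start += size {
		end := start + size
		if end > len(uris) {
			end = len(uris) // final chunk may be shorter
		}
		out = append(out, uris[start:end])
	}
	return out
}
```

Splitting a 20,000-URI batch with a chunk size of 150 produces 134 chunks, the last of which holds the remaining 50 URIs. Chunking alone does not limit parallelism; it only makes each unit of work predictable in size.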
Concurrency Handling and System Limitations
Bluesky's RPC handlers in the data plane generally implement bounded concurrency, limiting the number of simultaneous operations. However, the problematic endpoint did not enforce this restriction, leading to the creation of thousands of goroutines per request. This lack of concurrency control resulted in excessive connections flooding memcached, culminating in port exhaustion and service outages.
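A minimal Go sketch of bounded concurrency follows, using a buffered channel as a counting semaphore. It is an illustration under assumptions rather than Bluesky's actual handler code: lookupAll, its limit parameter, and the fetch callback (standing in for a cache lookup) are all hypothetical names.

```go
package main

import "sync"

// lookupAll calls fetch once per URI while keeping at most limit calls in
// flight. Results are returned in the same order as uris.
func lookupAll(uris []string, limit int, fetch func(uri string) string) []string {
	if limit < 1 {
		limit = 1 // always allow at least one worker
	}
	results := make([]string, len(uris))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for i, uri := range uris {
		sem <- struct{}{} // blocks while limit goroutines are already running
		wg.Add(1)
		go func(i int, uri string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			results[i] = fetch(uri)  // each goroutine writes only its own index
		}(i, uri)
	}
	wg.Wait()
	return results
}
```

Because the semaphore is acquired before each goroutine starts, a request carrying 20,000 URIs never has more than limit goroutines alive at once, instead of one goroutine per URI. The right value for limit depends on the backing cache and connection pool, which this sketch does not model.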
Lessons Learned and Moving Forward
This incident shows that extensive monitoring is not enough when it rests on untested assumptions. Bluesky has advanced monitoring capabilities, but assumptions about request sizes hid the critical issue. Closing these blind spots and enforcing bounded concurrency on every endpoint will be key to preventing future disruptions. The failure is also a reminder to keep evaluating internal services and how they integrate with the existing architecture.