Cloudflare recently suffered an outage in its log delivery service caused by a bug in a software update. The incident occurred on November 14, lasted approximately 3.5 hours, and resulted in the loss of about 55% of the logs collected during that window, according to Cloudflare's report.
Cloudflare Logs collect event data generated by Cloudflare's services and deliver it to customers for analysis: debugging, validating configurations, and building analytics, particularly from data that originates on servers customers do not control. To make delivery efficient, Cloudflare uses Logpush, which batches logs into fixed-size bundles and pushes them to customer destinations at regular intervals.
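The batching behavior described above can be illustrated with a toy sketch: accumulate log lines and flush either when the batch reaches a size limit or when an interval elapses. This is purely illustrative; the class name, thresholds, and interface are invented here and are not Cloudflare's actual implementation.

```python
import time
from typing import Callable, List


class LogBatcher:
    """Toy sketch of Logpush-style batching (hypothetical, not Cloudflare's code)."""

    def __init__(self, push: Callable[[List[str]], None],
                 max_batch: int = 1000, interval_s: float = 30.0):
        self.push = push            # destination sink, e.g. a customer's storage bucket
        self.max_batch = max_batch  # flush once this many lines are buffered
        self.interval_s = interval_s
        self.buffer: List[str] = []
        self.last_flush = time.monotonic()

    def add(self, line: str) -> None:
        self.buffer.append(line)
        # Flush on size or on elapsed time, whichever comes first.
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.interval_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.push(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

Fixed-size, interval-bounded batches keep per-push overhead predictable for both sender and receiver, which is why this pattern is common in log shipping pipelines.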
The trouble began with a change to Logpush intended to support a new dataset. The update introduced a bug that caused Logfwdr, the service that prepares logs for transmission, to behave as though customers had not configured any log transfers. Logs stopped flowing, resulting in significant data loss.
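This class of failure, a service misreading customer configuration as empty and silently doing nothing, can be sketched in a few lines. The code below is a hypothetical illustration; the real Logfwdr logic is not public, and the function and field names are invented.

```python
def jobs_to_forward(configs):
    """Buggy sketch: decide which log-forwarding jobs should run.

    If the upstream config fetch fails and yields None, a naive
    truthiness check treats 'config unavailable' the same as
    'no customers configured', and all forwarding silently stops.
    """
    if not configs:          # BUG: None (fetch failed) and [] (truly empty)
        return []            # are indistinguishable -> all logs dropped
    return [c for c in configs if c.get("logpush_enabled")]


def jobs_to_forward_fixed(configs):
    """Safer variant: treat missing config as an error, not as 'empty'."""
    if configs is None:
        raise RuntimeError("config unavailable; refusing to drop all jobs")
    return [c for c in configs if c.get("logpush_enabled")]
```

Failing loudly on a missing configuration, rather than defaulting to "nothing to do", turns a silent data-loss bug into an alertable error.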
Although Cloudflare's engineers identified and reverted the change within five minutes, the rollback triggered a second bug in Logfwdr. Instead of transmitting data only for customers with Logpush enabled, the system began processing logs for all customers. The resulting overload caused the system to fail, losing further data.
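The second bug is the opposite failure mode: a fail-open fallback that, when it cannot determine which customers are enabled, defaults to processing everyone and multiplies the load. A hypothetical sketch, with invented names:

```python
def select_customers(customers, fetch_enabled):
    """Buggy sketch of a fail-open fallback.

    Intent: forward logs only for customers with Logpush enabled.
    Bug: when the enabled-set lookup fails, the fallback returns
    *every* customer, amplifying the workload instead of shedding it.
    """
    try:
        enabled = fetch_enabled()   # e.g. fetch the set of Logpush-enabled accounts
    except Exception:
        return list(customers)      # BUG: fail open -> process everyone
    return [c for c in customers if c in enabled]
```

With only a small fraction of customers typically using Logpush, a fallback like this can inflate the workload many times over in an instant, which matches the overload Cloudflare describes.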
Cloudflare acknowledged responsibility for the incident, admitting that its safeguards against this kind of failure were inadequate. The company compared the situation to driving without a seatbelt fastened: safety systems exist, but basic precautions still have to be followed.
In response, Cloudflare plans to add automated alerts so that configuration errors like this one cannot go unnoticed. It also intends to strengthen its testing to prevent similar overloads and failures across its data centers and network in the future.