Mozilla blames 'interlocking complex systems' and confusion for Firefox's May add-on outage
Credit to Author: Gregg Keizer | Date: Fri, 26 Jul 2019 03:00:00 -0700
Mozilla has issued multiple after-action reports analyzing the major mix-up in May that crippled most Firefox add-ons. The reports also made recommendations for preventing similar incidents in the future.
The fiasco started just after 8 p.m. ET on Friday, May 3, when a certificate used to digitally sign Firefox extensions expired. Because Mozilla had neglected to renew the certificate, Firefox assumed add-ons could not be trusted – that they were potentially malicious – and disabled any already installed. Add-ons could not be added to the browser for the same reason.
Mozilla rushed a stop-gap fix to the browser via its Studies system, infrastructure normally responsible for pushing test code to small groups or collecting data on reactions to sponsored content. Because the Studies approach did not reach everyone, on May 5 and May 7 Mozilla shipped two Firefox updates – 66.0.4 and 66.0.5 – that addressed the certificate mess.
“The first question that everyone asks is, ‘How did you let this happen?’” wrote Firefox CTO Eric Rescorla in a post to a company blog. “At a high level, the story seems simple: we let the certificate expire. This seems like a simple failure of planning.”
Rescorla disputed that characterization, however. The situation was “more complicated” than that, he said: the responsible team knew the certificate was expiring but assumed the browser would ignore the expiration date because certificate checking had been disabled in an earlier incident. “This led to confusion about the status of intermediate certificate checking. Moreover, the Firefox QA plan didn’t incorporate testing for certificate expiration and therefore the problem wasn’t detected. This seems to have been a fundamental oversight in our test plan.”
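To make the missing test concrete, here is a minimal sketch in Python of a check that evaluates a signing chain against a future date rather than only today's. It is not Mozilla's code: the Cert class, the function names and the example chain are hypothetical, and all dates are illustrative except the intermediate's expiry, which is set to roughly the reported time (8 p.m. ET on May 3, or midnight UTC on May 4).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


@dataclass(frozen=True)
class Cert:
    """Hypothetical stand-in for a certificate; only the validity window is modeled."""
    name: str
    not_before: datetime
    not_after: datetime


def expired_certs(chain, at):
    """Return the names of certificates in the chain that are not valid at time `at`."""
    return [c.name for c in chain if not (c.not_before <= at <= c.not_after)]


def chain_valid_through(chain, now, horizon):
    """True only if every certificate is valid at `now` and still valid at `now + horizon`."""
    return not expired_certs(chain, now) and not expired_certs(chain, now + horizon)


# Illustrative chain (assumed dates, not Mozilla's actual certificate data).
EXAMPLE_CHAIN = [
    Cert("root", datetime(2015, 1, 1, tzinfo=UTC), datetime(2025, 1, 1, tzinfo=UTC)),
    Cert("intermediate", datetime(2017, 5, 4, tzinfo=UTC), datetime(2019, 5, 4, tzinfo=UTC)),
    Cert("add-on", datetime(2018, 6, 1, tzinfo=UTC), datetime(2028, 6, 1, tzinfo=UTC)),
]

if __name__ == "__main__":
    today = datetime(2019, 4, 1, tzinfo=UTC)
    # The chain looks fine today...
    print(expired_certs(EXAMPLE_CHAIN, today))
    # ...but a 90-day look-ahead flags the upcoming expiry.
    print(chain_valid_through(EXAMPLE_CHAIN, today, timedelta(days=90)))
```

The design point is that the clock is a parameter: a test plan can run the same validation against a date weeks or months ahead, which is the kind of expiration test Rescorla said was absent.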
Others covered the crisis from different angles in separate postmortems, including an incident report and a technical report.
The latter, written by Peter Saint-Andre and Matthew Miller, a principal software engineer and senior staff engineer, respectively, came to a similar conclusion. “This incident was not the fault of any individual or team but was the result of having an interlocking set of complex systems that were not well understood across all the relevant teams,” the two wrote.
Among the details in the Saint-Andre and Miller report was that Mozilla outsources its QA (quality assurance) testing to Cognizant Softvision, a multi-national firm with offices scattered from India and Ukraine to Romania and the U.S. “The lack of in-house or on-call QA resources caused delays in testing proposed fixes across various platforms because our external teams at Softvision were not immediately available through normal channels,” Saint-Andre and Miller said. “In fact, engaging with individual Softvision team members could have introduced legal complications and the potential for data leakage.”
While Rescorla had spelled out some of Mozilla’s failings in a blog post shortly after the add-on outage, he had promised a list of recommended changes for a later missive. His July 12 post and the July 2 report by Saint-Andre and Miller made good on that promise.
Among the recommendations: a rapid-response hotfix-delivery mechanism that could push emergency updates to Firefox users.
In May, Mozilla quickly created a temporary fix for the desktop versions of Firefox and pushed it through the Studies system so that users would not have to wait for a full browser update. Yet some users reported that they didn’t receive the hotfix or that it had not enabled Firefox’s add-ons. For instance, users who had disabled Studies, perhaps for privacy reasons (the mechanism is turned on by default), would not have gotten the patch.
“The lesson here is that we need a mechanism that allows fast updates that isn’t coupled to Telemetry and Studies,” contended Rescorla. “The property we want is the ability to quickly deploy updates to any user who has automatic updates enabled. This is something our engineers are already working on.”
Saint-Andre and Miller echoed Rescorla, but also went into detail discussing the cryptographic signing of Firefox add-ons.
“Most fundamentally, the full Firefox team does not have a common understanding of the role, function and operation of cryptographic signatures for Firefox add-ons,” they wrote. “Although there are several good reasons for signing add-ons (monitoring add-ons not hosted on AMO, blocklisting malicious add-ons, providing cryptographic assurance by chaining add-ons to the Mozilla root), there is no shared consensus on the fundamental rationale for doing so. In addition, maintaining a full public key infrastructure (PKI) is a complex task and we do not necessarily have a firm grasp of the engineering and business tradeoffs involved.”
They recommended that more complete documentation be produced so everyone knows how add-on signing works and, assuming that Mozilla is committed to the current add-on signing approach, that improvements be made “to our certificate management processes, especially our key rollover strategies.”
Mozilla’s incident report included a detailed timeline of the add-on meltdown, from the certificate expiring to the final fix pushed to Firefox users 21 days later. Some clever wag at Mozilla named the graphic Armagaddon-timeline.png.