FCC post-mortem on AT&T outage uncovers similar QA issues to those that plagued CrowdStrike
Mitigation and recommendations
In light of the incident, AT&T has taken “numerous steps” to strengthen its QA and avoid such slip-ups in the future, including additional steps to confirm that “required peer reviews have been completed” before any maintenance work is deployed.
Within 48 hours of the incident, the provider also implemented technical controls to scan the network “for any network elements lacking the controls that would have prevented the outage,” so those controls could be put in place. AT&T is still conducting a forensic investigation of the incident and has also enhanced its network for “robustness and resilience,” according to the report.
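The report does not say which controls were missing or how the scan works. As a rough illustration only, a scan of this kind could compare each network element's enabled controls against a required set. The control names and the inventory layout below are hypothetical and do not come from AT&T or the FCC.

```python
# Hypothetical control names; the FCC report does not list the real ones.
REQUIRED_CONTROLS = {"config_validation", "automatic_rollback"}


def elements_lacking_controls(inventory, required=REQUIRED_CONTROLS):
    """Map each element name to the required controls it is missing.

    `inventory` is an assumed layout: element name -> set of enabled controls.
    Elements that already have every required control are left out.
    """
    gaps = {}
    for name, controls in inventory.items():
        missing = set(required) - set(controls)
        if missing:
            gaps[name] = missing
    return gaps


# Example inventory with made-up element names.
inventory = {
    "router-a": {"config_validation", "automatic_rollback"},
    "router-b": {"config_validation"},
    "switch-c": set(),
}

for element, missing in sorted(elements_lacking_controls(inventory).items()):
    print(element, "is missing:", ", ".join(sorted(missing)))
```

The output of such a scan is a to-do list: every element it names still needs the listed controls before it is protected.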
The FCC also recommended that only previously approved network changes developed “pursuant to internal procedures and industry best practices” be deployed on the AT&T production network in the future. “It should not be possible to load changes that fail to meet those criteria,” the FCC said in the report.
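In practice, a rule like this is often enforced by a gate in the deployment tooling rather than by policy alone. The sketch below is one hypothetical way to express it, assuming a change record that tracks approval and completed peer reviews. The class, field, and function names are illustrative and do not describe AT&T's actual systems.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    # All fields are hypothetical; they are not taken from the FCC report.
    change_id: str
    approved: bool = False
    required_reviewers: set = field(default_factory=set)
    completed_reviews: set = field(default_factory=set)


class DeploymentBlocked(Exception):
    """Raised when a change does not meet the deployment criteria."""


def missing_criteria(change):
    """Return a list of reasons the change may not be deployed (empty if none)."""
    problems = []
    if not change.approved:
        problems.append("change has not been approved")
    outstanding = change.required_reviewers - change.completed_reviews
    if outstanding:
        problems.append("peer reviews outstanding: " + ", ".join(sorted(outstanding)))
    return problems


def deploy(change):
    """Refuse to load any change that fails the criteria; otherwise proceed."""
    problems = missing_criteria(change)
    if problems:
        raise DeploymentBlocked(change.change_id + ": " + "; ".join(problems))
    # A real pipeline would push the change to production here.
    return change.change_id + " deployed"
```

The point of the design is that `deploy` is the only path to production and it raises an error instead of warning, so a change that skipped approval or review cannot be loaded by accident.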
Indeed, proper peer review also could have helped avoid the scenario that befell CrowdStrike on Friday, when “a defect found in a Falcon content update for Windows hosts” delivered the infamous Blue Screen of Death across millions of Windows systems worldwide, resulting in missed flights, closed call centers, and cancelled surgeries.
However, these reviews “are not adequate for the implementation of code at this level of hardware/software risk,” noted Marcus Merrell, principal test strategist at Sauce Labs.
“‘Peer reviews’ imply that a peer is looking over code, to make sure it’s high quality,” he said. “It rarely, if ever, involves actually executing said code on the target hardware in the target environment.”