
Deep Network Outages: What We Can Learn from the Optus Saga

May 22, 2024

On November 8th, 2023, Australian telecommunications provider Optus experienced a major network blackout. Over 10 million people and 400,000 businesses were left without access to the internet for approximately 12 hours. 

This was one of the largest internet outages in Australia’s history and raised concerns over the stability of network supply. It was also a warning shot to other telco providers, who would’ve shuddered at the $61m price tag for a 12-hour problem. The widespread ramifications filtered from the top down, from parent company Singtel, and landed squarely on the shoulders of Kelly Bayer Rosmarin, the company’s chief executive.

But six months later, what the telecommunications industry is really interested in is better understanding deep network outages: what caused this one, how it was fixed, and, of course, how to prevent the next. Here’s everything we’ve learned from the Optus saga:

What is a ‘deep network’ problem?

The term “deep network” was first used to describe the Optus issue by Australian Minister for Communications Michelle Rowland. It refers to the core of a telecommunications network: the components that enable customers’ devices to access internet services. The Optus blackout was caused by a fault in this part of the network.

To fully understand the breakdown, we need to look at the three parts that make up a telecommunications network: the core, the transit, and the access networks. The transit network uses fiber cables to connect the core to the access network, which includes local infrastructure such as mobile phone towers.

Core network outages are usually caused by equipment or cable failures, or by software faults, which can in turn be the result of a cyberattack. In the absence of threat actors, software faults are typically attributed to network updates with unexpected results, which can include the failure of one or more network systems.

What caused the Optus outage

The company issued a statement at the time that said the fault was caused by “changes to routing information” from an international peering network following a routine software upgrade.

“These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves,” the company said.

“The restoration required a large-scale effort of the team and, in some cases, required Optus to reconnect or reboot routers physically, requiring the dispatch of people across a number of sites in Australia. This is why restoration was progressive over the afternoon.” 

At the time, this statement felt vague and light on information pointing to a concrete source of the outage, but it did lend credence to speculation that this was a software-related incident.

Dr. Mark Gregory, an associate professor in the School of Engineering at RMIT University, said the outage was caused by “human error” that triggered a “cascading failure”.

“The Optus statement is poorly worded, but it appears that a routine software upgrade to one or more key routers was the cause of the outage,” explains Gregory. Software upgrades differ from software updates, with upgrades carrying a greater risk of error. A software upgrade is an entirely new piece of software, whereas an update, also referred to as a patch, is simply an enhanced version of the existing one.

“A cascading failure occurred when routing information from an international peering network was received and exceeded preset safety levels on key routers,” says Gregory.

Routing information is used to find the best path between one network on the Internet, the source, and another, the destination. Internet peering is the mutual exchange of traffic between networks, and a router is a device that manages the flow of this traffic.

Too many of these “routing information changes” overwhelmed the key routers, which Gregory says then “disconnected from the Optus IP Core network, bringing down the entire network.”
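The mechanism Gregory describes is broadly similar to a maximum-prefix safety limit on a router’s peering sessions. The short Python sketch below is purely illustrative — it is not Optus’s software or configuration, and the class names and limit values are assumptions — but it shows how a flood of routing changes can push every key router past its preset safety level at once, with each one disconnecting to protect itself.

```python
# Hypothetical illustration of a "preset safety level" (akin to a max-prefix
# limit) on core routers. All names and numbers are illustrative, not Optus's.

class CoreRouter:
    def __init__(self, name, max_routes=500_000):
        self.name = name
        self.max_routes = max_routes   # preset safety level
        self.connected = True          # connected to the IP core

    def receive_routes(self, route_count):
        """Accept a batch of routing information from a peer.

        If the batch pushes the router past its safety level, it protects
        itself by disconnecting from the core network.
        """
        if route_count > self.max_routes:
            self.connected = False
            print(f"{self.name}: route limit exceeded ({route_count} > "
                  f"{self.max_routes}), disconnecting from core")
        else:
            print(f"{self.name}: accepted {route_count} routes")


# A routine upgrade at a peer suddenly propagates far more routes than usual.
core_routers = [CoreRouter(f"router-{i}") for i in range(3)]
for router in core_routers:
    router.receive_routes(route_count=900_000)  # exceeds every safety level

# With every key router disconnected, the core (and the customers behind it)
# has no path to the internet -- the cascading failure Gregory describes.
print("core up:", any(r.connected for r in core_routers))
```

In this sketch, the “protection” behaviour is precisely what turns one bad batch of routes into a total outage: every router applies the same limit, so they all withdraw at the same time.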

How the industry responded to the Optus saga

So, should this outage have been prevented? “Optus has not explained what went wrong with the test process that should have occurred before the routing software upgrade occurred,” says Gregory. “Also, there is no explanation as to why there appears to have been a lack of redundancy of the key routers so that if there were a problem, the key routers would swap to the redundant routers, which you would expect to be running the previous iteration of the software.”
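Gregory’s redundancy point can also be sketched in a few hypothetical lines: a standby router still running the previous software iteration takes over when the upgraded primary fails a health check. Again, the names and logic here are illustrative assumptions, not a description of Optus’s actual architecture.

```python
# Hypothetical failover sketch: if the upgraded primary router is unhealthy,
# traffic swaps to a standby still running the previous software iteration.

class Router:
    def __init__(self, name, software_version, healthy=True):
        self.name = name
        self.software_version = software_version
        self.healthy = healthy

def active_router(primary, standby):
    """Pick the router that should carry traffic right now."""
    return primary if primary.healthy else standby

primary = Router("core-1", software_version="new upgrade", healthy=False)
standby = Router("core-1-standby", software_version="previous release")

serving = active_router(primary, standby)
print(f"Traffic served by {serving.name} running {serving.software_version}")
# -> Traffic served by core-1-standby running previous release
```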

Weighing in on this matter, Mark Stewart, a Research Fellow at the Centre for Defence Communications and Information Networking at The University of Adelaide, added that “a major telco should have a disaster recovery plan which is more sophisticated than your average corporate network. At a minimum, they should have had a plan to revert the changes or remotely reboot their systems.”

He believes that the Optus incident is indicative of far more than one organization’s failure; rather, it spotlights the fragility of the entire industry. The bigger issue is that telecommunications is a critical sector, relied upon by hospitals, public transport, and the banking and finance sectors.

Graeme Hughes, the director of the Griffith Business Lab at Griffith University, added his concerns, saying: “In an era where society heavily depends on interconnected technology, establishing trust in service providers is crucial from a consumer standpoint.”

Adding to the frustration, many were incensed to find they couldn’t contact emergency services despite being assured they would be able to. At the time of the outage, affected customers were encouraged to call 000 in order to reach the ambulance, the fire brigade, and other first responders. Landline customers were unable to do so. Mobile customers, however, were automatically switched over to other carriers in accordance with Australian regulations.

Hughes’ final statement, perhaps, sums up the biggest lesson for consumers. “For government, business, and domestic users of internet and phone services, there are some clear lessons from the Optus outage. Don’t have all your phones and Internet provided by the one company. If you are providing safety critical services, have connections to multiple networks.”

Optus will appear before the Senate and will undergo a review by the Federal Government to examine the major impacts of the network failure and to produce best-practice guidance to prevent a similar occurrence.

Key Learnings and Recommendations

For CIOs, the Optus saga was a stark reminder that resilience and recovery are crucial to success. It should come as no surprise that many were prompted to review and reassess their own disaster recovery plans. Coupled with Kelly Bayer Rosmarin’s resignation, the incident underscores the severity of the issue and the stakes involved when a major telecommunications company suffers a blackout. Here are some of the key takeaways:

1. Recovery plans are crucial

The Senate inquiry revealed that Optus “didn’t have a plan in place for that specific scale of outage,” according to Lambo Kanagaratnam, Optus’ Managing Director of Networks. Adding fuel to the fire, it emerged that while customers were left without a fallback, Bayer Rosmarin had prepared for a potential issue by carrying a spare Vodafone SIM card.

2. Prioritize customer connectivity (even if limited)

During the blackout, Optus directed the public to dial 000 if they needed to contact emergency services. Approximately 228 of these calls weren’t connected, one of which came from a man contacting medical services on behalf of a colleague having a heart attack. This highlighted just how delicate and interconnected Australian systems are and placed a spotlight on how CIOs are responsible for more than just telecommunications; their services are sometimes a lifeline, and a failure for one can impact many. Customer connectivity is crucial and needs to be restored as soon as possible, even if in a limited form.

3. Communication is crucial

Optus came under fire for offering vague explanations that left even the most senior telco experts confused and dissatisfied. By the time a statement was put together, several opportunities to address its broad customer base, including the morning news, had been missed. Many customers were left disgruntled when they discovered the outage for themselves, and without concrete information, assumptions were made and a media narrative formed. CIOs would do well to ensure that a crisis communication pack is updated regularly and that the media is kept well-informed in an effort to contain the ensuing panic and collective anger.

For CIOs and other C-suite executives, these incidents are an exercise in collaboration. They require more than just IT systems management; leaders are called upon to proactively juggle disaster recovery in all its forms: customer support, media management, and strategic prioritization. Without a doubt, the Optus saga was an unintended lesson in how to strengthen defenses and respond to challenges.