CenturyLink’s outage started in Denver, spread across country
ATMs failed in Idaho, lottery results were delayed in Wyoming, and 911 call centers in Washington, Arizona, Missouri and other states struggled with busy signals, dropped calls and missing location information.
At the North Colorado Medical Center in Greeley, staff couldn’t access vital patient records online. And in parts of New Mexico and Montana, Verizon faced service disruptions through no fault of its own.
Press reports have linked a long list of troubles to network problems suffered by telecommunications company CenturyLink, based in Monroe, La., two days after Christmas.
For about 30 hours, from the early morning hours of Dec. 27 until late on Dec. 28, chaos reigned on CenturyLink’s system. Western states that depend most heavily on the company’s fiber-optic system were hardest hit, but reports of outages and slower speeds came in from Alaska to Florida, according to downdetector.com.
“CenturyLink experienced a network event on one of our six transport networks beginning on December 27 that impacted voice, IP, and transport services for some of our customers. The event also impacted CenturyLink’s visibility into our network management system, impairing our ability to troubleshoot and prolonging the duration of the outage,” the company said in a statement.
Technicians scrambled to pinpoint the root cause, losing time on fixes that didn’t work. New Orleans was an early suspect as ground zero, then San Antonio, Texas. Teams, which had to make physical site visits, went into action in Kansas City, Mo., then Atlanta, and so on.
But as they tried fixes in different areas, the problem didn’t go away. Making matters worse, the reporting system that gathered customer complaints also failed.
The source of all that turmoil and hours of angst for affected customers came down to one piece of equipment — a faulty third-party network management card in Denver, according to the company.
But how could one bad piece of equipment in Denver disrupt internet and phone service in large swaths of the country and impair critical services to thousands of customers for hours on end? And could it happen again?
Those are two questions the Federal Communications Commission, which has launched an investigation, wants answered, not to mention state utility regulators, computer scientists and irate customers.
A Sorcerer’s Apprentice
In the classic Disney film “Fantasia,” Mickey Mouse casts a spell on a broom to get it to carry the water buckets that he, as the apprentice, is using to fill a cistern for the sorcerer, who has just left the room.
Mickey then falls asleep and things go horribly wrong. The broom carries way too much water. Waking and realizing his predicament, Mickey tries to smash the broom to pieces. But the splinters turn into dozens of new brooms, carrying hundreds of buckets of water. The chamber gets flooded.
Computer scientists borrowed the term “Sorcerer’s Apprentice Syndrome” to describe what happens when a part of a network sends out “packets” of bad information that then get replicated and sent out over and over, said Craig Partridge, chair of the computer science department at Colorado State University in Fort Collins and a member of the Internet Hall of Fame.
Eventually, the system gets bogged down and can crash until the source of the problem is identified and the bad packets, which can keep ricocheting around, are cleared out of the system.
“The packet has a mistake. It thinks it is supposed to make lots of copies and send it anywhere. It then overloads the whole network,” said Partridge.
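In schematic terms, the runaway replication Partridge describes works like compound growth: every copy of the bad packet triggers more copies. A toy simulation (illustrative only, not a model of CenturyLink’s actual equipment) shows how quickly one packet can swamp a network:

```python
def simulate_storm(copies_per_packet=3, capacity=10_000):
    """Toy model of a packet storm: in each round, every malformed
    packet in flight is copied and re-sent several times. Returns how
    many rounds it takes to exceed the network's capacity."""
    packets = 1        # one faulty packet starts it all
    generations = 0
    while packets <= capacity:
        packets *= copies_per_packet  # each packet spawns copies
        generations += 1
    return generations

# Just 9 rounds of threefold copying turns a single bad packet
# into more than 10,000.
print(simulate_storm())  # → 9
```

The numbers are arbitrary, but the shape of the problem is not: exponential growth means the network can go from healthy to saturated in moments.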
Partridge said he doesn’t have any specific knowledge of this outage, but based on public reports, CenturyLink appears to have suffered from what is a well-known problem that has plagued digital networks since their earliest days.
CenturyLink said the card was propagating “invalid frame packets” that were sent out over its secondary network, which controlled the flow of data traffic.
Here is a description of the Sorcerer’s Apprentice Syndrome at work, in the more technical terms provided by the company:
“Once on the secondary communication channel, the invalid frame packets multiplied, forming loops and replicating high volumes of traffic across the network, which congested controller card CPUs (central processing unit) network-wide, causing functionality issues and rendering many nodes unreachable,” the company said in a statement.
Once the syndrome gets going, it can be difficult to trace back to its original source and to stop, a big reason networks are designed to isolate failures early and contain them.
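One standard containment mechanism in ordinary IP networks is a time-to-live (TTL) counter: each packet carries a hop count that every router decrements, so a packet caught in a loop eventually dies instead of circulating forever. A simplified sketch of the idea (not CenturyLink’s specific safeguard):

```python
def forward(packet, route):
    """Follow a packet along a (possibly looping) route, decrementing
    its TTL at each hop. A looped packet is discarded once the
    counter hits zero, rather than circulating indefinitely."""
    hops = 0
    for node in route:
        if packet["ttl"] == 0:
            return f"dropped at {node} after {hops} hops"
        packet["ttl"] -= 1
        hops += 1
    return "delivered"

# A routing loop: the packet bounces between the same two nodes.
loop = ["R1", "R2"] * 100          # 200 hops of going in circles
print(forward({"ttl": 8}, loop))   # dropped, not amplified forever
```

The invalid frames in this outage evidently evaded safeguards of this kind, which is part of what investigators want to understand.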
“We have learned through experience about these different types of failure modes. We build our systems to try and localize those failures,” Partridge said. “I would hope that what is going on is that CenturyLink is trying to understand why a relatively well-known failure mode has bit them.”
To resolve the problem, CenturyLink said it removed the network card at fault, disabled the channels that allowed for invalid traffic to get replicated across its network, and put in filters to catch the bad data.
It set up a more intense monitoring plan to spot problems faster and to terminate rogue packets before they can propagate. That took care of the bulk of problems, but a small group of customers had issues that were fixed case-by-case into a third day.
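CenturyLink hasn’t published its filter rules, but the general idea of screening out malformed frames can be sketched in a few lines. The field names below are hypothetical; real network gear does this validation in hardware, on actual protocol headers:

```python
def is_valid_frame(frame: dict) -> bool:
    """Reject malformed frames -- e.g. missing required header fields
    or carrying a length that doesn't match the payload. Field names
    are illustrative, not an actual telecom protocol."""
    required = ("source", "destination", "length", "payload")
    if any(key not in frame for key in required):
        return False
    # the advertised length must match the payload actually carried
    return frame["length"] == len(frame["payload"])

inbound = [
    {"source": "A", "destination": "B", "length": 5, "payload": "hello"},
    {"source": "A", "destination": "B", "length": 99, "payload": "hi"},  # bad length
    {"destination": "B", "length": 2, "payload": "hi"},                  # no source
]
passed = [f for f in inbound if is_valid_frame(f)]
print(len(passed))  # → 1: only the well-formed frame gets through
```

A filter like this stops bad frames at the door rather than letting them replicate downstream.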
“CenturyLink teams worked around the clock until the issue was resolved,” said spokeswoman Linda Johnson. CenturyLink, which purchased Qwest Communications and Level 3 Communications, is an important employer in metro Denver.
A question of trust
When an airplane crashes, federal investigators will look for the black box and painstakingly reassemble every piece they can find to determine precisely what went wrong. If it was a mechanical issue, an order will go out for an inspection, fix or replacement. If it was pilot error, new training rules are put in place.
The nation’s vital communication networks, however, are much less regulated than the airways and power grid. Even if similar protocols were in place after a failure, problems in the flow of light packets and voice signals are much more ephemeral and tougher to pin down.
“It is so unlikely they can reproduce the situation,” said Dirk Grunwald, a professor of computer science at the University of Colorado Boulder, who has witnessed scenarios where problematic components get plugged back in and work fine.
All hell might have broken loose because one bit of information in a packet came in sequence with another specific bit while the card was operating at a certain speed. A few milliseconds later or at a slightly different speed and the wicked spell may not have been cast, Grunwald said.
A more pertinent line of investigation is why the card didn’t signal it was having problems and take itself out of the game, as it was supposed to. The card was also encapsulating the faulty data, which allowed it to keep moving across the network, an issue the outside vendor is trying to understand, according to CenturyLink.
Beyond that, why didn’t other network safeguards keep the problem from getting out of hand?
Dan Massey, a computer science professor at the University of Colorado Boulder, said networks operate from an implicit assumption of trust as they communicate — “Be conservative in what you send and liberal in what you accept.”
Components assume the information they are receiving is coming from good players, not rogue or defective ones.
Most of the time, pick up a phone or go online and the process is smooth and seamless. What isn’t readily known is that technicians are constantly chasing problems and replacing parts and the system is making adjustments. It might even happen in the middle of a call, without a blip.
What networks struggle with is when a component goes bad but pretends to be normal, a failure known as a Byzantine Fault. If that fault happens in the “control plane” — the system that manages the flow of data and the problem detection systems — then things can spiral down quickly, Massey said.
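The classic defense against a component that lies about its own health is redundancy plus cross-checking: don’t trust any single report, act only on what a majority agrees on. A toy sketch of the idea (the controller names and states are invented for illustration):

```python
from collections import Counter

def majority_vote(readings):
    """With a Byzantine component in the mix, no single report can be
    trusted -- but if most components are honest, comparing their
    reports exposes the liar. Act only on a strict majority."""
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) // 2 else None

# Two honest controllers report "congested"; the faulty one,
# pretending all is well, insists "normal". The majority wins.
print(majority_vote(["congested", "congested", "normal"]))  # → congested
```

Real Byzantine fault tolerance is far more involved, but the principle is the same: a faulty part that pretends to be normal can only be caught by comparing it against its peers.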
Imagine cars on the road as bundles of information moving to where they need to go. If too many cars are in motion, then traffic will crawl to a halt. There might even be an accident. But communications networks are designed with a lot of spare capacity and an ability to clear accidents quickly and reroute traffic when jams appear.
That’s if the control plane is working. Now imagine if the traffic lights start acting erratically, like turning all the lights at an intersection red, or even worse, all of them green. That is a simplified way of describing the chaos CenturyLink technicians were dealing with.
But it didn’t take everything down. One of the six transport systems in CenturyLink’s network had problems, according to the company. That is why customers in Greeley and some mountain towns reported issues, while many customers in Denver and other areas didn’t notice anything amiss.
Don’t fail when it comes to 911
It is one thing if people can’t play Fortnite or binge “The Marvelous Mrs. Maisel” because of slow speeds. It is an entirely different problem when 911 calls are disrupted, a reason CenturyLink is now facing an investigation from the FCC.
Johnson said that 911 calls were “largely completed” but that in some cases, the location information didn’t tag along. But press reports say some callers to 911 centers faced busy signals and dropped calls. Utility regulators in Wyoming and Washington state have said they will launch inquiries.
“The Colorado PUC has not opened its own investigation. However, the FCC has asked the states to help it gather information regarding the extent and impact of the outages, and PUC staff is assisting with the FCC’s investigation,” said Terry Bote, a spokesman for the state’s utility regulator.
Massey, who worked on cybersecurity issues at the Department of Homeland Security before joining CU, said most states have invested very little in cybersecurity and other safeguards when it comes to their 911 centers. They are not as failproof as they need to be.
The transition from analog to digital has left the nation’s 911 call centers much more capable, allowing them to better handle calls from mobile phones and even signals from automobiles involved in a crash. But it has also left those centers much less robust, as the problems on Dec. 27 showed.
Partridge said a deeper examination may show CenturyLink was doing everything right and it was hit by an entirely new and unexpected kind of failure. If so, the company, its vendors, and the computer science community will work on fixes.
But if an old-style Sorcerer’s Apprentice Syndrome was at fault, then blaming an outside party won’t fly.
“The network should not be so fragile that when you install third-party equipment and it fails, your network fails. Your network needs to be robust. That is standard operating procedure,” he said.