General Discussion
Centurylink outage caused by a "single network card sending bad packets."
An outage that affected customers across the US, disabled 911 systems in several states, and lasted 50 hours.
Computer people, does that make sense? Is it excusable?
Link to tweet
@GossiTheDog
After a 50 hour outage at 15 datacenters across the US impacting cloud, DSL, and 911 services CenturyLink say the outage is fixed, and was caused by a single network card sending bad packets (they've since applied bad packet filtering).
htuttle
(23,738 posts)...in running an average e-commerce website selling shoes than for Century Link's 911 system.
The fact that there was one of ANYTHING tells me that they were doing it wrong. Very wrong.
Usually comes down to greed. Redundancy and reliability cost money.
Hermit-The-Prog
(33,328 posts)Obviously, they are paying too much for engineers and not enough for upper management perks and bonuses. Cut back on peons and kick more money up the chain!
Yo_Mama_Been_Loggin
(107,922 posts)As more things became Ethernet-based, redundancy seemed to be abandoned.
yonder
(9,663 posts)the consensus is one of skepticism with respect to Century Link's explanation.
I still suspect we're not getting the full story here, for whatever reason.
TwistOneUp
(1,020 posts)This is what happens when you import "techs" on H-1B visas for minimum "6 guys in one bedroom" wages instead of using real IT people paid a real wage. You "hire a price" instead of "hire an experienced professional," and when stuff like this happens you find out the difference between price and professional.
No one component should be able to affect anything other than the hardware in which it is installed. You plan for a piece of hardware to fail by having n+1 redundancy and quality-control checking the component output whenever possible or whenever "something going wrong" can have a dramatic impact.
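The n+1 design described above can be sketched as a toy model. Everything here (the component names, the validity check, the quarantine policy) is illustrative, not CenturyLink's actual architecture:

```python
# Toy model of n+1 redundancy with output checking: each request is
# served by a healthy component, and any component whose output fails
# validation is quarantined instead of poisoning the wider system.

def valid(frame: bytes) -> bool:
    # Hypothetical sanity check: non-empty and within standard Ethernet size.
    return 0 < len(frame) <= 1518

class RedundantPool:
    def __init__(self, components):
        self.healthy = list(components)   # n+1 components for n's worth of load

    def send(self, payload: bytes) -> bytes:
        while self.healthy:
            comp = self.healthy[0]
            frame = comp(payload)
            if valid(frame):
                return frame
            # Component is "up" but emitting bad output: take it out of rotation.
            self.healthy.pop(0)
        raise RuntimeError("no healthy components left")

good = lambda p: p                 # passes the payload through intact
bad = lambda p: b"\x00" * 4000     # "working" card that emits oversized junk

pool = RedundantPool([bad, good])  # bad card fails silently, spare takes over
print(pool.send(b"hello"))         # b'hello'
```

The point of the sketch is the poster's argument: the failure of one component is detected at its output and absorbed by the spare, rather than propagating.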
Now that we've seen how easily CenturyLink's hosting can fail by allowing one component to impact such a wide area of service within their operation, would you choose them? I use hosting companies, and I rate them an "avoid" based on this information.
Recursion
(56,582 posts)That's definitely not "6 guys in one bedroom" in St. Louis, which is where the problem was.
pnwmom
(108,976 posts)Of course, someone making $75K there should be able to afford his own apartment, also.
Recursion
(56,582 posts)Good catch, thanks
TwistOneUp
(1,020 posts)With H-1Bers who were living with between four and six other guys in one room, but that is here in the Bay Area, where rents are sky-high. I don't make this stuff up, and I regularly talk to other old-schoolers who either cannot find work or find work priced the same as it was in the early '80s. The outsourcers and "consulting firms" who mass-apply for the H-1Bs fire the experts who have been doing this same work for 30 years or more and hire rookies to replace them. Why? Labor costs and ageism.
Recursion
(56,582 posts)That's the price of working in Silicon Valley and it's why I don't do it anymore.
H1-Bs are competing against the richest 20% of Americans; I'm fine with that. I could see raising the minimum to maybe $100K, but the program fundamentally works.
lastlib
(23,216 posts)Out here, they're my only viable option for "high-speed" service, and they suck. I think you're hitting the nail on the head--they're too cheap to pay real professionals to provide a quality product. I haven't experienced any problem during this "outage", but my overall experience with them has been that they have a pretty half-a**ed product.
trof
(54,256 posts)Coastal Alabama.
Recursion
(56,582 posts)Assuming they actually mean "frame" but are using the word "packet" because it's more familiar to people, then, yes, it's absolutely possible for one bad network card to do this, which is why you're supposed to use filters in your frame relay devices (which they apparently do now).
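The kind of ingress filtering mentioned above can be sketched as a relay that drops malformed frames instead of forwarding them. The specific checks here (length bounds, an EtherType allow-list) are illustrative guesses, not CenturyLink's actual filter rules:

```python
# Sketch of frame sanity filtering: forward only frames that look
# well-formed, so one card emitting garbage can't flood the network.

ALLOWED_ETHERTYPES = {0x0800, 0x0806, 0x86DD}    # IPv4, ARP, IPv6

def is_well_formed(frame: bytes) -> bool:
    if not (14 <= len(frame) <= 1518):           # Ethernet header + MTU bounds
        return False
    ethertype = int.from_bytes(frame[12:14], "big")
    return ethertype in ALLOWED_ETHERTYPES

def relay(frames):
    """Forward only frames that pass the sanity filter."""
    return [f for f in frames if is_well_formed(f)]

good = b"\xff" * 12 + b"\x08\x00" + b"payload"   # dst+src MACs, IPv4, payload
junk = b"\x00" * 5000                            # oversized garbage from a bad card
print(len(relay([good, junk, good])))            # 2
```

Real gear does this in hardware at line rate, but the logic is the same: validate at the boundary, drop early.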
Also: PSTN over IP? Ewww.
Midnightwalk
(3,131 posts)The problem is it sounds like the card didn't die. It failed in a way that made it appear to be working while it was actually putting out bad packets.
It is difficult to deal with these types of failures, where a component doesn't simply die but keeps running without working correctly. Still, there are mechanisms that are supposed to handle this, and there are almost always other contributing issues: code bugs in handling a particular failure, misconfiguration that might have been detectable, sometimes another simultaneous failure. In this case it sounds like bad-packet filtering should have been enabled.
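One standard defense against this kind of "looks alive but corrupts data" failure is an end-to-end checksum the faulty component can't accidentally satisfy: the sender attaches a CRC and the receiver verifies it. A minimal sketch (illustrative, not the actual protocol involved):

```python
# End-to-end integrity check: a component that silently corrupts data
# in transit will be caught at the receiver, because the CRC attached
# by the sender no longer matches the payload.
import zlib

def attach_crc(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify(frame: bytes) -> bytes:
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupt frame: checksum mismatch")
    return payload

frame = attach_crc(b"911 call setup")
print(verify(frame))                             # b'911 call setup'

flipped = bytes([frame[0] ^ 0x01]) + frame[1:]   # one bit flipped in transit
try:
    verify(flipped)
except ValueError as e:
    print(e)                                     # corrupt frame: checksum mismatch
```

Ethernet already carries a frame check sequence, but a card failing in this mode can recompute a valid FCS over corrupted data, which is why higher-layer checks matter too.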
The other question that I have is why it took 50 hours to isolate the issue and restore service. I would guess they will be working that part of it as well. The time of year and getting people back from vacation might have added some number of hours, but there could be other issues to address.
You asked if it was excusable. It isn't acceptable and there should be a laundry list of actions for CenturyLink and their vendor.
Disclaimer: Not a network person and have no actual knowledge of what happened
whistler162
(11,155 posts)or misconfigured switches are in the mix.
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750x_3560x/software/release/12-2_53_se/configuration/guide/3750xscg/swtrafc.html
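The linked Cisco page covers traffic storm control: a port that sees broadcast (or multicast/unknown-unicast) traffic above a configured threshold gets throttled or shut down. A toy version of that decision logic, with made-up numbers:

```python
# Toy storm control: per sampling interval, block a port whose
# broadcast packets-per-second exceed a configured threshold.
# The threshold and samples are illustrative, not Cisco defaults.

def storm_control(broadcast_pps_samples, threshold_pps=1000):
    """Return per-sample port state: 'forward' or 'blocked'."""
    return ["blocked" if pps > threshold_pps else "forward"
            for pps in broadcast_pps_samples]

print(storm_control([200, 5000, 300]))  # ['forward', 'blocked', 'forward']
```

On real switches this runs in hardware per interface, and the action on exceeding the threshold (drop traffic, err-disable the port, send a trap) is configurable.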
TheBlackAdder
(28,183 posts).
Almost all residential and many corporate computers, routers, and operating systems have NSA backdoors in them, which allow searches of up to one million devices per warrant. This allows the government to view, add, remove, or modify code on any device without leaving any footprint.
Some of these tools were part of Snowden's NSA release, which now allows hackers, foreign adversaries, foreign friends, nation states, etc. to infiltrate almost every computer system in the country at will. It was uncovered a couple of years ago that Cisco had NSA code installed on some of their routers and switches.
.
Midnightwalk
(3,131 posts)Just because they are so ubiquitous.
On that link you provided (thanks) did you notice the best practices PDF? To me that looked like a nightmare if you had to change those settings. Not just to figure out what to change, but then to get each switch set up the same. I didn't look to see whether the defaults were the recommended values. They must have a way to make that simpler?
Old means higher failure rates, but it can also mean bug fixes were never backported, sometimes because of hardware differences. Sometimes you find that a configuration default should be changed, but it is generally not a good idea to change defaults on an existing model. It would be interesting to know whether bad-packet filtering defaulted to enabled.
My main point was that lack of redundancy or a simple hardware failure is usually not the whole reason a system like this fails. There are usually multiple factors behind both the outage and the long recovery time. That's actually reassuring.
Sorry to go on, or if I'm saying obvious stuff. This is a topic that interests me, and it's fun doing some speculation, particularly when I'm not involved in any way.
Igel
(35,300 posts)So instead of having just one problem, they had a few places to deal with and needed to first realize they had several problem sites.