General Discussion
Centurylink outage caused by a "single network card sending bad packets."
An outage that affected customers across the US, disabled 911 systems in several states, and lasted 50 hours.
Computer people, does that make sense? Is it excusable?
Link to tweet
@GossiTheDog
After a 50 hour outage at 15 datacenters across the US impacting cloud, DSL, and 911 services CenturyLink say the outage is fixed, and was caused by a single network card sending bad packets (they've since applied bad packet filtering).
htuttle
(23,738 posts)...in running an average e-commerce website selling shoes than for Century Link's 911 system.
The fact that there was one of ANYTHING tells me that they were doing it wrong. Very wrong.
Usually comes down to greed. Redundancy and reliability cost money.
Hermit-The-Prog
(33,328 posts)Obviously, they are paying too much for engineers and not enough for upper management perks and bonuses. Cut back on peons and kick more money up the chain!
Yo_Mama_Been_Loggin
(107,922 posts)As more things became Ethernet-based, redundancy seemed to be abandoned.
yonder
(9,663 posts)the consensus is one of skepticism with respect to Century Link's explanation.
I still suspect we're not getting the full story here, for whatever reason.
TwistOneUp
(1,020 posts)This is what happens when you import "techs" on H-1B visas for minimum "6 guys in one bedroom" wages instead of using real IT people paid a real wage. You "hire a price" instead of "hire an experienced professional," and when stuff like this happens you find out the difference between price and professional.
No one component should be able to affect anything other than the hardware in which it is installed. You plan for a piece of hardware to fail by having n+1 redundancy and quality-control checking the component output whenever possible or whenever "something going wrong" can have a dramatic impact.
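The n+1 design described above can be sketched as a toy model. Everything here (the component names, the validity check, the quarantine policy) is illustrative, not CenturyLink's actual architecture:

```python
# Toy model of n+1 redundancy with output checking: each request is
# served by a healthy component, and any component whose output fails
# validation is quarantined instead of poisoning the wider system.

def valid(frame: bytes) -> bool:
    # Hypothetical sanity check: non-empty and within standard Ethernet size.
    return 0 < len(frame) <= 1518

class RedundantPool:
    def __init__(self, components):
        self.healthy = list(components)   # n+1 components for n's worth of load

    def send(self, payload: bytes) -> bytes:
        while self.healthy:
            comp = self.healthy[0]
            frame = comp(payload)
            if valid(frame):
                return frame
            # Component is "up" but emitting bad output: take it out of rotation.
            self.healthy.pop(0)
        raise RuntimeError("no healthy components left")

good = lambda p: p                 # passes the payload through intact
bad = lambda p: b"\x00" * 4000     # "working" card that emits oversized junk

pool = RedundantPool([bad, good])  # bad card fails silently, spare takes over
print(pool.send(b"hello"))         # b'hello'
```

The point of the sketch is the poster's argument: the failure of one component is detected at its output and absorbed by the spare, rather than propagating.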
Now that we've seen how easily CenturyLink's hosting can fail by allowing one component to impact such a wide area of service within their operation, would you choose them? I use hosting companies, and I rate them an "avoid" based on this information.
Recursion
(56,582 posts)That's definitely not "6 guys in one bedroom" in St. Louis, which is where the problem was.
pnwmom
(108,976 posts)Of course, someone making $75K there should be able to afford his own apartment, also.
Recursion
(56,582 posts)Good catch, thanks
TwistOneUp
(1,020 posts)With H-1Bers who were living with between four and six other guys in one room, but that is here in the Bay Area, where rents are sky-high. I don't make this stuff up, and I regularly talk to other old-schoolers who either cannot find work or find work priced the same as it was in the early '80s. The outsourcers and "consulting firms" who mass-apply for the H-1Bs fire the experts who have been doing this same work for 30 years or more and hire rookies to replace them. Why? Labor costs and ageism.
Recursion
(56,582 posts)That's the price of working in Silicon Valley and it's why I don't do it anymore.
H1-Bs are competing against the richest 20% of Americans; I'm fine with that. I could see raising the minimum to maybe $100K, but the program fundamentally works.
lastlib
(23,216 posts)Out here, they're my only viable option for "high-speed" service, and they suck. I think you're hitting the nail on the head--they're too cheap to pay real professionals to provide a quality product. I haven't experienced any problem during this "outage", but my overall experience with them has been that they have a pretty half-a**ed product.
trof
(54,256 posts)Coastal Alabama.
Recursion
(56,582 posts)Assuming they actually mean "frame" but are using the word "packet" because it's more familiar to people, then, yes, it's absolutely possible for one bad network card to do this, which is why you're supposed to use filters in your frame relay devices (which they apparently do now).
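The kind of ingress filtering mentioned above can be sketched as a relay that drops malformed frames instead of forwarding them. The specific checks here (length bounds, an EtherType allow-list) are illustrative guesses, not CenturyLink's actual filter rules:

```python
# Sketch of frame sanity filtering: forward only frames that look
# well-formed, so one card emitting garbage can't flood the network.

ALLOWED_ETHERTYPES = {0x0800, 0x0806, 0x86DD}    # IPv4, ARP, IPv6

def is_well_formed(frame: bytes) -> bool:
    if not (14 <= len(frame) <= 1518):           # Ethernet header + MTU bounds
        return False
    ethertype = int.from_bytes(frame[12:14], "big")
    return ethertype in ALLOWED_ETHERTYPES

def relay(frames):
    """Forward only frames that pass the sanity filter."""
    return [f for f in frames if is_well_formed(f)]

good = b"\xff" * 12 + b"\x08\x00" + b"payload"   # dst+src MACs, IPv4, payload
junk = b"\x00" * 5000                            # oversized garbage from a bad card
print(len(relay([good, junk, good])))            # 2
```

Real gear does this in hardware at line rate, but the logic is the same: validate at the boundary, drop early.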
Also: PSTN over IP? Ewww.
Midnightwalk
(3,131 posts)The problem is it sounds like the card didn't die. It failed in a way that made it appear to be working while it was actually putting out bad packets.
It is difficult to deal with these types of failures, where a component doesn't simply die but keeps running without working correctly. Still, there are mechanisms that are supposed to handle this, and there are almost always other contributing issues: code bugs in handling a particular failure, misconfiguration that might have been detectable, sometimes another simultaneous failure. In this case it sounds like bad-packet filtering should have been enabled.
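One standard defense against this kind of "looks alive but corrupts data" failure is an end-to-end checksum the faulty component can't accidentally satisfy: the sender attaches a CRC and the receiver verifies it. A minimal sketch (illustrative, not the actual protocol involved):

```python
# End-to-end integrity check: a component that silently corrupts data
# in transit will be caught at the receiver, because the CRC attached
# by the sender no longer matches the payload.
import zlib

def attach_crc(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify(frame: bytes) -> bytes:
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupt frame: checksum mismatch")
    return payload

frame = attach_crc(b"911 call setup")
print(verify(frame))                             # b'911 call setup'

flipped = bytes([frame[0] ^ 0x01]) + frame[1:]   # one bit flipped in transit
try:
    verify(flipped)
except ValueError as e:
    print(e)                                     # corrupt frame: checksum mismatch
```

Ethernet already carries a frame check sequence, but a card failing in this mode can recompute a valid FCS over corrupted data, which is why higher-layer checks matter too.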
The other question that I have is why it took 50 hours to isolate the issue and restore service. I would guess they will be working that part of it as well. The time of year and getting people back from vacation might have added some number of hours, but there could be other issues to address.
You asked if it was excusable. It isn't acceptable and there should be a laundry list of actions for CenturyLink and their vendor.
Disclaimer: Not a network person and have no actual knowledge of what happened
whistler162
(11,155 posts)or misconfigured switches are in the mix.
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750x_3560x/software/release/12-2_53_se/configuration/guide/3750xscg/swtrafc.html
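The linked Cisco page covers traffic storm control: a port that sees broadcast (or multicast/unknown-unicast) traffic above a configured threshold gets throttled or shut down. A toy version of that decision logic, with made-up numbers:

```python
# Toy storm control: per sampling interval, block a port whose
# broadcast packets-per-second exceed a configured threshold.
# The threshold and samples are illustrative, not Cisco defaults.

def storm_control(broadcast_pps_samples, threshold_pps=1000):
    """Return per-sample port state: 'forward' or 'blocked'."""
    return ["blocked" if pps > threshold_pps else "forward"
            for pps in broadcast_pps_samples]

print(storm_control([200, 5000, 300]))  # ['forward', 'blocked', 'forward']
```

On real switches this runs in hardware per interface, and the action on exceeding the threshold (drop traffic, err-disable the port, send a trap) is configurable.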
TheBlackAdder
(28,183 posts).
Almost all residential and many corporate computers, routers, and operating systems have NSA backdoors in them, which allow searches of up to one million devices per warrant. This allows the government to view, add, remove, or modify code on any device without leaving any footprint.
Some of these tools were part of Snowden's NSA release, which now allows hackers, foreign adversaries, foreign friends, nation states, etc. to infiltrate almost every computer system in the country at will. It was uncovered a couple of years ago that Cisco had NSA code installed on some of their routers and switches.
.
Midnightwalk
(3,131 posts)Just because they are so ubiquitous.
On that link you provided (thanks) did you notice the best practices PDF? To me that looked like a nightmare if you had to change those settings. Not just to figure out what to change, but then to get each switch set up the same. I didn't look to see whether the defaults were the recommended values. They must have a way to make that simpler?
Old means higher failure rates, but it can also mean bug fixes were never backported, sometimes because of hardware differences. Sometimes you find that a configuration default should be changed, but it is generally not a good idea to change defaults on an existing model. It would be interesting to know whether bad-packet filtering defaulted to enabled.
My main point was that lack of redundancy or a simple hardware failure is usually not the whole reason a system like this fails. There are usually multiple factors behind both the outage and the long recovery time. That's actually reassuring.
Sorry to go on, or if I'm saying obvious stuff. This is a topic that interests me, and it's fun doing some speculation, particularly when I'm not involved in any way.
Igel
(35,300 posts)So instead of having just one problem, they had a few places to deal with and needed to first realize they had several problem sites.