Double server trouble

Posted: Mon Nov 12, 2012 5:59 pm
by seraph0x
Yesterday our chat server had an unusual outage. Amazon EC2 was reporting hardware issues and it took several messages back and forth between their support and us to fully resolve the issue by moving our chat server onto new hardware.

In the meantime, halfway around the globe in the Netherlands, our primary hosting facility was experiencing network issues. Both problems seem to be resolved as of now. Sorry for the inconvenience.

Re: Double server trouble

Posted: Mon Nov 12, 2012 6:12 pm
by sissy jasmine
*does a little dance, hugs Seraph, then runs into chat*

Re: Double server trouble

Posted: Mon Nov 12, 2012 8:50 pm
by DoxysTurtle
Thanks, Seraph0x, for all your hard work and for keeping us informed!

Re: Double server trouble

Posted: Mon Nov 12, 2012 9:52 pm
by SexualChoc
:thankyou:


seraph0x

I do appreciate all the work you do

Re: Double server trouble

Posted: Tue Nov 13, 2012 8:07 am
by Recacha
Great job Seraph0x!

I didn't know the epicentre was the Netherlands. Awesome! :closedeyes:

Re: Double server trouble

Posted: Tue Nov 13, 2012 2:16 pm
by seraph0x
Here is some info from our provider about what caused the outage. (Warning: Geek talk ahead.)
Yesterday we had a network outage from 05:03 CET until 10:22 CET. The cause of this downtime was a customer who was sending out a multicast broadcast. Although we have several features enabled on our switches to prevent loops, they did not stop this broadcast from reaching the core of the network and also causing problems at our upstream provider.

At the core, a protocol (VRRP) is used to provide the redundant gateway. This protocol relies on multicast packets being able to reach the master/slave router. Because of the broadcast storm, these packets were not being received, which resulted in a constant link flap at the core of the network. These link flaps resulted in a complete network outage.

The server causing these problems was removed from the network at 10:00 CET. After both we and the upstream provider had confirmed the network was stable, they enabled our uplinks, after which all equipment came back online.

We are very sorry for any inconvenience caused by this issue and have opened a case at Brocade technical support to see which features we can implement to prevent a broadcast storm in the future.

Because of this downtime we have automatically applied an SLA credit for all our customers.
The SLA credit is 40%, which converts to 12 days being added to your server(s) expiry date.
They are a very small provider, so they do run into issues like this occasionally, but they always care and their customer support is among the best I've ever experienced, so I'm willing to look past a little bit of downtime. :-)
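
For anyone curious about the VRRP part of the explanation above, here is a rough sketch of why lost advertisement packets can make the gateway flap. It assumes the standard VRRPv2 timers from RFC 3768 (a backup waits 3 advertisement intervals plus a small skew before taking over), a 1-second advertisement interval, and a backup priority of 100. I don't know the provider's actual settings, and the function names are made up for illustration, so treat it as a toy model rather than what their routers really do.

```python
def skew_time(priority: int) -> float:
    # RFC 3768 (VRRPv2): Skew_Time = (256 - Priority) / 256 seconds
    return (256 - priority) / 256


def master_down_interval(priority: int, advert_interval: float = 1.0) -> float:
    # RFC 3768: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    return 3 * advert_interval + skew_time(priority)


def count_takeovers(received, priority: int = 100, advert_interval: float = 1.0) -> int:
    """Count how often a backup router promotes itself to master.

    `received` holds one boolean per advertisement interval: True if the
    master's advertisement reached the backup, False if it was lost.
    Simplification (an assumption of this sketch): the backup steps down
    as soon as it hears the original master again.
    """
    timeout = master_down_interval(priority, advert_interval)
    silent_for = 0.0      # seconds since the last advertisement was heard
    is_master = False
    takeovers = 0
    for arrived in received:
        if arrived:
            silent_for = 0.0
            is_master = False
        else:
            silent_for += advert_interval
            if not is_master and silent_for >= timeout:
                is_master = True
                takeovers += 1
    return takeovers


# Healthy network: every advertisement arrives, so the backup never takes over.
healthy = [True] * 12
# Storm-like pattern: bursts of 4 lost advertisements, then one gets through.
storm = ([False] * 4 + [True]) * 3

print(master_down_interval(100))
print(count_takeovers(healthy), count_takeovers(storm))
```

In this toy model the occasional dropped packet is harmless, but each burst of loss longer than the timeout causes another takeover followed by a step-down, which is the kind of back-and-forth the provider describes as a constant flap.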

Re: Double server trouble

Posted: Tue Nov 13, 2012 2:59 pm
by marspank
*wonders if it is wrong that the only thing he remembers from the explanation is the use of master/slave*

Re: Double server trouble

Posted: Tue Dec 04, 2012 1:15 pm
by Snoopy76
*Giggles at mars' reply*

Snoopy x