Outage 01/06/08 - 02/06/08

Hopefully, as per our recent newsletter, you’ve been keeping up with our server notifications on twitter.com/3dpixelnet, which have catalogued most of what I’m going to expand on below.

We host our 2 DNS servers, ns1.3dpixelnet.com and ns2.3dpixelnet.com, with a server company called EV1/ThePlanet in the USA. EV1 was recently bought out by ThePlanet to form one of the largest colocation and hosting providers in the world. I must stress that we only host our DNS at ThePlanet; all our actual hosting platforms are based in the UK, in Manchester. We host our DNS off-network at ThePlanet (ironically now, it seems) for redundancy reasons.

ThePlanet has 6 datacentres in the USA, all geographically separated. We initially arranged, many years ago now, to host the 2 DNS platforms in separate datacentres for obvious redundancy reasons.

Fast forward to 11.30pm on Saturday evening (31/05/08). Our NAGIOS notification system informed our staff that ns1.3dpixelnet.com had a problem. Naturally, we did all the usual remote checking to see what the issue was. ns2.3dpixelnet.com automatically picked up the DNS slack, as it should in this situation. All services were up and running at this point.

20 minutes later (11.50pm) our staff received a second NAGIOS notification: ns2.3dpixelnet.com was down as well. At this point we were relying on DNS caching at customer ISPs.

What is caching? Not to put too fine a point on it, caching is what your ISP (BTInternet, Bethere, Virgin etc.) does to save money. When a BTInternet customer visits http://3dpixel.net, for example, BT makes a lookup to ns1.3dpixelnet.com or ns2.3dpixelnet.com, which returns an IP address. BT then saves this record for approximately 24 hours. Any further visits by BTInternet users during that period use the cached record rather than querying ns1.3dpixelnet.com or ns2.3dpixelnet.com directly. This meant that sites kept working for visitors whose ISP already held a cached record. Still, ‘new’ visitors could not see sites hosted on our platforms, and we immediately contacted ThePlanet.
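For the technically curious, the caching behaviour described above can be sketched in a few lines of Python. This is a toy model only, not how any particular ISP's resolver is implemented: the class name CachingResolver, the fixed 24 hour lifetime and the injectable clock are all assumptions made for illustration.

```python
import time


class CachingResolver:
    """Toy model of an ISP resolver that keeps answers for a fixed lifetime.

    Hypothetical illustration: real resolvers honour the lifetime (TTL)
    published with each DNS record rather than one fixed value.
    """

    def __init__(self, authoritative_lookup, ttl_seconds=24 * 60 * 60,
                 clock=time.monotonic):
        self._lookup = authoritative_lookup  # stands in for asking ns1/ns2
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # name -> (address, expiry time)

    def resolve(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and cached[1] > now:
            # Still fresh: answer from the cache without asking the nameservers.
            return cached[0]
        # Missing or expired: ask the nameservers. If they are unreachable,
        # the exception propagates and the visitor cannot reach the site.
        address = self._lookup(name)
        self._cache[name] = (address, now + self._ttl)
        return address
```

In this model, a name looked up shortly before the nameservers fail keeps resolving until its cached entry expires, while a name nobody at that ISP has looked up recently fails straight away. That matches the split between returning and ‘new’ visitors described above.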

Please note, at this point all our servers in our own facility were operating correctly.

The following links speak for themselves:
http://www.allheadlinenews.com/articles/7011131199
http://tech.slashdot.org/article.pl?sid=08/06/01/1715247&from=rss

In summary, a datacentre transformer at ThePlanet’s Houston datacentre exploded (!). This forced them to shut down all power, including the backup generators. They’ve been repairing it ever since but it’s still not up. Meanwhile, we found out that the datacentre separation specified in our contract had never actually been carried out, and both servers were in this one datacentre. This is of no help to you now, but we’ll be following that up at a later date.
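One rough, outside-in sanity check for this kind of separation is to compare the network prefixes of the nameserver addresses. The sketch below is an illustration only: the function name shared_prefix_pairs and the /24 prefix length are assumptions, and the addresses in the usage note are documentation placeholders, not our real nameserver addresses. Two addresses in the same block are a warning sign, but addresses in different blocks do not prove the machines sit in different buildings.

```python
import ipaddress
from itertools import combinations


def shared_prefix_pairs(addresses, prefix_len=24):
    """Return pairs of server names whose IPv4 addresses share a network prefix.

    `addresses` maps a server name to its IPv4 address as a string.
    A shared prefix hints at shared infrastructure; it is only a heuristic.
    """
    networks = {
        name: ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for name, ip in addresses.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a] == networks[b]
    ]
```

For example, shared_prefix_pairs({"ns1": "192.0.2.10", "ns2": "192.0.2.200"}) flags the pair because both placeholder addresses fall in the same /24 block, whereas {"ns1": "192.0.2.10", "ns2": "198.51.100.7"} returns an empty list.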

We’ve been working since then to get the DNS records on ns2.3dpixelnet.com away from ThePlanet and on to a server in the UK. This, as of 9.30am 02/06/08, has been achieved.

We fully appreciate that this has had a massive effect on your service and we sincerely apologise for the outage. In hindsight, there was little more we could have done short of accusing this third-party company of lying to us in their contract. We genuinely believed, backed by a contract, that our DNS platforms were geographically separated.

Needless to say, we are moving our DNS services away from this third party as soon as technically possible. We don’t believe that a datacentre of that size should be crippled by a problem like this. After all, resilience is what we pay for and expect from a datacentre in the first place, especially one as large as ThePlanet.

As of writing, 99% of all websites are up and running, as are email and the spamgate system. Outgoing email via our SMTP service is being rebuilt, as it was heavily integrated with the USA DNS servers, and it may reject your username and password in the meantime. If this is a major issue right now, please contact support@3dpixel.net with your domain, SMTP username and password and we will add it manually. Otherwise SMTP will be coming back over the course of today.

Again, please accept our apologies for this outage.

If you feel we have missed a domain or service, please contact us at support@3dpixel.net, or visit http://irc.3dpixel.net/irc.cgi or http://deviantforums.com

Alan Ogden
3dpixel.net

Summary
Explosion at THIRD PARTY datacentre hosting 2 of our DNS servers
DNS is taken offline
DNS moved to UK systems 24 hours later
All services being restored