oxygen report

At 14.31 our Nagios reporting systems alerted us that oxygen.3dpixelnet.com was not responding to HTTP or email requests.

As we were unable to gain remote access (SSH was down), our engineer was dispatched to the datacentre and arrived at 14.58.

By 15.45 we had diagnosed a failed memory module. We replaced all the memory modules and by 16.00 the server was back online, but services were still failing, denying access to customers’ sites.

It was then a case of finding out why the services were not coming back online. We eventually traced the problem to some obscure but critical libraries in both the /usr/lib and /lib directories that had bloated in size (due to the memory corruption).

We restored individual library files from backups that had been taken a day earlier (19/08/08) as part of our standard backup procedure. We also had to chase down several bugs along the way, as library files are heavily symlinked across multiple directories on the server. The restored files had to be copied to CD and carted across to oxygen.3dpixelnet.com manually as, with this corruption, SSH and rsync services were also down.
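For anyone facing something similar, here is a rough sketch of how bloated library files might be spotted by comparing a live directory against a backup copy. This is not the procedure we followed on the day, just an illustration: the function name find_size_mismatches and the example paths are made up, and it assumes the backup is available as a plain directory tree. It only compares file sizes, so corruption that leaves the size unchanged would need a checksum comparison instead.

```python
import os


def find_size_mismatches(live_root, backup_root):
    """Return (relative path, live size, backup size) for every regular file
    in the backup whose size differs on the live system. A live size of None
    means the file is missing or is not a regular file there.
    Hypothetical helper, for illustration only."""
    mismatches = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        for name in filenames:
            backup_path = os.path.join(dirpath, name)
            # Skip symlinks: only the real files they point at are compared.
            if os.path.islink(backup_path):
                continue
            rel = os.path.relpath(backup_path, backup_root)
            live_path = os.path.join(live_root, rel)
            backup_size = os.path.getsize(backup_path)
            if os.path.islink(live_path) or not os.path.isfile(live_path):
                live_size = None
            else:
                live_size = os.path.getsize(live_path)
            if live_size != backup_size:
                mismatches.append((rel, live_size, backup_size))
    return sorted(mismatches)
```

Called as find_size_mismatches("/usr/lib", "/mnt/backup/usr/lib") (the backup path is an assumption), it returns a list of files worth restoring or inspecting first.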

At 20.23 we rebooted the server for a final check and all services were online several minutes after that.

As a side note, this was our only server that did not use Corsair memory. So folks, buy Corsair: it’s not let us down yet :)