Monday 12 November 2007

In a galaxy far, far away........

I took out a 'Root Server I' package with 1and1. It was a very respectably priced package, with a server specification that I could not find matched anywhere else on the Internet. This, coupled with 'unlimited bandwidth', made it a very attractive package that was untouchable.... Little did I know that UNTOUCHABLE was exactly what my server at 1and1 became.

The server specification included a Serial Console connection, which gives the user access to the server should a serious mistake be made and you accidentally lock yourself out so that SSH will no longer accept your connections. The Serial Console connects the user directly to a console on the server that is independent of SSH and is effectively the same as working on the server locally in 'text console' mode.

This is vital for a development server, as quite often a simple mistake can prevent the user from gaining access. For example, a poorly written script could use up so much CPU time or RAM that there isn't enough left to open a new SSH connection to the server..... The Serial Console isn't affected by this, and on 90% of occasions it is possible to log on to the server and terminate the problematic script.

Three days after taking out my new server and selecting the Linux image I required, Fedora Core 6 (x64), my server just stopped responding. At first I thought this was a bit strange but didn't think much of it, and decided to log on via the Serial Console........

Oh S**t, the Serial Console didn't work either. I logged on to my 1and1 control panel, and rebooted the server.........

After looking through the logs I noticed a number of concerning things, and sent the following to 1and1:


"I suspect that there is a hardware error with a new server package that I took out with 1and1. After about 3 days of non-activity the server becomes unresponsive. It is built with Fedora Core 6 minimal and the only things added to it were Apache, MySQL and PHP.

Logging in via Serial Console gives nothing and the server has to be rebooted from the 1and1 control panel before I get telnet access.

Dmesg shows what appears to be a failing RAID:

md: md6: raid array is not clean -- starting background reconstruction
raid1: raid set md6 active with 2 out of 2 mirrors
md: considering sdb5 ...
md: adding sdb5 ...
md: sdb1 has different UUID to sdb5
md: adding sda5 ...
md: sda1 has different UUID to sdb5
md: resync of RAID array md6
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
md: using 128k window, over a total of 4891648 blocks.
md: created md5
md: bind
md: bind
md: running:
raid1: raid set md5 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md1
md: bind
md: bind
md: running:
raid1: raid set md1 active with 2 out of 2 mirrors
md: ... autorun DONE.
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 380k freed
md: Autodetecting RAID arrays.

It would also appear that there is a problem with your DHCP server as no DHCP request is acknowledged as per /var/log/messages:

Nov 11 04:07:06 s15278325 syslogd 1.4.1: restart.
Nov 11 04:44:23 s15278325 dhclient: DHCPREQUEST on eth0 to 87.106.137.249 port 67
Nov 11 04:45:07 s15278325 last message repeated 4 times
Nov 11 04:46:19 s15278325 last message repeated 5 times
Nov 11 04:47:34 s15278325 last message repeated 4 times
Nov 11 04:48:50 s15278325 last message repeated 5 times
Nov 11 04:49:55 s15278325 last message repeated 4 times
Nov 11 04:51:05 s15278325 last message repeated 4 times
Nov 11 04:52:10 s15278325 last message repeated 5 times
Nov 11 04:53:13 s15278325 last message repeated 6 times
Nov 11 04:54:16 s15278325 last message repeated 4 times
Nov 11 04:55:23 s15278325 last message repeated 7 times
Nov 11 04:56:43 s15278325 last message repeated 6 times
Nov 11 04:57:48 s15278325 last message repeated 5 times
Nov 11 04:58:51 s15278325 last message repeated 6 times
Nov 11 04:59:52 s15278325 last message repeated 4 times

This is the third time this has happened to this server and I would appreciate someone looking into it. It is not yet a current production server so if 1and1 need to carry out reboot and testing on the server to check the hardware then this will be fine as it will not affect any of our services. I finish work on Friday and by the time I log on Monday morning the server hardware is non-responsive again.

This needs to be rectified ASAP as this is a development server for a large World Wide Record Company and this will be holding up their development for their global website.

Kind regards


Netwarriors"
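
For anyone wanting to run the same checks on their own logs, here is a rough Python sketch of the two things I was looking for in the excerpts above: whether any RAID array reports fewer active mirrors than it should, and how many DHCPREQUESTs went out without a DHCPACK ever coming back. The function names are made up for illustration, and the patterns assume the exact message wording shown in my dmesg and /var/log/messages output; other kernels or syslog daemons may word things differently.

```python
import re

# Assumed format, taken from the dmesg excerpt in this post:
#   "raid1: raid set md6 active with 2 out of 2 mirrors"
RAID_RE = re.compile(r"raid set (md\d+) active with (\d+) out of (\d+) mirrors")
# Assumed format, taken from the syslog excerpt in this post:
#   "last message repeated 4 times"
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def degraded_arrays(dmesg_lines):
    """Return the names of arrays with fewer active mirrors than configured."""
    degraded = []
    for line in dmesg_lines:
        m = RAID_RE.search(line)
        if m and int(m.group(2)) < int(m.group(3)):
            degraded.append(m.group(1))
    return degraded


def count_dhcp_requests(syslog_lines):
    """Return (number of DHCPREQUEST attempts, whether a DHCPACK was seen).

    'last message repeated N times' lines are added to the total only when
    the message they repeat was a DHCPREQUEST.
    """
    total = 0
    acked = False
    last_was_request = False
    for line in syslog_lines:
        if "DHCPACK" in line:
            acked = True
            last_was_request = False
        elif "DHCPREQUEST" in line:
            total += 1
            last_was_request = True
        else:
            m = REPEAT_RE.search(line)
            if m and last_was_request:
                total += int(m.group(1))
            elif not m:
                last_was_request = False
    return total, acked


# A few lines copied from the logs in this post.
dmesg_sample = [
    "raid1: raid set md6 active with 2 out of 2 mirrors",
    "raid1: raid set md5 active with 2 out of 2 mirrors",
    "raid1: raid set md1 active with 2 out of 2 mirrors",
]
syslog_sample = [
    "Nov 11 04:07:06 s15278325 syslogd 1.4.1: restart.",
    "Nov 11 04:44:23 s15278325 dhclient: DHCPREQUEST on eth0 to 87.106.137.249 port 67",
    "Nov 11 04:45:07 s15278325 last message repeated 4 times",
    "Nov 11 04:46:19 s15278325 last message repeated 5 times",
]

print(degraded_arrays(dmesg_sample))       # [] (every array has 2 of 2 mirrors)
print(count_dhcp_requests(syslog_sample))  # (10, False)
```

On the lines above, no array comes back as degraded, so the "not clean" resync message on md6 may be the result of the forced reboot rather than a missing disk, although that is only my reading of it. The DHCP side is less ambiguous: requests keep going out and no acknowledgement is ever logged.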
