APIC crash hang kernel logs network Ubuntu

Random crash of Backup server

I have been trying to nail down an issue with my backup server. I thought I had solved it with the noapic boot option, because the server worked fine after that change and booted up reliably each time.

Until yesterday, when it did something it had done before: the NIC seems to turn off all by itself and then the OS more or less hangs. Before this point the server had sent out its e-mails and even started a couple of backups. I’m not able to log in at the console, so the only option I have is to power off and boot again.

Checking through the logs reveals nothing. dmesg, syslog and messages show no panic, nothing. All I can see is that after some time BackupPC can no longer ping machines and then BackupPC soon stops – perhaps because the server is now hung. Pinging the backup server does not work either, so the server really is locked up.
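When the logs show nothing at all, one way to at least pin down when the machine froze is a heartbeat file. This is a minimal sketch, not something I ran on this server: it assumes Python 3 is available, and the function names and the file path are made up for illustration. It appends a timestamp at a fixed interval and forces each line to disk, so the last line after a hard power-off is roughly the moment of the hang.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path):
    """Append one UTC timestamp to path and force it onto the disk."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # do not leave the line sitting in a cache
    return stamp


def last_heartbeat(path):
    """Return the last timestamp written before the machine stopped."""
    with open(path) as fh:
        lines = fh.read().splitlines()
    return lines[-1] if lines else None


def run(path, interval=60):
    """Write a heartbeat forever; meant to be started as a background job."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

After the next lock-up, calling last_heartbeat() on the file (the path is whatever you passed to run()) gives a time to compare against the last BackupPC ping failure.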

It is very annoying, to say the least. The previous server was rock-solid in this regard. It was extremely slow, but at least it booted and stayed up. This may be because it had a more modern BIOS than the current unit, which makes me think I will now have to hunt down an updated BIOS.

This really is the first Ubuntu/Linux unreliability I have had in over four (4) years of using Linux.

If anyone has a suggestion for where to begin, please don’t hesitate to comment. I am running Ubuntu 10.04 LTS server edition. Its only purpose is to run BackupPC; the server is woken by WOL each night to start the backup and then shuts down in the early morning when all backups are done.

edit: Just to let you know that this appeared to be a hardware issue and I have switched everything over to the original backup machine, which is much slower but at least works. I still need to work out what the problem is.

crash dhcp ext4 Linux oracle servers

How not to set up your servers

Firstly, happy holidays everyone. This will most likely be my last post of the year. And what a year it has been. The one-horse UK economy standing still and the UK consumer being ripped off so a select group can still get their multi-million dollar bonuses.

But enough of politics. This last month I’ve had a couple of interesting experiences with my Linux servers, both occurring because of overnight, unsupervised reboots. What, Linux needs rebooting, you say? Well, yes, if you intend to keep your servers at the latest patch level. Usually this is about once a month. If you don’t bother then you could keep a Linux server running indefinitely. They are super reliable.

I have a small number of Linux servers; this article is about two of them. One is the DHCP, fetchmail and dovecot server for the network – cham01. The other is the NAS and LDAP server for the network – cham02. Maybe a bad combination of services? Through the series of events below, it seems this may be the case.

I had scheduled a reboot of cham02 because a new kernel patch had been downloaded. To avoid disrupting the users – this is where all the e-mail is stored, along with a lot of Oracle and MySQL databases – I scheduled the reboot at night. When I came in the following morning I discovered the server had not come back up because the file system had some corruption on it. The fsck at boot failed and the server was in a maintenance shell. So those reports of ext4 having some bugs weren’t false at all!

As a result of this the LDAP server and the NFS exports were not operating. The cham01 server could not deliver any e-mail because it could not authenticate any users nor match e-mail addresses to accounts. The interesting thing was that I could not ssh into cham01, nor could I log in at the console with the local admin account. It needed to be rebooted, but only after cham02 was rebooted, to ensure that the NFS mount for e-mail was available. Bad design number one? Or do I just need better control of the machine and services?
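One way to get that better control is to have cham01 check that cham02 is actually answering before it starts anything that depends on it. The following is only a sketch under assumptions: it uses Python 3, the helper names are made up, and it assumes cham02 serves NFS on the standard port 2049 and LDAP on the standard port 389. All it does is try a TCP connection until it succeeds or gives up.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def wait_for_service(host, port, attempts=30, delay=10.0):
    """Poll host:port up to `attempts` times; True once it answers."""
    for attempt in range(attempts):
        if port_is_open(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause only between attempts
    return False
```

A start-up script on cham01 could then call wait_for_service("cham02", 2049) and wait_for_service("cham02", 389), and only mount the mail store and start mail delivery once both return True.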

So I’ve now decided that no more unsupervised reboots will take place. The next reboot was a precautionary one, to make sure the RAID and all services were operating properly. So I shut down cham01, then shut down cham02, and brought cham02 up first. Bad move: cham01 is the DHCP server, so cham02 didn’t get assigned an IP address and none of its networking services were operating. I had to bring cham02 up, shut down the e-mail server, then restart networking on cham02. Is this really the best way to set up servers? I don’t think essential services like e-mail should be on a DHCP server such that you encounter this dependency loop.
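The loop is easy to see once the dependencies are written down as a graph. Here is a small sketch, assuming Python 3.9 or later for the standard graphlib module; the dictionaries are my simplified reading of the setup described above, not an exact inventory of services. A topological sort either gives a safe boot order or reports the cycle.

```python
from graphlib import CycleError, TopologicalSorter


def boot_order(depends_on):
    """Return a safe start order, or raise CycleError if none exists."""
    return list(TopologicalSorter(depends_on).static_order())


# Each server maps to the set of servers it needs running first.
current = {
    "cham01": {"cham02"},  # mail delivery needs NFS and LDAP on cham02
    "cham02": {"cham01"},  # cham02 gets its address from DHCP on cham01
}

# Simplified assumption: with mail moved off cham01, it needs nothing first.
planned = {
    "cham01": set(),
    "cham02": {"cham01"},
}

for name, layout in (("current", current), ("planned", planned)):
    try:
        print(name, "boot order:", boot_order(layout))
    except CycleError as err:
        # The second element of args holds the nodes that form the loop.
        print(name, "has a dependency loop:", err.args[1])
```

With the current layout there is no valid order at all, which is exactly what I ran into. Once cham01 no longer depends on cham02, the sort returns cham01 first and cham02 second.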

I’m now going to move the e-mail and dovecot to cham02 and keep cham01 as DHCP, DNS and backup server only. It won’t permanently mount NFS drives. Hopefully that takes one of the dependency links away. Is there any way, though, to have a fallback IP set up for when Linux can’t find a DHCP service? That would have helped.

All for now. I’ve started a new blog focusing specifically on my work as an Oracle consultant and integrator. I’m learning the ins and outs of Salesforce now as one of the platforms. Enjoy!