About two weeks ago we had some planned maintenance to move our servers from one rack to another. Let's just say that this did not go as planned. At all. And here is the story...
Part One - Moving
Our hosting company asked us to move from one part of the data centre to another. We were in need of a little upgrade anyway, so we were all for it. Instead of moving our servers one by one, we thought it would be easiest to shut them all down, move them over and bring them back up again.
Our setup is not very complicated: two 1U servers with a 4 Gbit/s bond between them - which made it easier to move them both at the same time.
Part Two - Upgrade!
We also wanted to upgrade our system a little, because the project is growing and we need more resources. The first thing we did was add a dedicated firewall. I am always preaching to everyone not to run a virtualised firewall, but that is exactly what we had done for years - for financial reasons, of course. Security and finances do not mix well, however.
We added a Lightning Wire Labs IPFire Business Appliance, which is perfect for our needs: very little power consumption, but powerful enough for the job. We did not need 10G here.
Setting this one up was easy and done in about 10 minutes. We just restored a backup from the previous virtual firewall, rebooted, and we were online!
Part Three - Disaster
And that is pretty much the only thing that worked that day. From here on, this is almost a comical story - if there were slapstick films set in data centres, this would be the plot.
Our servers were running oVirt, a piece of virtualisation software from Red Hat that looked pretty good when we set it up around 2015. However, it was buggy, slow and caused us loads and loads of problems over time. We were at a point where a reboot of the servers was really dangerous, because you always had to pray that everything came back up. Underneath, we used GlusterFS for storage replication - another piece of software that I personally do not trust any more.
And so it came as we feared: oVirt did not launch correctly after we restarted one of the servers. Not to bore you with all the details, but the engine (the machine that manages the whole cluster) did not want to start and failed to connect to the nodes, so nothing worked.
One part of the migration was to replace oVirt. This was sort of the right time to do it, but only after we had completed the physical move. We were going to use Proxmox in the future, which had been warmly recommended to me. We were prepared to reinstall one of the two servers there and then and to transfer our virtual machines over. But we were not even lucky with that: the USB stick we had brought with a fresh Debian Buster image was not even found by the server's firmware. Great! Nothing, not even little things like this, worked.
At this time, it was almost midnight and time was really running out. After finding another creative way to install the server with Debian, we were finally up and running and started migrating the virtual machines...
Part Four - "At least we are moving"
It was a kind of straightforward thing to just copy over the virtual machine images and then launch them again. It would take some time, because we have a couple of terabytes of data, but at least things were going somewhere.
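For the curious, the plan looked roughly like the following sketch. It is not our exact procedure - the host name, VM IDs, paths and storage names are made up - and it assumes the images are already in a format Proxmox understands, such as qcow2:

```python
#!/usr/bin/env python3
# Rough sketch of the intended migration: pull each disk image off the old
# host and attach it to a freshly created VM on Proxmox. All names, IDs and
# paths below are placeholders, not our real setup.
import subprocess

OLD_HOST = "old-node.example.org"              # hypothetical name of the old server
IMAGES = {101: "web.qcow2", 102: "mail.qcow2"}  # hypothetical VM IDs and images

for vmid, image in IMAGES.items():
    # Copy the image over the network - this is the part that takes hours
    # with a couple of terabytes of data
    subprocess.run(["rsync", "-aP", f"{OLD_HOST}:/images/{image}", "/var/tmp/"],
                   check=True)

    # Create an empty VM and import the disk into local storage
    name = image.rsplit(".", 1)[0]
    subprocess.run(["qm", "create", str(vmid), "--name", name,
                    "--memory", "4096", "--net0", "virtio,bridge=vmbr0"],
                   check=True)
    subprocess.run(["qm", "importdisk", str(vmid), f"/var/tmp/{image}", "local-lvm"],
                   check=True)
    # The imported disk still has to be attached to the VM and set as the
    # boot device afterwards (qm set ...), which is left out here.
```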
But it didn't work out like this. Some of the machines did not want to come up. They were corrupted - unfortunately not during the transfer from one server to the other, but somehow already on the old server.
At that point I was not really interested in finding out why and how exactly. It might have been oVirt. It might have been GlusterFS. It doesn't really matter in the end.
Part Five - Re-install all the things
How do you repair this? Some machines had a couple of broken blocks which could be repaired with a simple filesystem check. Some others were okay but showed old data. Others again were just scrambled eggs.
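For the machines in the first category, the repair was essentially a filesystem check run against the image from the host. Something along these lines - the image path is a placeholder, and it assumes a qcow2 image with the filesystem on the first partition:

```python
#!/usr/bin/env python3
# Sketch of checking a guest's filesystem from the host: expose the image
# as a block device with qemu-nbd and run fsck on it.
import subprocess

IMAGE = "/var/lib/vz/images/101/vm-101-disk-0.qcow2"   # hypothetical image path
NBD = "/dev/nbd0"

subprocess.run(["modprobe", "nbd", "max_part=16"], check=True)
subprocess.run(["qemu-nbd", "--connect", NBD, IMAGE], check=True)
try:
    # -f forces a check even if the filesystem claims to be clean,
    # -y answers "yes" to all repair questions. fsck returns a non-zero
    # exit code when it fixed something, so do not treat that as an error.
    subprocess.run(["fsck", "-fy", f"{NBD}p1"], check=False)
finally:
    subprocess.run(["qemu-nbd", "--disconnect", NBD], check=True)
```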
It would have been dangerous to continue like this. We would have found compromised data months or years later, and of course we cannot afford that. So I made the decision to re-install all the machines that were broken and restore them from the off-site backup.
We were also based on CentOS 7, something that used to work very reliably for us - up to the point where it just became older and older and older. Right now, you have to add a bunch of third-party repositories to get a recent version of Postfix, Apache and whatnot. This became high-maintenance, and every time updates were installed, something else broke. This is not really what I expected from an "enterprise" distribution, and so some time ago we decided that we needed to look around for something else. We did some trials with Debian Buster earlier this year, which is a totally different experience: things are more sane, solid and just straightforward. This is more like what we need.
So I spent the weekend after all of this, with very little sleep, installing one machine after the other. Some of the broken data was resynced from our off-site backup, so nothing was lost, but it took a while to get all the data back again.
Part Six - Cleaning up
So this whole situation has a little bit of a silver lining: we have a brand-new setup now, which removes a couple of the problems of the old one. It is not fully done yet, but we are almost there.
This of course meant that it took a little bit longer to bring everything back up again, but I think that was a price worth paying. Ultimately, we want to develop a firewall distribution here - one which I have not been working on in weeks now, and that is not really nice. Our infrastructure has been costing us a lot of time, but that will now be reduced. We have a good basis to introduce new features to the infrastructure, which should be a lot easier to roll out, and I am looking forward to this very much!
One of them is our new wiki, which we had to launch. More on that coming soon...