---
title: "</kubernetes>"
date: 2021-01-03
---
Well, since I posted [that last post](/blog/k8s-pondering-2020-12-31) I have had
an adventure. A good friend pointed out a server host that I had missed when I
was looking for other places to use, and now I have migrated my blog to this new
server. As of yesterday, I now run my website on a dedicated server in Finland.
Here is the story of my journey to migrate 6 years of cruft and technical debt
to this new server.

Let's talk about this goliath of a server. This server is an AX41 from Hetzner.
It has 64 GB of RAM, a 512 GB NVMe drive, three 2 TB drives, and a Ryzen 3600.
For all practical purposes, this beast is beyond overkill and rivals my workstation
tower in everything but the GPU power. I have named it `lufta`, which is the
word for feather in [L'ewa](https://lewa.within.website/dictionary.html).
## Assimilation
For my server setup process, the first step is to assimilate it. In this step I
get a base NixOS install on it somehow. Since I was using Hetzner, I was able to
boot into a NixOS install image using the process documented
[here](https://nixos.wiki/wiki/Install_NixOS_on_Hetzner_Online). Then I decided
that it would also be cool to have this server use
[zfs](https://en.wikipedia.org/wiki/ZFS) as its filesystem to take advantage of
its legendary subvolume and snapshotting features.

So I wrote up a bootstrap system definition like the Hetzner tutorial said and
ended up with `hosts/lufta/bootstrap.nix`:
```nix
{ pkgs, ... }:

{
  services.openssh.enable = true;
  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPg9gYKVglnO2HQodSJt4z4mNrUSUiyJQ7b+J798bwD9 cadey@shachi"
  ];

  networking.usePredictableInterfaceNames = false;
  systemd.network = {
    enable = true;
    networks."eth0".extraConfig = ''
      [Match]
      Name = eth0

      [Network]
      # Add your own assigned ipv6 subnet here!
      Address = 2a01:4f9:3a:1a1c::/64
      Gateway = fe80::1

      # optionally you can do the same for ipv4 and disable DHCP (networking.dhcpcd.enable = false;)
      Address = 135.181.162.99/26
      Gateway = 135.181.162.65
    '';
  };

  boot.supportedFilesystems = [ "zfs" ];
  environment.systemPackages = with pkgs; [ wget vim zfs ];
}
```
Then I fired up the kexec tarball and waited for the server to boot into a NixOS
live environment. A few minutes later I was in. I started formatting the drives
according to the [NixOS install
guide](https://nixos.org/manual/nixos/stable/index.html#sec-installation) with
one major difference: I added a `/boot` ext4 partition on the SSD. This allows
me to have the system root device on zfs. I added the disks to a `raidz1` pool
and created a few volumes. I also added the SSD as a log device so I get SSD
caching.
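
I didn't save the exact commands I ran, but the pool setup was something along
these lines (the device names here are placeholders; check yours with `ls -l
/dev/disk/by-id`):

```console
# Hypothetical device names: raidz1 over the three spinning disks,
# with a partition of the NVMe drive as the log device.
$ zpool create -O mountpoint=none rpool \
    raidz1 /dev/disk/by-id/ata-disk0 /dev/disk/by-id/ata-disk1 /dev/disk/by-id/ata-disk2 \
    log /dev/disk/by-id/nvme-ssd-part3
$ zfs create -o mountpoint=legacy rpool/root
$ zfs create -o mountpoint=legacy rpool/home
```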
From there I installed NixOS as normal and rebooted the server. It booted
normally. I had a shiny new NixOS server in the cloud! I noticed that the server
had booted into NixOS unstable as opposed to NixOS 20.09 like my other nodes. I
thought "ah, well, that probably isn't a problem" and continued to the
configuration step.

[That's ominous...](conversation://Mara/hmm)
## Configuration
Now that the server was assimilated and I could SSH into it, the next step was
to configure it to run my services. While I was waiting for Hetzner to provision
my server, I ported a bunch of my services over to Nixops services [à la this
post](/blog/nixops-services-2020-11-09) in [this
folder](https://github.com/Xe/nixos-configs/tree/master/common/services) of my
configs repo.

Now that I had them, it was time to add this server to my Nixops setup. So I
opened the [nixops definition
folder](https://github.com/Xe/nixos-configs/tree/master/nixops/hexagone) and
added the metadata for `lufta`. Then I added it to my Nixops deployment with
this command:
```console
$ nixops modify -d hexagone -n hexagone *.nix
```
Then I copied over the autogenerated config from `lufta`'s `/etc/nixos/` folder
into
[`hosts/lufta`](https://github.com/Xe/nixos-configs/tree/master/hosts/lufta) and
ran a `nixops deploy` to add some other base configuration.
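
For reference, a minimal NixOps network entry for a host looks roughly like this
(the target host here is a placeholder, not `lufta`'s real address):

```nix
# Sketch of a NixOps deployment definition; the hostname is hypothetical.
{
  network.description = "hexagone";

  lufta = { config, pkgs, ... }: {
    deployment.targetHost = "lufta.example.com";
    imports = [ ../../hosts/lufta ];
  };
}
```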
## Migration
Once that was done, I started enabling my services and pushing configs to test
them. After I got to a point where I thought things would work I opened up the
Kubernetes console and started deleting deployments on my kubernetes cluster as
I felt "safe" to migrate them over. Then I saw the deployments come back. I
deleted them again and they came back again.

Oh, right. I enabled that one Kubernetes service that made it intentionally hard
to delete deployments. One clever set of scale-downs and kills later and I was
able to kill things with wild abandon.
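
If you find yourself in the same spot, the dance looks something like this (the
controller name and namespace are hypothetical; yours will differ):

```console
# First, stop whatever keeps re-creating the deployments.
$ kubectl --namespace operators scale deployment/gitops-operator --replicas=0
# Now deletes actually stick.
$ kubectl delete deployment/gitea
```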
I copied over the gitea data with `rsync` running in the kubernetes deployment.
Then I killed the gitea deployment, updated DNS and reran a whole bunch of gitea
jobs to resanify the environment. I did a test clone on a few of my repos and
then I deleted the gitea volume from DigitalOcean.
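
If you don't want to set up rsync inside a pod for a copy like this, streaming a
tarball over `kubectl exec` also works. This is a sketch; the deployment name
and paths are assumptions:

```console
# Stream the data directory out of the running pod onto the new server.
$ kubectl exec deploy/gitea -- tar cf - -C /data . \
    | ssh root@lufta 'tar xf - -C /var/lib/gitea'
```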
Moving over the other deployments from Kubernetes into NixOS services was
somewhat easy; however, I did need to repackage a bunch of my programs and static
sites for NixOS. I made the
[`pkgs`](https://github.com/Xe/nixos-configs/tree/master/pkgs) tree a bit more
fleshed out to compensate.

[Okay, packaging static sites in NixOS is beyond overkill, however a lot of them
need some annoyingly complicated build steps and throwing it all into Nix means
that we can make them reproducible and use one build system to rule them
all. Not to mention that when I need to upgrade the system, everything will
rebuild with new system libraries to avoid the <a
href="https://blog.tidelift.com/bit-rot-the-silent-killer">Docker bitrot
problem</a>.](conversation://Mara/hacker)
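
Most of those static site packages boil down to a tiny derivation. Here is a
sketch with made-up names:

```nix
# Hypothetical static site package: no build step, just copy files into $out.
{ stdenv }:

stdenv.mkDerivation {
  pname = "mysite";
  version = "0.1.0";
  src = ./site;

  installPhase = ''
    mkdir -p $out
    cp -r . $out
  '';
}
```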
## Reboot Test
After a significant portion of the services were moved over, I decided it was
time to do the reboot test. I ran the `reboot` command and then...nothing.

My continuous ping test was timing out. My phone was blowing up with downtime
messages from NodePing. Yep, I messed something up.

I was able to boot the server back into a NixOS recovery environment using the
kexec trick, and from there I was able to prove the following:
- The zfs setup is healthy
- I can read some of the data I migrated over
- I can unmount and remount the ZFS volumes repeatedly
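
That checking was a handful of commands along these lines (the pool and dataset
names are assumptions):

```console
$ zpool import -f -R /mnt rpool   # force-import the pool into the live env
$ zpool status rpool              # all vdevs ONLINE, no errors
$ mount -t zfs rpool/root /mnt && ls /mnt
$ umount /mnt && zpool export rpool
```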
I was confused. This shouldn't be happening. After half an hour of
troubleshooting, I gave in and ordered an IPKVM to be installed in my server.

Once that was set up (and I managed to trick macOS into letting me boot a .jnlp
web start file), I rebooted the server so I could see what error I was getting
on boot. I missed it the first time around, but on the second time I was able to
capture this screenshot:

![The error I was looking
for](https://cdn.christine.website/file/christine-static/blog/Screen+Shot+2021-01-03+at+1.13.05+AM.png)

Then it hit me. I did the install on NixOS unstable. My other servers use NixOS
20.09. The deploy had downgraded ZFS, and the older version of ZFS couldn't
mount a volume created by the newer version in read/write mode. One more trip to
the recovery environment later, I installed NixOS unstable in a new generation.

Then I switched my tower's default NixOS channel to the unstable channel and ran
`nixops deploy` to reactivate my services. After the NodePing uptime
notifications came in, I ran the reboot test again while looking at the console
output to be sure.
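
The channel flip on my tower boils down to something like this:

```console
$ sudo nix-channel --add https://nixos.org/channels/nixos-unstable nixos
$ sudo nix-channel --update
$ nixops deploy -d hexagone
```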
It booted. It worked. I had a stable setup. Then I reconnected to IRC and passed
out.
## Services Migrated
Here is a list of all of the services I have migrated over from my old dedicated
server, my kubernetes cluster and my dokku server:
- aerial -> discord chatbot
- goproxy -> go modules proxy
- lewa -> https://lewa.within.website
- hlang -> https://h.christine.website
- mi -> https://mi.within.website
- printerfacts -> https://printerfacts.cetacean.club
- xesite -> https://christine.website
- graphviz -> https://graphviz.christine.website
- idp -> https://idp.christine.website
- oragono -> ircs://irc.within.website:6697/
- tron -> discord bot
- withinbot -> discord bot
- withinwebsite -> https://within.website
- gitea -> https://tulpa.dev
- other static sites

Doing this migration is a bit of an archaeology project as well. I kept
discovering services that I had littered across my machines with very poorly
documented requirements and configuration. I hope that this move will make the
next migration of this kind a lot easier.

I still have a few other services to move over, however the ones that are left
are much more annoying to set up properly. This migration lets me deprovision 5
servers, gives me this stupidly powerful goliath of a server to do whatever I
want with, and cuts my monthly server costs by over half.

I am very close to being able to turn off the Kubernetes cluster and use NixOS
for everything. A few services that are still on the Kubernetes cluster are
resistant to being nixified, so I may have to keep running those as Docker
containers. I was hoping to be able to cut out Docker entirely, however we don't
seem to be that lucky yet.
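
If it comes to that, NixOS can at least manage those containers declaratively.
A sketch, with a made-up image name:

```nix
# Hypothetical: run a leftover Docker image as a systemd-managed container.
{ ... }:

{
  virtualisation.oci-containers.containers."legacy-app" = {
    image = "registry.example.com/legacy-app:latest";
    ports = [ "127.0.0.1:5000:5000" ];
  };
}
```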
Sure, there is some added latency with the server being in Europe instead of
Montreal, however if this ever becomes a practical issue I can always launch a
cheap DigitalOcean VPS in Toronto to act as a DNS server for my WireGuard setup.

Either way, I am now off Kubernetes for my highest traffic services. If services
of mine need to use the disk, they can now just use the disk. If I really care
about the data, I can add the service folders to the list of paths to back up to
`rsync.net` (I have a post about how this backup process works in the drafting
stage) via [borgbackup](https://www.borgbackup.org/).
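
As a teaser, a borg backup to rsync.net is about this much ceremony (the account
name and paths are placeholders):

```console
$ borg init --encryption=repokey ssh://user@user.rsync.net/./backups
$ borg create --stats ssh://user@user.rsync.net/./backups::lufta-{now} \
    /var/lib/gitea /var/lib/private
```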
Let's hope it stays online!

---
Many thanks to [Graham Christensen](https://twitter.com/grhmc), [Dave
Anderson](https://twitter.com/dave_universetf) and everyone else who has been
helping me along this journey. I would be lost without them.