title: Encrypted Secrets with NixOS
date: 2021-01-20
series: nixos
tags: age, ed25519

Encrypted Secrets with NixOS

One of the best things about NixOS is the fact that it's so easy to do configuration management using it. The Nix store (where all your packages live) has a huge flaw for secret management though: everything in the Nix store is globally readable. This means that anyone logged into or running code on the system could read any secret in the Nix store without any limits. This is sub-optimal if your goal is to keep secret values secret. There have been a few approaches to this over the years, but I want to describe how I'm doing it. Here are my goals and implementation for this setup and how a few other secret management strategies don't quite pan out.

At a high level I have these goals:

  • It should be trivial to declare new secrets
  • Secrets should never be globally readable in any useful form
  • If I restart the machine, I should not need to take manual human action to ensure all of the services come back online
  • GPG should be avoided at all costs

As a side goal, being able to roll back secret changes would also be nice.

The two biggest tools that offer a way to help with secret management on NixOS that come to mind are NixOps and Morph.

NixOps is a tool that helps administrators operate NixOS across multiple servers at once. I use NixOps extensively in my own setup. It calls deployment secrets "keys" and they are documented in the NixOps manual. At a high level they are declared like this:

deployment.keys.example = {
  text = "this is a super sekrit value :)";
  user = "example";
  group = "keys";
  permissions = "0400";
};

This will create a new secret in /run/keys that will contain our super secret value.

Wait, isn't /run an ephemeral filesystem? What happens when the system reboots?

Let's make an example system and find out! So let's say we have that example secret from earlier and want to use it in a job. The job definition could look something like this:

# create a service-specific user
users.users.example.isSystemUser = true;

# without this group the secret can't be read
users.users.example.extraGroups = [ "keys" ]; 

systemd.services.example = {
  wantedBy = [ "multi-user.target" ];
  after = [ "example-key.service" ];
  wants = [ "example-key.service" ];
  
  serviceConfig.User = "example";
  serviceConfig.Type = "oneshot";
  
  script = ''
    stat /run/keys/example
  '';
};

This creates a user called example and gives it permission to read deployment keys. It also creates a systemd service called example.service that runs stat(1) as our example user to show the ownership and permissions of the key file. To avoid systemd thinking our service failed when the script exits, we're also going to mark it as a oneshot.

With all of that deployed, let's see what systemctl has to report:

$ nixops ssh -d blog-example pa -- systemctl status example
● example.service
     Loaded: loaded (/nix/store/j4a8f6mnaw3v4sz7dqlnz95psh72xglw-unit-example.service/example.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2021-01-20 20:53:54 UTC; 37s ago
    Process: 2230 ExecStart=/nix/store/1yg89z4dsdp1axacqk07iq5jqv58q169-unit-script-example-start/bin/example-start (code=exited, status=0/SUCCESS)
   Main PID: 2230 (code=exited, status=0/SUCCESS)
         IP: 0B in, 0B out
        CPU: 3ms

Jan 20 20:53:54 pa example-start[2235]:   File: /run/keys/example
Jan 20 20:53:54 pa example-start[2235]:   Size: 31                Blocks: 8          IO Block: 4096   regular file
Jan 20 20:53:54 pa example-start[2235]: Device: 18h/24d        Inode: 37428       Links: 1
Jan 20 20:53:54 pa example-start[2235]: Access: (0400/-r--------)  Uid: (  998/ example)   Gid: (   96/    keys)
Jan 20 20:53:54 pa example-start[2235]: Access: 2021-01-20 20:53:54.010554201 +0000
Jan 20 20:53:54 pa example-start[2235]: Modify: 2021-01-20 20:53:54.010554201 +0000
Jan 20 20:53:54 pa example-start[2235]: Change: 2021-01-20 20:53:54.398103181 +0000
Jan 20 20:53:54 pa example-start[2235]:  Birth: -
Jan 20 20:53:54 pa systemd[1]: example.service: Succeeded.
Jan 20 20:53:54 pa systemd[1]: Finished example.service.

So what happens when we reboot? I'll force a reboot in my hypervisor and we'll find out:

$ nixops ssh -d blog-example pa -- systemctl status example
● example.service
     Loaded: loaded (/nix/store/j4a8f6mnaw3v4sz7dqlnz95psh72xglw-unit-example.service/example.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

The service is inactive. Let's see what the status of example-key.service is:

$ nixops ssh -d blog-example pa -- systemctl status example-key
● example-key.service
     Loaded: loaded (/nix/store/ikqn64cjq8pspkf3ma1jmx8qzpyrckpb-unit-example-key.service/example-key.service; linked; vendor preset: enabled)
     Active: activating (start-pre) since Wed 2021-01-20 20:56:05 UTC; 3min 1s ago
Cntrl PID: 610 (example-key-pre)
         IP: 0B in, 0B out
         IO: 116.0K read, 0B written
      Tasks: 4 (limit: 2374)
     Memory: 1.6M
        CPU: 3ms
     CGroup: /system.slice/example-key.service
             ├─610 /nix/store/kl6lr3czkbnr6m5crcy8ffwfzbj8a22i-bash-4.4-p23/bin/bash -e /nix/store/awx1zrics3cal8kd9c5d05xzp5ikazlk-unit-script-example-key-pre-start/bin/example-key-pre-start
             ├─619 /nix/store/kl6lr3czkbnr6m5crcy8ffwfzbj8a22i-bash-4.4-p23/bin/bash -e /nix/store/awx1zrics3cal8kd9c5d05xzp5ikazlk-unit-script-example-key-pre-start/bin/example-key-pre-start
             ├─620 /nix/store/kl6lr3czkbnr6m5crcy8ffwfzbj8a22i-bash-4.4-p23/bin/bash -e /nix/store/awx1zrics3cal8kd9c5d05xzp5ikazlk-unit-script-example-key-pre-start/bin/example-key-pre-start
             └─621 inotifywait -qm --format %f -e create,move /run/keys

Jan 20 20:56:05 pa systemd[1]: Starting example-key.service...

The service is blocked waiting for the keys to exist. We have to populate the keys with nixops send-keys:

$ nixops send-keys -d blog-example
pa> uploading key example...

Now when we check on example.service, we get the following:

$ nixops ssh -d blog-example pa -- systemctl status example
● example.service
     Loaded: loaded (/nix/store/j4a8f6mnaw3v4sz7dqlnz95psh72xglw-unit-example.service/example.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2021-01-20 21:00:24 UTC; 32s ago
    Process: 954 ExecStart=/nix/store/1yg89z4dsdp1axacqk07iq5jqv58q169-unit-script-example-start/bin/example-start (code=exited, status=0/SUCCESS)
   Main PID: 954 (code=exited, status=0/SUCCESS)
         IP: 0B in, 0B out
        CPU: 3ms

Jan 20 21:00:24 pa example-start[957]:   File: /run/keys/example
Jan 20 21:00:24 pa example-start[957]:   Size: 31                Blocks: 8          IO Block: 4096   regular file
Jan 20 21:00:24 pa example-start[957]: Device: 18h/24d        Inode: 27774       Links: 1
Jan 20 21:00:24 pa example-start[957]: Access: (0400/-r--------)  Uid: (  998/ example)   Gid: (   96/    keys)
Jan 20 21:00:24 pa example-start[957]: Access: 2021-01-20 21:00:24.588494730 +0000
Jan 20 21:00:24 pa example-start[957]: Modify: 2021-01-20 21:00:24.588494730 +0000
Jan 20 21:00:24 pa example-start[957]: Change: 2021-01-20 21:00:24.606495751 +0000
Jan 20 21:00:24 pa example-start[957]:  Birth: -
Jan 20 21:00:24 pa systemd[1]: example.service: Succeeded.
Jan 20 21:00:24 pa systemd[1]: Finished example.service.

This means that NixOps secrets require manual human intervention in order to repopulate them on server boot. If your server went offline overnight due to an unexpected issue, your services using those keys could be stuck offline until morning. This is undesirable for a number of reasons. This plus the requirement for the keys group (which at time of writing was undocumented) to be added to service user accounts means that while they do work, they are not very ergonomic.

You can read secrets from files using something like deployment.keys.example.text = "${builtins.readFile ./secrets/example.env}", but it is kind of a pain to have to do that. It would be better to just reference the secrets by filesystem paths in the first place.

On the other hand Morph gets this a bit better. It is sadly even less documented than NixOps is, but it offers a similar experience via deployment secrets. The main differences that Morph brings to the table are taking paths to secrets and allowing you to run an arbitrary command on the secret being uploaded. Secrets are also able to be put anywhere on the disk, meaning that when a host reboots it will come back up with the most recent secrets uploaded to it.
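
For comparison with the NixOps declaration above, a Morph secret might be declared roughly like this. This is a sketch based on my reading of Morph's deployment.secrets options, not something taken from a real config: the secret name, the paths, and the restart command are all hypothetical, and the option names are worth checking against the Morph README before relying on them.

```nix
# Sketch of a Morph deployment secret (hypothetical names and paths).
{
  deployment.secrets.example = {
    # Path to the secret on the machine running morph.
    source = "./secrets/example.env";

    # Where the secret lands on the target. Unlike /run/keys, this can be
    # a persistent location that survives reboots.
    destination = "/var/lib/example/.env";

    owner.user = "example";
    owner.group = "nogroup";
    permissions = "0400";

    # Optional command to run on the target after the upload.
    action = [ "sudo" "systemctl" "restart" "example.service" ];
  };
}
```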

However, like NixOps, Morph secrets don't have the ability to be rolled back. This means that if you mess up a secret value, you'd better hope you have the old information somewhere. This goes against what you'd expect from a NixOS machine, where every other configuration change can be rolled back.

So given these examples, I thought it would be interesting to explore what the middle path could look like. I chose to use age for encrypting secrets in the Nix store as well as using SSH host keys to ensure that every secret is decryptable at runtime by that machine only. If you get your hands on the secret ciphertext, it should be unusable to you.

One of the harder things here will be keeping a list of all of the server host keys. Recently I added a hosts.toml file to my config repo for autoconfiguring my WireGuard overlay network.

We will cover how this WireGuard overlay works in a future post.

It was easy enough to add all the SSH host keys for each machine using a command like this to get them:

$ nixops ssh-for-each -d hexagone -- cat /etc/ssh/ssh_host_ed25519_key.pub 
firgu....> ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIB8+mCR+MEsv0XYi7ohvdKLbDecBtb3uKGQOPfIhdj3C root@nixos
chrysalis> ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGDA5iXvkKyvAiMEd/5IruwKwoymC8WxH4tLcLWOSYJ1 root@chrysalis
lufta....> ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMADhGV0hKt3ZY+uBjgOXX08txBS6MmHZcSL61KAd3df root@lufta
keanu....> ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGDZUmuhfjEIROo2hog2c8J53taRuPJLNOtdaT8Nt69W root@nixos

age lets you use SSH keys for encryption and decryption, so I added each of these public keys to the matching host's entry in my hosts.toml.

Now we can encrypt secrets on the host machine and safely put them in the Nix store, because only the target machine can decrypt them. It does that with a command like this:

age -d -i /etc/ssh/ssh_host_ed25519_key -o $dest $src

From here it's easy to make a function that we can use for generating new encrypted secrets in the Nix store. First we need to import the host metadata from the TOML file, then define a helper that encrypts a secret to the current host's public key:

let
  cfg = config.within.secrets;
  metadata = lib.importTOML ../../ops/metadata/hosts.toml;

  mkSecretOnDisk = name:
    { source, ... }:
    pkgs.stdenv.mkDerivation {
      name = "${name}-secret";
      phases = "installPhase";
      buildInputs = [ pkgs.age ];
      installPhase =
        let key = metadata.hosts."${config.networking.hostName}".ssh_pubkey;
        in ''
          age -a -r "${key}" -o $out ${source}
        '';
    };
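
The metadata.hosts lookup above implies a particular layout for hosts.toml. The real file isn't reproduced here, so purely as an illustration, this is roughly the shape of the attribute set that lib.importTOML would need to return for that lookup to work. The host names are borrowed from the ssh-for-each output, the keys are placeholders, and any other fields (such as the WireGuard settings) are left out.

```nix
# Hypothetical shape of the value returned by lib.importTOML for hosts.toml.
# Only the ssh_pubkey field is implied by the module; the rest is assumed.
{
  hosts = {
    # Each entry is keyed by that machine's networking.hostName.
    chrysalis = {
      ssh_pubkey = "ssh-ed25519 AAAA...placeholder... root@chrysalis";
    };
    lufta = {
      ssh_pubkey = "ssh-ed25519 AAAA...placeholder... root@lufta";
    };
  };
}
```

Because the key is looked up by the host name of the machine being built, each machine's closure only ever contains secrets encrypted to its own host key.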

And then we can generate systemd oneshot jobs with something like this:

  mkService = name:
    { source, dest, owner, group, permissions, ... }: {
      description = "decrypt secret for ${name}";
      wantedBy = [ "multi-user.target" ];

      serviceConfig.Type = "oneshot";

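      script = with pkgs; ''
        # remove any stale copy left over from a previous run
        rm -rf ${dest}
        # decrypt the copy in the Nix store with this machine's SSH host key
        ${age}/bin/age -d -i /etc/ssh/ssh_host_ed25519_key -o ${dest} ${
          mkSecretOnDisk name { inherit source; }
        }

        # hand the plaintext to the service user and lock it down
        chown ${owner}:${group} ${dest}
        chmod ${permissions} ${dest}
      '';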
    };

And from there we just need some boring boilerplate to define a secret type (omitted here). Then we declare the option that uses that type and generate one decryption service per declared secret:

in {
  options.within.secrets = mkOption {
    type = types.attrsOf secret;
    description = "secret configuration";
    default = { };
  };

  config.systemd.services = let
    units = mapAttrs' (name: info: {
      name = "${name}-key";
      value = (mkService name info);
    }) cfg;
  in units;
}
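
The secret type referenced by types.attrsOf secret is the boilerplate that was skipped over. A minimal sketch of what it could look like follows, assuming a plain submodule with one option per field that mkService destructures. The defaults and descriptions are my guesses rather than the original definition.

```nix
# Sketch of the omitted `secret` type; defaults here are assumptions.
{ lib, ... }:

with lib;

let
  secret = types.submodule {
    options = {
      source = mkOption {
        type = types.path;
        description = "local path to the plaintext secret";
      };

      dest = mkOption {
        type = types.str;
        description = "where the decrypted secret is written on the target";
      };

      owner = mkOption {
        type = types.str;
        default = "root";
        description = "user that owns the decrypted file";
      };

      group = mkOption {
        type = types.str;
        default = "root";
        description = "group that owns the decrypted file";
      };

      permissions = mkOption {
        type = types.str;
        default = "0400";
        description = "mode passed to chmod for the decrypted file";
      };
    };
  };
in {
  options.within.secrets = mkOption {
    type = types.attrsOf secret;
    description = "secret configuration";
    default = { };
  };
}
```

In the real module this definition would sit in the same let block as mkSecretOnDisk and mkService, so the options declaration is shown here only to keep the sketch self-contained.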

And we have ourselves a NixOS module that allows us to:

  • Trivially declare new secrets
  • Make secrets in the Nix store useless without the key
  • Make every secret be transparently decrypted on startup
  • Avoid the use of GPG
  • Roll back secrets like any other configuration change

Declaring new secrets works like this (as stolen from the service definition for the website you are reading right now):

within.secrets.example = {
  source = ./secrets/example.env;
  dest = "/var/lib/example/.env";
  owner = "example";
  group = "nogroup";
  permissions = "0400";
};
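
A service that consumes this secret then needs to be ordered after the decryption unit, which the module names after the secret with a -key suffix. Here is a sketch of what that could look like, mirroring the after/wants pattern from the NixOps example earlier. The service itself is hypothetical, pkgs.hello stands in for a real program, and the example user is assumed to be declared elsewhere.

```nix
# Sketch of a service consuming the decrypted secret (hypothetical service).
{ pkgs, ... }:

{
  systemd.services.example = {
    wantedBy = [ "multi-user.target" ];

    # within.secrets.example generates example-key.service, so wait for it
    # to finish writing /var/lib/example/.env before starting.
    after = [ "example-key.service" ];
    wants = [ "example-key.service" ];

    serviceConfig = {
      User = "example";
      # systemd loads the decrypted file as environment variables.
      EnvironmentFile = "/var/lib/example/.env";
      # placeholder command; a real service would go here
      ExecStart = "${pkgs.hello}/bin/hello";
    };
  };
}
```

Since the decryption unit is wanted by multi-user.target and only needs the SSH host key already on disk, this comes back on its own after a reboot without anything like nixops send-keys.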

Barring some kind of cryptographic attack against age, this should allow the secrets to be stored securely. I am working on a way to make this more generic. This overall approach was inspired by agenix but made more specific for my needs. I hope this approach will make it easy for me to manage these secrets in the future.