xesite/prometheus-grafana-loki-nixos-2020-11-20.markdown at 9f977b388223d2bb87cd6d63980379c7ed96b218

Grafana

Grafana is a service that handles graphing and alerting. It also has some nice tools to create dashboards. Here we will be using it for a few main purposes:

Exploring what metrics are available
Reading system logs
Making graphs and dashboards
Creating alerts over metrics or lack of metrics

Let's configure Grafana on a machine. Open that machine's configuration.nix in an editor and add the following to it:

# hosts/chrysalis/configuration.nix
{ config, pkgs, ... }: {
  # grafana configuration
  services.grafana = {
    enable = true;
    domain = "grafana.pele";
    port = 2342;
    addr = "127.0.0.1";
  };
  
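  # nginx reverse proxy (nginx itself must be enabled for the virtual host to take effect)
  services.nginx.enable = true;
  services.nginx.virtualHosts.${config.services.grafana.domain} = {
    locations."/" = {
      # forward everything to the local Grafana listener configured above
      proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}";
      proxyWebsockets = true;
    };
  };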
}

If you have a custom TLS Certificate Authority, you can set up HTTPS for this deployment. See here for an example of doing this. If this server is exposed to the internet, you can use a certificate from Let's Encrypt instead of your own Certificate Authority.
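As a rough sketch of the Let's Encrypt route, the NixOS nginx module can request and renew a certificate for the virtual host. This assumes the Grafana domain is a publicly resolvable name (grafana.pele as used here is not, so you would substitute your own), that ports 80 and 443 are reachable from the internet, and that your NixOS release spells the contact option security.acme.defaults.email (older releases used security.acme.email). The email address is a placeholder.

```nix
# hosts/chrysalis/configuration.nix (sketch, not part of the original setup)
{ config, ... }: {
  # Accept the Let's Encrypt terms of service and set a contact address.
  security.acme.acceptTerms = true;
  security.acme.defaults.email = "admin@example.com"; # placeholder, use your own

  services.nginx.enable = true;
  services.nginx.virtualHosts.${config.services.grafana.domain} = {
    enableACME = true; # request a certificate for this virtual host
    forceSSL = true;   # redirect plain HTTP to HTTPS
  };

  # The ACME HTTP challenge and HTTPS traffic need these ports open.
  networking.firewall.allowedTCPPorts = [ 80 443 ];
}
```

The NixOS module system merges these virtual host attributes with the locations block defined earlier, so the proxy settings do not need to be repeated.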

Then you will need to deploy it to your cluster with nixops deploy:

$ nixops deploy -d home

Now open the Grafana server in your browser at http://grafana.pele and log in with the super secure default credentials of admin/admin. Grafana will ask you to change your password. Please change it to something other than admin.

This is all of the setup we will do with Grafana for now. We will come back to it later.

Prometheus

"Prometheus was punished by the gods for giving the gift of knowledge to man. He was cast into the bowels of the earth and pecked by birds." (Oracle Turret, Portal 2)

Prometheus is a service that reads metrics from other services, stores them and allows you to search and aggregate them. Let's add it to our configuration.nix file:

# hosts/chrysalis/configuration.nix
  services.prometheus = {
    enable = true;
    port = 9001;
  };

Now let's deploy this config to the cluster with nixops deploy:

$ nixops deploy -d home

And let's configure Grafana to read from Prometheus. Open Grafana and click the gear icon on the left side of the page. The Data Sources tab should be active. If it is not active, click on Data Sources. Then click "add data source" and choose Prometheus. Set the URL to http://127.0.0.1:9001 (or whatever port you configured above) and leave everything else set to the default values. Click "Save & Test". If there is an error, be sure to check the port number.

Now let's start getting some data into Prometheus with the node exporter.

Node Exporter Setup

The Prometheus node exporter exposes a lot of information about a system, ranging from memory and disk usage to systemd service information. There are other exporters you can set up depending on your individual needs, but we are going to enable only the node exporter here.

In your configuration.nix, add an exporters block and configure the node exporter under services.prometheus:

# hosts/chrysalis/configuration.nix
  services.prometheus = {
    exporters = {
      node = {
        enable = true;
        enabledCollectors = [ "systemd" ];
        port = 9002;
      };
    };
  };

Now we need to configure Prometheus to read metrics from this exporter. In your configuration.nix, add a scrapeConfigs block under services.prometheus that points to the node exporter we configured just now:

# hosts/chrysalis/configuration.nix
  services.prometheus = {
    # ...
    
    scrapeConfigs = [
      {
        job_name = "chrysalis";
        static_configs = [{
          targets = [ "127.0.0.1:${toString config.services.prometheus.exporters.node.port}" ];
        }];
      }
    ];
    
    # ...
  };
  
  # ...

The complicated expression in the target above allows you to change the port of the node exporter and ensure that Prometheus will always be pointing at the right port!
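To see what that interpolation does in isolation, here is a small standalone Nix expression. The name port is only for illustration. Nix does not implicitly convert integers inside ${...}, which is why toString is needed.

```nix
# Evaluate with: nix-instantiate --eval example.nix
let
  port = 9002; # stands in for config.services.prometheus.exporters.node.port
in
  "127.0.0.1:${toString port}" # evaluates to "127.0.0.1:9002"
```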

Now we can deploy this to your cluster with nixops:

$ nixops deploy -d home

Open the Explore tab in Grafana and type in the following expression:

node_memory_MemFree_bytes

and press Shift+Enter (or click the "Run Query" button in the upper left side of the screen). You should see a graph showing you the amount of RAM that is free on the host, something like this:

If you want to query other metrics, you can type node_ into the search box and autocomplete will show what is available. For a full list of what is available, open the node exporter metrics route in your browser and look through it.

Grafana Dashboards

Now that we have all of this information about our machine, let's create a little dashboard for it and set up a few alerts.

Click on the plus icon on the left side of the Grafana UI to create a new dashboard. It will look something like this:

In Grafana terminology, everything you see in a dashboard is inside a panel. Let's create a new panel to keep track of memory usage for our server. Click "Add New Panel" and you will get a screen that looks like this:

Let's make this keep track of free memory. Write "Memory Free" in the panel title field on the right. Write the following query in the textbox next to the dropdown labeled "Metrics":

node_memory_MemFree_bytes

and set the legend to {{job}}. You should get a graph that looks something like this:

This will show you how much memory is free on each machine you are monitoring with Prometheus' node exporter. Now let's configure an alert for the amount of free memory being low (where "low" means less than 64 megabytes of RAM free).

Hit save in the upper right corner of the Grafana UI and give your dashboard a name, such as "Home Cluster Status". Now open the "Memory Free" panel for editing (click on the name and then click "Edit"), click the "Alert" tab, and click the "Create Alert" button. Let's configure it to do the following:

Check if free memory gets below 64 megabytes (64000000 bytes)
Send the message "Running out of memory!" when the alert fires

You can do that with a configuration like this:

Save the changes to apply this config.
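If you would rather keep an equivalent check in configuration.nix, Prometheus has its own alerting rules that can be declared through services.prometheus.rules. This is a different mechanism from the Grafana alert described here and is shown only as a sketch: the group name, alert name and the five minute hold time are made up, and without an Alertmanager configured the alert will only be visible in the Prometheus web UI.

```nix
# hosts/chrysalis/configuration.nix (sketch)
  services.prometheus.rules = [
    ''
      groups:
        - name: memory               # hypothetical group name
          rules:
            - alert: LowFreeMemory   # hypothetical alert name
              # fire when free memory stays below 64 megabytes for 5 minutes
              expr: node_memory_MemFree_bytes < 64000000
              for: 5m
              annotations:
                summary: "Running out of memory!"
    ''
  ];
```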

Wait a minute. Where will this alert go to?

It will only show up on the alerts page:

But we can add a notification channel to customize this. Click on the Notification Channels tab and then click "New Channel". It should look something like this:

You can send notifications to many services, but let's send one to Discord this time. Acquire a Discord webhook link from somewhere and paste it in the Webhook URL field. Name it something like "Discord". It may also be a good idea to make this the default notification channel using the "Default" checkbox under the Notification Settings, so that our existing alert will show up in Discord when the system runs out of memory.

You can configure other alerts like this so you can monitor any other node metrics you want.

You can also monitor for the lack of data on particular metrics. If something that should always be reported suddenly isn't reported, it may be a good indicator that a server went down. You can also add other services to your scrapeConfigs settings so you can monitor things that expose metrics to Prometheus at /metrics.
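As a minimal sketch of scraping one more service, the scrapeConfigs list can grow a second job next to the node exporter. The job name myservice and the port 8080 are hypothetical; replace them with whatever your service actually uses.

```nix
# hosts/chrysalis/configuration.nix (sketch)
  services.prometheus.scrapeConfigs = [
    {
      # the node exporter job from earlier
      job_name = "chrysalis";
      static_configs = [{
        targets = [ "127.0.0.1:${toString config.services.prometheus.exporters.node.port}" ];
      }];
    }
    {
      # hypothetical service that serves Prometheus metrics at /metrics
      job_name = "myservice";
      metrics_path = "/metrics";
      static_configs = [{
        targets = [ "127.0.0.1:8080" ];
      }];
    }
  ];
```

Prometheus also records an up metric for every scrape target, so a query such as up{job="myservice"} is one way to notice when a target stops responding.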

Now that we have metrics configured, let's enable Loki for logging.

Loki

Loki is a log aggregator created by the people behind Grafana. Here we will use it as a target for all system logs. Unfortunately, the Loki NixOS module is very basic at the moment, so we will need to configure it with our own custom YAML file. Create a file called loki-local-config.yaml in the same folder as your configuration.nix and copy in the config from this gist:

Then enable Loki with your config in your configuration.nix file:

# hosts/chrysalis/configuration.nix
  services.loki = {
    enable = true;
    configFile = ./loki-local-config.yaml;
  };

Promtail is a tool made by the Loki team that sends logs into Loki. Create a file called promtail.yaml in the same folder as configuration.nix with the following contents:

server:
  http_listen_port: 28183
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: chrysalis
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

Now we can add promtail to your configuration.nix by creating a systemd service to run it with this snippet:

# hosts/chrysalis/configuration.nix
  systemd.services.promtail = {
    description = "Promtail service for Loki";
    wantedBy = [ "multi-user.target" ];

    serviceConfig = {
      ExecStart = ''
        ${pkgs.grafana-loki}/bin/promtail --config.file ${./promtail.yaml}
      '';
    };
  };
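Newer NixOS releases ship a services.promtail module, which may let you skip the hand-written unit and the separate YAML file. The following is a sketch that assumes your release has that module; it expresses the same promtail.yaml settings as a Nix attribute set and would replace the systemd.services.promtail definition rather than sit alongside it.

```nix
# hosts/chrysalis/configuration.nix (sketch, replaces the hand-written unit)
  services.promtail = {
    enable = true;
    configuration = {
      server = {
        http_listen_port = 28183;
        grpc_listen_port = 0;
      };
      positions.filename = "/tmp/positions.yaml";
      clients = [{ url = "http://127.0.0.1:3100/loki/api/v1/push"; }];
      scrape_configs = [{
        job_name = "journal";
        journal = {
          max_age = "12h";
          labels = {
            job = "systemd-journal";
            host = "chrysalis";
          };
        };
        # copy the systemd unit name into a "unit" label
        relabel_configs = [{
          source_labels = [ "__journal__systemd_unit" ];
          target_label = "unit";
        }];
      }];
    };
  };
```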

Now that you have this all set up, you can push this to your cluster with nixops:

$ nixops deploy -d home

Once that finishes, open up Grafana and configure a new Loki data source with the URL http://127.0.0.1:3100:

Now that you have Loki set up, let's query it! Open the Explore view in Grafana again, choose Loki as the source, and enter in the query {job="systemd-journal"}:

You can also add Loki queries like this to dashboards! Loki also lets you query by systemd unit with the unit label. If you wanted to search for logs from foo.service, you would need a query that looks something like {job="systemd-journal", unit="foo.service"}. You can do many more complicated things with Loki. Look here for more information on what you can query. As of the time of writing, you cannot create Grafana alerts based on Loki queries as far as I am aware.

This barely scratches the surface of what you can accomplish with a setup like this. With fancier setups you can alert on the rate at which metrics change. I plan to make NixOS modules to make this setup easier in the future. There is also a set of options in services.grafana.provision that can make it easier to automagically set up Grafana with per-host dashboards, alerts and all of the data sources that are outlined in this post.
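As a sketch of what that provisioning could look like for the two data sources used in this post, assuming the list-style datasources option that goes with the services.grafana options shown earlier (newer NixOS releases nest the same entries under datasources.settings.datasources):

```nix
# hosts/chrysalis/configuration.nix (sketch)
  services.grafana.provision = {
    enable = true;
    datasources = [
      {
        name = "Prometheus";
        type = "prometheus";
        # reuse the Prometheus port so the two settings cannot drift apart
        url = "http://127.0.0.1:${toString config.services.prometheus.port}";
      }
      {
        name = "Loki";
        type = "loki";
        url = "http://127.0.0.1:3100";
      }
    ];
  };
```

With this in place the data sources would be created at deploy time instead of by clicking through the Grafana UI.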

The setup in this post is quite meager, but it should be enough to get you started with whatever you need to monitor. Adding Prometheus metrics to your services will go a long way toward monitoring things better in production. Do not be afraid to experiment!
