396 lines
14 KiB
Markdown
396 lines
14 KiB
Markdown
|
---
|
||
|
title: How to Setup Prometheus, Grafana and Loki on NixOS
|
||
|
date: 2020-11-20
|
||
|
tags:
|
||
|
- nixos
|
||
|
- prometheus
|
||
|
- grafana
|
||
|
- loki
|
||
|
- promtail
|
||
|
---
|
||
|
|
||
|
# How to Setup Prometheus, Grafana and Loki on NixOS
|
||
|
|
||
|
When setting up services on your home network, sometimes you have questions
|
||
|
along the lines of "how do I know that things are working?". In this blogpost we
|
||
|
will go over a few tools that you can use to monitor and visualize your machine
|
||
|
state so you can answer that. Specifically we are going to use the following
|
||
|
tools to do this:
|
||
|
|
||
|
- [Grafana](https://grafana.com/) for creating pretty graphs and managing
|
||
|
alerts
|
||
|
- [Prometheus](https://prometheus.io/) for storing metrics and as a common
|
||
|
metrics format
|
||
|
- [Prometheus node_exporter](https://github.com/prometheus/node_exporter) for
|
||
|
deriving metrics from system state
|
||
|
- [Loki](https://grafana.com/oss/loki/) as a central log storage point
|
||
|
- [promtail](https://grafana.com/docs/loki/latest/clients/promtail/) to push
|
||
|
logs to Loki
|
||
|
|
||
|
Let's get going!
|
||
|
|
||
|
[Something to note: in here you might see domains using the `.pele` top-level
|
||
|
domain. This domain will likely not be available on your home network. See <a
|
||
|
href="/blog/series/site-to-site-wireguard">this series</a> on how to set up
|
||
|
something similar for your home network. If you don't have such a setup, replace
|
||
|
anything that ends in `.pele` with whatever you normally use for
|
||
|
this.](conversation://Mara/hacker)
|
||
|
|
||
|
## Grafana
|
||
|
|
||
|
Grafana is a service that handles graphing and alerting. It also has some nice
|
||
|
tools to create dashboards. Here we will be using it for a few main purposes:
|
||
|
|
||
|
- Exploring what metrics are available
|
||
|
- Reading system logs
|
||
|
- Making graphs and dashboards
|
||
|
- Creating alerts over metrics or lack of metrics
|
||
|
|
||
|
Let's configure Grafana on a machine. Open that machine's `configuration.nix` in
|
||
|
an editor and add the following to it:
|
||
|
|
||
|
```nix
|
||
|
# hosts/chrysalis/configuration.nix
|
||
|
{ config, pkgs, ... }: {
|
||
|
# grafana configuration
|
||
|
services.grafana = {
|
||
|
enable = true;
|
||
|
domain = "grafana.pele";
|
||
|
port = 2342;
|
||
|
addr = "127.0.0.1";
|
||
|
};
|
||
|
|
||
|
# nginx reverse proxy
|
||
|
services.nginx.virtualHosts.${services.grafana.domain} = {
|
||
|
locations."/" = {
|
||
|
proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}";
|
||
|
proxyWebsockets = true;
|
||
|
};
|
||
|
};
|
||
|
}
|
||
|
```
|
||
|
|
||
|
[If you have a <a href="/blog/site-to-site-wireguard-part-3-2019-04-11">custom
|
||
|
TLS Certificate Authority</a>, you can set up HTTPS for this deployment. See <a
|
||
|
href="https://github.com/Xe/nixos-configs/blob/master/common/sites/grafana.akua.nix">here</a>
|
||
|
for an example of doing this. If this server is exposed to the internet, you can
|
||
|
use a certificate from <a
|
||
|
href="https://nixos.wiki/wiki/Nginx#TLS_reverse_proxy">Let's Encrypt</a> instead
|
||
|
of your own Certificate Authority.](conversation://Mara/hacker)
|
||
|
|
||
|
Then you will need to deploy it to your cluster with `nixops deploy`:
|
||
|
|
||
|
```console
|
||
|
$ nixops deploy -d home
|
||
|
```
|
||
|
|
||
|
Now open the Grafana server in your browser at http://grafana.pele and login
|
||
|
with the super secure default credentials of admin/admin. Grafana will ask you
|
||
|
to change your password. Please change it to something other than admin.
|
||
|
|
||
|
This is all of the setup we will do with Grafana for now. We will come back to
|
||
|
it later.
|
||
|
|
||
|
## Prometheus
|
||
|
|
||
|
> Prometheus was punished by the gods by giving the gift of knowledge to man. He
|
||
|
> was cast into the bowels of the earth and pecked by birds.
|
||
|
Oracle Turret, Portal 2
|
||
|
|
||
|
Prometheus is a service that reads metrics from other services, stores them and
|
||
|
allows you to search and aggregate them. Let's add it to our `configuration.nix`
|
||
|
file:
|
||
|
|
||
|
```nix
|
||
|
# hosts/chrysalis/configuration.nix
|
||
|
services.prometheus = {
|
||
|
enable = true;
|
||
|
port = 9001;
|
||
|
};
|
||
|
```
|
||
|
|
||
|
Now let's deploy this config to the cluster with `nixops deploy`:
|
||
|
|
||
|
```console
|
||
|
$ nixops deploy -d home
|
||
|
```
|
||
|
|
||
|
And let's configure Grafana to read from Prometheus. Open Grafana and click on
|
||
|
the gear to the left side of the page. The `Data Sources` tab should be active.
|
||
|
If it is not active, click on `Data Sources`. Then click "add data source" and
|
||
|
choose Prometheus. Set the URL to `http://127.0.0.1:9001` (or with whatever port
|
||
|
you configured above) and leave everything set to the default values. Click
|
||
|
"Save & Test". If there is an error, be sure to check the port number.
|
||
|
|
||
|
![The Grafana UI for adding a data
|
||
|
source](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_145819.png)
|
||
|
|
||
|
Now let's start getting some data into Prometheus with the node exporter.
|
||
|
|
||
|
### Node Exporter Setup
|
||
|
|
||
|
The Prometheus node exporter exposes a lot of information about systems ranging
|
||
|
from memory, disk usage and even systemd service information. There are also
|
||
|
some [other
|
||
|
collectors](https://search.nixos.org/options?channel=20.09&query=prometheus.exporters+enable)
|
||
|
you can set up based on your individual setup, however we are going to enable
|
||
|
only the node collector here.
|
||
|
|
||
|
In your `configuration.nix`, add an exporters block and configure the node
|
||
|
exporter under `services.prometheus`:
|
||
|
|
||
|
```nix
|
||
|
# hosts/chrysalis/configuration.nix
|
||
|
services.prometheus = {
|
||
|
exporters = {
|
||
|
node = {
|
||
|
enable = true;
|
||
|
enabledCollectors = [ "systemd" ];
|
||
|
port = 9001;
|
||
|
};
|
||
|
};
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Now we need to configure Prometheus to read metrics from this exporter. In your
|
||
|
`configuration.nix`, add a `scrapeConfigs` block under `services.prometheus`
|
||
|
that points to the node exporter we configured just now:
|
||
|
|
||
|
```nix
|
||
|
# hosts/chrysalis/configuration.nix
|
||
|
services.prometheus = {
|
||
|
# ...
|
||
|
|
||
|
scrapeConfigs = [
|
||
|
{
|
||
|
job_name = "chrysalis";
|
||
|
static_configs = [
|
||
|
targets = [ "127.0.0.1:${toString config.services.prometheus.exporters.node.port}" ];
|
||
|
];
|
||
|
}
|
||
|
];
|
||
|
|
||
|
# ...
|
||
|
}
|
||
|
|
||
|
# ...
|
||
|
```
|
||
|
|
||
|
[The complicated expression in the target above allows you to change the port of
|
||
|
the node exporter and ensure that Prometheus will always be pointing at the
|
||
|
right port!](conversation://Mara/hacker)
|
||
|
|
||
|
Now we can deploy this to your cluster with nixops:
|
||
|
|
||
|
```console
|
||
|
$ nixops deploy -d home
|
||
|
```
|
||
|
|
||
|
Open the Explore tab in Grafana and type in the following expression:
|
||
|
|
||
|
```
|
||
|
node_memory_MemFree_bytes
|
||
|
```
|
||
|
|
||
|
and hit shift-enter (or click the "Run Query" button in the upper left side of
|
||
|
the screen). You should see a graph showing you the amount of ram that is free
|
||
|
on the host, something like this:
|
||
|
|
||
|
![A graph of the amount of system memory that is available on the host
|
||
|
chrysalis](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_150328.png)
|
||
|
|
||
|
If you want to query other fields, you can type in `node_` into the searchbox
|
||
|
and autocomplete will show what is available. For a full list of what is
|
||
|
available, open the node exporter metrics route in your browser and look through
|
||
|
it.
|
||
|
|
||
|
## Grafana Dashboards
|
||
|
|
||
|
Now that we have all of this information about our machine, let's create a
|
||
|
little dashboard for it and set up a few alerts.
|
||
|
|
||
|
Click on the plus icon on the left side of the Grafana UI to create a new
|
||
|
dashboard. It will look something like this:
|
||
|
|
||
|
![An empty dashboard in
|
||
|
Grafana](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_151205.png)
|
||
|
|
||
|
In Grafana terminology, everything you see in a dashboard is inside a panel.
|
||
|
Let's create a new panel to keep track of memory usage for our server. Click
|
||
|
"Add New Panel" and you will get a screen that looks like this:
|
||
|
|
||
|
![A Grafana panel configuration
|
||
|
screen](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_151609.png)
|
||
|
|
||
|
Let's make this keep track of free memory. Write "Memory Free" in the panel
|
||
|
title field on the right. Write the following query in the textbox next to the
|
||
|
dropdown labeled "Metrics":
|
||
|
|
||
|
```
|
||
|
node_memory_MemFree_bytes
|
||
|
```
|
||
|
|
||
|
and set the legend to `{{job}}`. You should get a graph that looks something
|
||
|
like this:
|
||
|
|
||
|
![A populated
|
||
|
graph](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_152126.png)
|
||
|
|
||
|
This will show you how much memory is free on each machine you are monitoring
|
||
|
with Prometheus' node exporter. Now let's configure an alert for the amount of
|
||
|
free memory being low (where "low" means less than 64 megabytes of ram free).
|
||
|
|
||
|
Hit save in the upper right corner of the Grafana UI and give your dashboard a
|
||
|
name, such as "Home Cluster Status". Now open the "Memory Free" panel for
|
||
|
editing (click on the name and then click "Edit"), click the "Alert" tab, and
|
||
|
click the "Create Alert" button. Let's configure it to do the following:
|
||
|
|
||
|
- Check if free memory gets below 64 megabytes (64000000 bytes)
|
||
|
- Send the message "Running out of memory!" when the alert fires
|
||
|
|
||
|
You can do that with a configuration like this:
|
||
|
|
||
|
![The above configuration input to the Grafana
|
||
|
UI](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_153419.png)
|
||
|
|
||
|
Save the changes to apply this config.
|
||
|
|
||
|
[Wait a minute. Where will this alert go to?](conversation://Mara/hmm)
|
||
|
|
||
|
It will only show up on the alerts page:
|
||
|
|
||
|
![The alerts page with memory free alerts
|
||
|
configured](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_154027.png)
|
||
|
|
||
|
But we can add a notification channel to customize this. Click on the
|
||
|
Notification Channels tab and then click "New Channel". It should look something
|
||
|
like this:
|
||
|
|
||
|
![Notification Channel
|
||
|
configuration](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_154317.png)
|
||
|
|
||
|
You can send notifications to many services, but let's send one to Discord this
|
||
|
time. Acquire a Discord webhook link from somewhere and paste it in the Webhook
|
||
|
URL field. Name it something like "Discord". It may also be a good idea to make
|
||
|
this the default notification channel using the "Default" checkbox under the
|
||
|
Notification Settings, so that our existing alert will show up in Discord when
|
||
|
the system runs out of memory.
|
||
|
|
||
|
You can configure other alerts like this so you can monitor any other node
|
||
|
metrics you want.
|
||
|
|
||
|
[You can also monitor for the _lack_ of data on particular metrics. If something
|
||
|
that should always be reported suddenly isn't reported, it may be a good
|
||
|
indicator that a server went down. You can also add other services to your
|
||
|
`scrapeConfigs` settings so you can monitor things that expose metrics to
|
||
|
Prometheus at `/metrics`.](conversation://Mara/hacker)
|
||
|
|
||
|
Now that we have metrics configured, let's enable Loki for logging.
|
||
|
|
||
|
## Loki
|
||
|
|
||
|
Loki is a log aggregator created by the people behind Grafana. Here we will use
|
||
|
it as a target for all system logs. Unfortunately, the Loki NixOS module is very
|
||
|
basic at the moment, so we will need to configure it with our own custom yaml
|
||
|
file. Create a file in your `configuration.nix` folder called `loki.yaml` and
|
||
|
copy in the config from [this
|
||
|
gist](https://gist.github.com/Xe/c3c786b41ec2820725ee77a7af551225):
|
||
|
|
||
|
Then enable Loki with your config in your `configuration.nix` file:
|
||
|
|
||
|
```nix
|
||
|
# hosts/chrysalis/configuration.nix
|
||
|
services.loki = {
|
||
|
enable = true;
|
||
|
configFile = ./loki-local-config.yaml;
|
||
|
};
|
||
|
```
|
||
|
|
||
|
Promtail is a tool made by the Loki team that sends logs into Loki. Create a
|
||
|
file called `promtail.yaml` in the same folder as `configuration.nix` with the
|
||
|
following contents:
|
||
|
|
||
|
```yaml
|
||
|
server:
|
||
|
http_listen_port: 28183
|
||
|
grpc_listen_port: 0
|
||
|
|
||
|
positions:
|
||
|
filename: /tmp/positions.yaml
|
||
|
|
||
|
clients:
|
||
|
- url: http://127.0.0.1:3100/loki/api/v1/push
|
||
|
|
||
|
scrape_configs:
|
||
|
- job_name: journal
|
||
|
journal:
|
||
|
max_age: 12h
|
||
|
labels:
|
||
|
job: systemd-journal
|
||
|
host: chrysalis
|
||
|
relabel_configs:
|
||
|
- source_labels: ['__journal__systemd_unit']
|
||
|
target_label: 'unit'
|
||
|
```
|
||
|
|
||
|
Now we can add promtail to your `configuration.nix` by creating a systemd
|
||
|
service to run it with this snippet:
|
||
|
|
||
|
```nix
|
||
|
# hosts/chrysalis/configuration.nix
|
||
|
systemd.services.promtail = {
|
||
|
description = "Promtail service for Loki";
|
||
|
wantedBy = [ "multi-user.target" ];
|
||
|
|
||
|
serviceConfig = {
|
||
|
ExecStart = ''
|
||
|
${pkgs.grafana-loki}/bin/promtail --config.file ${./promtail.yaml}
|
||
|
'';
|
||
|
};
|
||
|
};
|
||
|
```
|
||
|
|
||
|
Now that you have this all set up, you can push this to your cluster with
|
||
|
nixops:
|
||
|
|
||
|
```console
|
||
|
$ nixops deploy -d home
|
||
|
```
|
||
|
|
||
|
Once that finishes, open up Grafana and configure a new Loki data source with
|
||
|
the URL `http://127.0.0.1:3100`:
|
||
|
|
||
|
![Loki Data Source
|
||
|
configuration](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_161610.png)
|
||
|
|
||
|
Now that you have Loki set up, let's query it! Open the Explore view in Grafana
|
||
|
again, choose Loki as the source, and enter in the query `{job="systemd-journal"}`:
|
||
|
|
||
|
![Loki
|
||
|
search](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_162043.png)
|
||
|
|
||
|
[You can also add Loki queries like this to dashboards! Loki also lets you query by
|
||
|
systemd unit with the `unit` field. If you wanted to search for logs from
|
||
|
`foo.service`, you would need a query that looks something like
|
||
|
`{job="systemd-journal", unit="foo.service"}` You can do many more complicated
|
||
|
things with Loki. Look <a
|
||
|
href="https://grafana.com/docs/grafana/latest/datasources/loki/#search-expression">here
|
||
|
</a> for more information on what you can query. As of the time of writing this
|
||
|
blogpost, you are currently unable to make Grafana alerts based on Loki queries
|
||
|
as far as I am aware.](conversation://Mara/hacker)
|
||
|
|
||
|
---
|
||
|
|
||
|
This barely scrapes the surface of what you can accomplish with a setup like
|
||
|
this. Using more fancy setups you can alert on the rate of metrics changing. I
|
||
|
plan to make NixOS modules to make this setup easier in the future. There is
|
||
|
also a set of options in
|
||
|
[services.grafana.provision](https://search.nixos.org/options?channel=20.09&from=0&size=30&sort=relevance&query=grafana.provision)
|
||
|
that can make it easier to automagically set up Grafana with per-host
|
||
|
dashboards, alerts and all of the data sources that are outlined in this post.
|
||
|
|
||
|
The setup in this post is quite meager, but it should be enough to get you
|
||
|
started with whatever you need to monitor. Adding Prometheus metrics to your
|
||
|
services will go a long way in terms of being able to better monitor things in
|
||
|
production, do not be afraid to experiment!
|