Prometheus grafana loki nixos (#266)

* prometheus-grafana-loki-nixos post

* simple fix
Cadey Ratio 2020-11-20 17:36:51 -05:00 committed by GitHub
parent 2dde44763d
commit f5a86eafb8
2 changed files with 397 additions and 2 deletions

@@ -0,0 +1,395 @@
---
title: How to Set Up Prometheus, Grafana and Loki on NixOS
date: 2020-11-20
tags:
- nixos
- prometheus
- grafana
- loki
- promtail
---
# How to Set Up Prometheus, Grafana and Loki on NixOS

When setting up services on your home network, sometimes you have questions
along the lines of "how do I know that things are working?". In this blogpost we
will go over a few tools that you can use to monitor and visualize your machine
state so you can answer that. Specifically we are going to use the following
tools to do this:

- [Grafana](https://grafana.com/) for creating pretty graphs and managing
alerts
- [Prometheus](https://prometheus.io/) for storing metrics and as a common
metrics format
- [Prometheus node_exporter](https://github.com/prometheus/node_exporter) for
deriving metrics from system state
- [Loki](https://grafana.com/oss/loki/) as a central log storage point
- [promtail](https://grafana.com/docs/loki/latest/clients/promtail/) to push
logs to Loki

Let's get going!

[Something to note: in here you might see domains using the `.pele` top-level
domain. This domain will likely not be available on your home network. See <a
href="/blog/series/site-to-site-wireguard">this series</a> on how to set up
something similar for your home network. If you don't have such a setup, replace
anything that ends in `.pele` with whatever you normally use for
this.](conversation://Mara/hacker)
## Grafana
Grafana is a service that handles graphing and alerting. It also has some nice
tools to create dashboards. Here we will be using it for a few main purposes:

- Exploring what metrics are available
- Reading system logs
- Making graphs and dashboards
- Creating alerts over metrics or lack of metrics

Let's configure Grafana on a machine. Open that machine's `configuration.nix` in
an editor and add the following to it:
```nix
# hosts/chrysalis/configuration.nix
{ config, pkgs, ... }: {
  # grafana configuration
  services.grafana = {
    enable = true;
    domain = "grafana.pele";
    port = 2342;
    addr = "127.0.0.1";
  };

  # nginx reverse proxy
  services.nginx.virtualHosts.${config.services.grafana.domain} = {
    locations."/" = {
      proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}";
      proxyWebsockets = true;
    };
  };
}
```
[If you have a <a href="/blog/site-to-site-wireguard-part-3-2019-04-11">custom
TLS Certificate Authority</a>, you can set up HTTPS for this deployment. See <a
href="https://github.com/Xe/nixos-configs/blob/master/common/sites/grafana.akua.nix">here</a>
for an example of doing this. If this server is exposed to the internet, you can
use a certificate from <a
href="https://nixos.wiki/wiki/Nginx#TLS_reverse_proxy">Let's Encrypt</a> instead
of your own Certificate Authority.](conversation://Mara/hacker)
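If you do go the Let's Encrypt route, a minimal sketch of it could look
something like the following. This assumes the machine is reachable from the
internet on ports 80 and 443 and that `services.grafana.domain` is set to a
publicly resolvable domain rather than something under `.pele`; the email
address is a placeholder:

```nix
# hosts/chrysalis/configuration.nix
security.acme.acceptTerms = true;
security.acme.email = "you@example.com"; # placeholder, use your own

services.nginx.virtualHosts.${config.services.grafana.domain} = {
  enableACME = true; # ask Let's Encrypt for a certificate
  forceSSL = true;   # redirect plain HTTP to HTTPS
  locations."/" = {
    proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}";
    proxyWebsockets = true;
  };
};
```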
Then you will need to deploy it to your cluster with `nixops deploy`:
```console
$ nixops deploy -d home
```
Now open the Grafana server in your browser at http://grafana.pele and log in
with the super secure default credentials of admin/admin. Grafana will ask you
to change your password. Please change it to something other than admin.
This is all of the setup we will do with Grafana for now. We will come back to
it later.
## Prometheus
> Prometheus was punished by the gods for giving the gift of knowledge to man. He
> was cast into the bowels of the earth and pecked by birds.
>
> - Oracle Turret, Portal 2

Prometheus is a service that reads metrics from other services, stores them and
allows you to search and aggregate them. Let's add it to our `configuration.nix`
file:
```nix
# hosts/chrysalis/configuration.nix
services.prometheus = {
  enable = true;
  port = 9001;
};
```
Now let's deploy this config to the cluster with `nixops deploy`:
```console
$ nixops deploy -d home
```
And let's configure Grafana to read from Prometheus. Open Grafana and click on
the gear to the left side of the page. The `Data Sources` tab should be active.
If it is not active, click on `Data Sources`. Then click "add data source" and
choose Prometheus. Set the URL to `http://127.0.0.1:9001` (or whatever port
you configured above) and leave everything set to the default values. Click
"Save & Test". If there is an error, be sure to check the port number.
![The Grafana UI for adding a data
source](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_145819.png)
Now let's start getting some data into Prometheus with the node exporter.
### Node Exporter Setup
The Prometheus node exporter exposes a lot of information about a system,
ranging from memory and disk usage to systemd service state. There are also
some [other
collectors](https://search.nixos.org/options?channel=20.09&query=prometheus.exporters+enable)
you can set up based on your individual setup, however we are going to enable
only the node collector here.
In your `configuration.nix`, add an exporters block and configure the node
exporter under `services.prometheus`:
```nix
# hosts/chrysalis/configuration.nix
services.prometheus = {
  exporters = {
    node = {
      enable = true;
      enabledCollectors = [ "systemd" ];
      # Prometheus itself is already listening on 9001, so give the
      # node exporter its own port
      port = 9002;
    };
  };
};
```
Now we need to configure Prometheus to read metrics from this exporter. In your
`configuration.nix`, add a `scrapeConfigs` block under `services.prometheus`
that points to the node exporter we configured just now:
```nix
# hosts/chrysalis/configuration.nix
services.prometheus = {
  # ...
  scrapeConfigs = [
    {
      job_name = "chrysalis";
      static_configs = [{
        targets = [ "127.0.0.1:${toString config.services.prometheus.exporters.node.port}" ];
      }];
    }
  ];
  # ...
};
```
[The complicated expression in the target above allows you to change the port of
the node exporter and ensure that Prometheus will always be pointing at the
right port!](conversation://Mara/hacker)
Now we can deploy this to your cluster with nixops:
```console
$ nixops deploy -d home
```
Open the Explore tab in Grafana and type in the following expression:
```
node_memory_MemFree_bytes
```
and hit shift-enter (or click the "Run Query" button in the upper left side of
the screen). You should see a graph showing you the amount of RAM that is free
on the host, something like this:
![A graph of the amount of system memory that is available on the host
chrysalis](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_150328.png)
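PromQL can also aggregate over time. For example, this query uses a standard
PromQL function to smooth the same metric out into a five-minute moving
average:

```
avg_over_time(node_memory_MemFree_bytes[5m])
```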
If you want to query other fields, you can type `node_` into the search box
and autocomplete will show what is available. For a full list of what is
available, open the node exporter metrics route in your browser and look through
it.
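Assuming the node exporter port we configured earlier, you can also dump that
list from a terminal:

```console
$ curl -s http://127.0.0.1:9002/metrics | grep '^node_' | head
```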
## Grafana Dashboards
Now that we have all of this information about our machine, let's create a
little dashboard for it and set up a few alerts.
Click on the plus icon on the left side of the Grafana UI to create a new
dashboard. It will look something like this:
![An empty dashboard in
Grafana](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_151205.png)
In Grafana terminology, everything you see in a dashboard is inside a panel.
Let's create a new panel to keep track of memory usage for our server. Click
"Add New Panel" and you will get a screen that looks like this:
![A Grafana panel configuration
screen](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_151609.png)
Let's make this keep track of free memory. Write "Memory Free" in the panel
title field on the right. Write the following query in the textbox next to the
dropdown labeled "Metrics":
```
node_memory_MemFree_bytes
```
and set the legend to `{{job}}`. You should get a graph that looks something
like this:
![A populated
graph](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_152126.png)
This will show you how much memory is free on each machine you are monitoring
with Prometheus' node exporter. Now let's configure an alert for the amount of
free memory being low (where "low" means less than 64 megabytes of ram free).
Hit save in the upper right corner of the Grafana UI and give your dashboard a
name, such as "Home Cluster Status". Now open the "Memory Free" panel for
editing (click on the name and then click "Edit"), click the "Alert" tab, and
click the "Create Alert" button. Let's configure it to do the following:
- Check if free memory gets below 64 megabytes (64000000 bytes)
- Send the message "Running out of memory!" when the alert fires

You can do that with a configuration like this:
![The above configuration input to the Grafana
UI](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_153419.png)
Save the changes to apply this config.
[Wait a minute. Where will this alert go to?](conversation://Mara/hmm)
It will only show up on the alerts page:
![The alerts page with memory free alerts
configured](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_154027.png)
But we can add a notification channel to customize this. Click on the
Notification Channels tab and then click "New Channel". It should look something
like this:
![Notification Channel
configuration](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_154317.png)
You can send notifications to many services, but let's send one to Discord this
time. Acquire a Discord webhook link from somewhere and paste it in the Webhook
URL field. Name it something like "Discord". It may also be a good idea to make
this the default notification channel using the "Default" checkbox under the
Notification Settings, so that our existing alert will show up in Discord when
the system runs out of memory.
You can configure other alerts like this to monitor any other node metrics you
want.
[You can also monitor for the _lack_ of data on particular metrics. If something
that should always be reported suddenly isn't reported, it may be a good
indicator that a server went down. You can also add other services to your
`scrapeConfigs` settings so you can monitor things that expose metrics to
Prometheus at `/metrics`.](conversation://Mara/hacker)
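As a sketch of that last point, an extra entry in the `scrapeConfigs` list from
earlier could look something like this. The job name, address and port here are
made up; point them at whatever service you actually run:

```nix
# hosts/chrysalis/configuration.nix
services.prometheus.scrapeConfigs = [
  # ... the node exporter job from earlier ...
  {
    # hypothetical service that exposes Prometheus metrics at /metrics
    job_name = "myservice";
    static_configs = [{ targets = [ "127.0.0.1:8080" ]; }];
  }
];
```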
Now that we have metrics configured, let's enable Loki for logging.
## Loki
Loki is a log aggregator created by the people behind Grafana. Here we will use
it as a target for all system logs. Unfortunately, the Loki NixOS module is very
basic at the moment, so we will need to configure it with our own custom yaml
file. Create a file next to your `configuration.nix` called
`loki-local-config.yaml` and copy in the config from [this
gist](https://gist.github.com/Xe/c3c786b41ec2820725ee77a7af551225).

Then enable Loki with your config in your `configuration.nix` file:
```nix
# hosts/chrysalis/configuration.nix
services.loki = {
  enable = true;
  configFile = ./loki-local-config.yaml;
};
```
Promtail is a tool made by the Loki team that sends logs into Loki. Create a
file called `promtail.yaml` in the same folder as `configuration.nix` with the
following contents:
```yaml
server:
  http_listen_port: 28183
  grpc_listen_port: 0

# where promtail keeps track of how far it has read in each source
positions:
  filename: /tmp/positions.yaml

# the Loki instance to push logs to
clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push

# read from the systemd journal and label each entry with its unit
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: chrysalis
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
```
Now we can add promtail to your `configuration.nix` by creating a systemd
service to run it with this snippet:
```nix
# hosts/chrysalis/configuration.nix
systemd.services.promtail = {
  description = "Promtail service for Loki";
  wantedBy = [ "multi-user.target" ];

  serviceConfig = {
    ExecStart = ''
      ${pkgs.grafana-loki}/bin/promtail --config.file ${./promtail.yaml}
    '';
  };
};
```
Now that you have this all set up, you can push this to your cluster with
nixops:
```console
$ nixops deploy -d home
```
Once that finishes, open up Grafana and configure a new Loki data source with
the URL `http://127.0.0.1:3100`:
![Loki Data Source
configuration](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_161610.png)
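Before you query anything, you can sanity-check that Loki and promtail
themselves came up, assuming the ports from the configs above:

```console
$ curl http://127.0.0.1:3100/ready
$ curl -s http://127.0.0.1:28183/metrics | head
```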
Now that you have Loki set up, let's query it! Open the Explore view in Grafana
again, choose Loki as the source, and enter in the query `{job="systemd-journal"}`:
![Loki
search](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_162043.png)
[You can also add Loki queries like this to dashboards! Loki also lets you query
by systemd unit with the `unit` field. If you wanted to search for logs from
`foo.service`, you would need a query that looks something like
`{job="systemd-journal", unit="foo.service"}`. You can do many more complicated
things with Loki. Look <a
href="https://grafana.com/docs/grafana/latest/datasources/loki/#search-expression">here</a>
for more information on what you can query. As of the time of writing this
blogpost, you are unable to make Grafana alerts based on Loki queries as far as
I am aware.](conversation://Mara/hacker)
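For example, one of those more complicated things is a line filter. A query
like this (with a hypothetical unit name) only returns journal lines from that
unit that contain the word "error":

```
{job="systemd-journal", unit="foo.service"} |= "error"
```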
---
This barely scrapes the surface of what you can accomplish with a setup like
this. With fancier setups you can alert on the rate at which metrics change. I
plan to make NixOS modules to make this setup easier in the future. There is
also a set of options in
[services.grafana.provision](https://search.nixos.org/options?channel=20.09&from=0&size=30&sort=relevance&query=grafana.provision)
that can make it easier to automagically set up Grafana with per-host
dashboards, alerts and all of the data sources that are outlined in this post.
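As a taste of that, a minimal provisioning sketch (assuming the 20.09
provisioning options and the ports used in this post) could look something like
this:

```nix
# hosts/chrysalis/configuration.nix
services.grafana.provision = {
  enable = true;
  datasources = [
    {
      name = "Prometheus";
      type = "prometheus";
      url = "http://127.0.0.1:9001";
    }
    {
      name = "Loki";
      type = "loki";
      url = "http://127.0.0.1:3100";
    }
  ];
};
```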
The setup in this post is quite meager, but it should be enough to get you
started with whatever you need to monitor. Adding Prometheus metrics to your
services will go a long way towards being able to better monitor things in
production. Do not be afraid to experiment!

@@ -218,7 +218,7 @@ a:hover {
overflow: hidden;
}
.hack h1:after {
content: "====================================================================================================";
content: "===============================================================================================================================================================";
position: absolute;
bottom: 10px;
left: 0;
@@ -315,7 +315,7 @@ a:hover {
margin: 20px 0;
}
.hack hr:after {
content: "----------------------------------------------------------------------------------------------------";
content: "---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------";
position: absolute;
top: 0;
left: 0;