Prometheus grafana loki nixos (#266)

* prometheus-grafana-loki-nixos post
* simple fix
2020-11-20 17:36:51 -05:00 · f5a86eafb8
parent 2dde44763d
commit f5a86eafb8
2 changed files with 397 additions and 2 deletions
--- a/blog/prometheus-grafana-loki-nixos-2020-11-20.markdown
+++ b/blog/prometheus-grafana-loki-nixos-2020-11-20.markdown
@@ -0,0 +1,395 @@
 ---
 title: How to Setup Prometheus, Grafana and Loki on NixOS
 date: 2020-11-20
 tags:
  - nixos
  - prometheus
  - grafana
  - loki
  - promtail
 ---
 # How to Setup Prometheus, Grafana and Loki on NixOS
 When setting up services on your home network, sometimes you have questions
 along the lines of "how do I know that things are working?". In this blogpost we
 will go over a few tools that you can use to monitor and visualize your machine
 state so you can answer that. Specifically we are going to use the following
 tools to do this:
 - [Grafana](https://grafana.com/) for creating pretty graphs and managing
  alerts
 - [Prometheus](https://prometheus.io/) for storing metrics and as a common
  metrics format
 - [Prometheus node_exporter](https://github.com/prometheus/node_exporter) for
  deriving metrics from system state
 - [Loki](https://grafana.com/oss/loki/) as a central log storage point
 - [promtail](https://grafana.com/docs/loki/latest/clients/promtail/) to push
  logs to Loki
 Let's get going!
 [Something to note: in here you might see domains using the `.pele` top-level
 domain. This domain will likely not be available on your home network. See <a
 href="/blog/series/site-to-site-wireguard">this series</a> on how to set up
 something similar for your home network. If you don't have such a setup, replace
 anything that ends in `.pele` with whatever you normally use for
 this.](conversation://Mara/hacker)
 ## Grafana
 Grafana is a service that handles graphing and alerting. It also has some nice
 tools to create dashboards. Here we will be using it for a few main purposes:
 - Exploring what metrics are available
 - Reading system logs
 - Making graphs and dashboards
 - Creating alerts over metrics or lack of metrics
 Let's configure Grafana on a machine. Open that machine's `configuration.nix` in
 an editor and add the following to it:
 ```nix
 # hosts/chrysalis/configuration.nix
 { config, pkgs, ... }: {
  # grafana configuration
  services.grafana = {
    enable = true;
    domain = "grafana.pele";
    port = 2342;
    addr = "127.0.0.1";
  };
  # nginx reverse proxy
  services.nginx.virtualHosts.${config.services.grafana.domain} = {
    locations."/" = {
        proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}";
        proxyWebsockets = true;
    };
  };
 }
 ```
 [If you have a <a href="/blog/site-to-site-wireguard-part-3-2019-04-11">custom
 TLS Certificate Authority</a>, you can set up HTTPS for this deployment. See <a
 href="https://github.com/Xe/nixos-configs/blob/master/common/sites/grafana.akua.nix">here</a>
 for an example of doing this. If this server is exposed to the internet, you can
 use a certificate from <a
 href="https://nixos.wiki/wiki/Nginx#TLS_reverse_proxy">Let's Encrypt</a> instead
 of your own Certificate Authority.](conversation://Mara/hacker)
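 As a rough sketch of the Let's Encrypt route, the virtual host could request
 its own certificate through the NixOS ACME integration. This assumes the host
 is reachable from the internet on ports 80 and 443 and uses the option names
 from NixOS 20.09; `grafana.example.com` and the email address are placeholders,
 not values from this setup. It would take the place of the plain HTTP virtual
 host above:
 ```nix
 # hosts/chrysalis/configuration.nix
 { config, pkgs, ... }: {
   services.nginx.enable = true;
   # placeholder public name; Let's Encrypt cannot issue certificates for `.pele`
   services.nginx.virtualHosts."grafana.example.com" = {
     enableACME = true; # request a certificate from Let's Encrypt
     forceSSL = true;   # redirect plain HTTP to HTTPS
     locations."/" = {
       proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}";
       proxyWebsockets = true;
     };
   };
   security.acme.acceptTerms = true;
   security.acme.email = "you@example.com"; # placeholder address
   # the ACME HTTP challenge and HTTPS traffic need these ports open
   networking.firewall.allowedTCPPorts = [ 80 443 ];
 }
 ```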
 Then you will need to deploy it to your cluster with `nixops deploy`:
 ```console
 $ nixops deploy -d home
 ```
 Now open the Grafana server in your browser at http://grafana.pele and log in
 with the super secure default credentials of admin/admin. Grafana will ask you
 to change your password. Please change it to something other than admin.
 This is all of the setup we will do with Grafana for now. We will come back to
 it later.
 ## Prometheus
 > Prometheus was punished by the gods for giving the gift of knowledge to man. He
 > was cast into the bowels of the earth and pecked by birds.
 Oracle Turret, Portal 2
 Prometheus is a service that reads metrics from other services, stores them and
 allows you to search and aggregate them. Let's add it to our `configuration.nix`
 file:
 ```nix
 # hosts/chrysalis/configuration.nix
  services.prometheus = {
    enable = true;
    port = 9001;
  };
 ```
 Now let's deploy this config to the cluster with `nixops deploy`:
 ```console
 $ nixops deploy -d home
 ```
 And let's configure Grafana to read from Prometheus. Open Grafana and click on
 the gear to the left side of the page. The `Data Sources` tab should be active.
 If it is not active, click on `Data Sources`. Then click "add data source" and
 choose Prometheus. Set the URL to `http://127.0.0.1:9001` (or with whatever port
 you configured above) and leave everything set to the default values. Click
 "Save & Test". If there is an error, be sure to check the port number.
 ![The Grafana UI for adding a data
 source](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_145819.png)
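 If you would rather not click through the UI, the same data source can probably
 be declared in Nix with the `services.grafana.provision` options. This is a
 minimal sketch, assuming the option layout from NixOS 20.09 (later releases may
 have moved these options) and the Prometheus port configured above:
 ```nix
 # hosts/chrysalis/configuration.nix
   services.grafana.provision = {
     enable = true;
     datasources = [
       {
         name = "Prometheus";
         type = "prometheus";
         # follows whatever port services.prometheus listens on
         url = "http://127.0.0.1:${toString config.services.prometheus.port}";
       }
     ];
   };
 ```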
 Now let's start getting some data into Prometheus with the node exporter.
 ### Node Exporter Setup
 The Prometheus node exporter exposes a lot of information about systems, ranging
 from memory and disk usage to systemd service information. There are also
 some [other
 collectors](https://search.nixos.org/options?channel=20.09&query=prometheus.exporters+enable)
 you can set up based on your individual setup; however, we are going to enable
 only the node collector here.
 In your `configuration.nix`, add an exporters block and configure the node
 exporter under `services.prometheus`:
 ```nix
 # hosts/chrysalis/configuration.nix
  services.prometheus = {
    exporters = {
      node = {
        enable = true;
        enabledCollectors = [ "systemd" ];
        port = 9002;
      };
    };
  };
 ```
 Now we need to configure Prometheus to read metrics from this exporter. In your
 `configuration.nix`, add a `scrapeConfigs` block under `services.prometheus`
 that points to the node exporter we configured just now:
 ```nix
 # hosts/chrysalis/configuration.nix
  services.prometheus = {
    # ...
    scrapeConfigs = [
      {
        job_name = "chrysalis";
        static_configs = [{
          targets = [ "127.0.0.1:${toString config.services.prometheus.exporters.node.port}" ];
        }];
      }
    ];
    # ...
  };
  # ...
 ```
 [The complicated expression in the target above allows you to change the port of
 the node exporter and ensure that Prometheus will always be pointing at the
 right port!](conversation://Mara/hacker)
 Now we can deploy this to your cluster with nixops:
 ```console
 $ nixops deploy -d home
 ```
 Open the Explore tab in Grafana and type in the following expression:
 ```
 node_memory_MemFree_bytes
 ```
 and hit shift-enter (or click the "Run Query" button in the upper right side of
 the screen). You should see a graph showing you the amount of RAM that is free
 on the host, something like this:
 ![A graph of the amount of system memory that is available on the host
 chrysalis](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_150328.png)
 If you want to query other fields, you can type `node_` into the search box
 and autocomplete will show what is available. For a full list of what is
 available, open the node exporter's `/metrics` route in your browser and look
 through it.
 ## Grafana Dashboards
 Now that we have all of this information about our machine, let's create a
 little dashboard for it and set up a few alerts.
 Click on the plus icon on the left side of the Grafana UI to create a new
 dashboard. It will look something like this:
 ![An empty dashboard in
 Grafana](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_151205.png)
 In Grafana terminology, everything you see in a dashboard is inside a panel.
 Let's create a new panel to keep track of memory usage for our server. Click
 "Add New Panel" and you will get a screen that looks like this:
 ![A Grafana panel configuration
 screen](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_151609.png)
 Let's make this keep track of free memory. Write "Memory Free" in the panel
 title field on the right. Write the following query in the textbox next to the
 dropdown labeled "Metrics":
 ```
 node_memory_MemFree_bytes
 ```
 and set the legend to `{{job}}`. You should get a graph that looks something
 like this:
 ![A populated
 graph](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_152126.png)
 This will show you how much memory is free on each machine you are monitoring
 with Prometheus' node exporter. Now let's configure an alert for the amount of
 free memory being low (where "low" means less than 64 megabytes of RAM free).
 Hit save in the upper right corner of the Grafana UI and give your dashboard a
 name, such as "Home Cluster Status". Now open the "Memory Free" panel for
 editing (click on the name and then click "Edit"), click the "Alert" tab, and
 click the "Create Alert" button. Let's configure it to do the following:
 - Check if free memory gets below 64 megabytes (64000000 bytes)
 - Send the message "Running out of memory!" when the alert fires
 You can do that with a configuration like this:
 ![The above configuration input to the Grafana
 UI](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_153419.png)
 Save the changes to apply this config.
 [Wait a minute. Where will this alert go to?](conversation://Mara/hmm)
 It will only show up on the alerts page:
 ![The alerts page with memory free alerts
 configured](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_154027.png)
 But we can add a notification channel to customize this. Click on the
 Notification Channels tab and then click "New Channel". It should look something
 like this:
 ![Notification Channel
 configuration](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_154317.png)
 You can send notifications to many services, but let's send one to Discord this
 time. Acquire a Discord webhook link from somewhere and paste it in the Webhook
 URL field. Name it something like "Discord". It may also be a good idea to make
 this the default notification channel using the "Default" checkbox under the
 Notification Settings, so that our existing alert will show up in Discord when
 the system runs out of memory.
 You can configure other alerts like this so you can monitor any other node
 metrics you want.
 [You can also monitor for the _lack_ of data on particular metrics. If something
 that should always be reported suddenly isn't reported, it may be a good
 indicator that a server went down. You can also add other services to your
 `scrapeConfigs` settings so you can monitor things that expose metrics to
 Prometheus at `/metrics`.](conversation://Mara/hacker)
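 For example, scraping a second machine and another local service could look
 something like this. The host name `keanu.pele`, the job names and the port
 `8080` are made up for illustration, and the sketch assumes the other machine
 runs the node exporter on the same port as this one and allows that port
 through its firewall:
 ```nix
 # hosts/chrysalis/configuration.nix
   services.prometheus = {
     # ...
     scrapeConfigs = [
       # ... the existing "chrysalis" job goes here ...
       {
         # node exporter on another machine (hypothetical host name)
         job_name = "keanu";
         static_configs = [{
           targets = [ "keanu.pele:${toString config.services.prometheus.exporters.node.port}" ];
         }];
       }
       {
         # any service that serves Prometheus metrics at /metrics (hypothetical)
         job_name = "my-service";
         static_configs = [{
           targets = [ "127.0.0.1:8080" ];
         }];
       }
     ];
     # ...
   };
 ```
 With the legend set to `{{job}}`, each job should then show up as its own line
 on the "Memory Free" panel.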
 Now that we have metrics configured, let's enable Loki for logging.
 ## Loki
 Loki is a log aggregator created by the people behind Grafana. Here we will use
 it as a target for all system logs. Unfortunately, the Loki NixOS module is very
 basic at the moment, so we will need to configure it with our own custom YAML
 file. Create a file in your `configuration.nix` folder called `loki.yaml` and
 copy in the config from [this
 gist](https://gist.github.com/Xe/c3c786b41ec2820725ee77a7af551225).
 Then enable Loki with your config in your `configuration.nix` file:
 ```nix
 # hosts/chrysalis/configuration.nix
  services.loki = {
    enable = true;
    configFile = ./loki.yaml;
  };
 ```
 Promtail is a tool made by the Loki team that sends logs into Loki. Create a
 file called `promtail.yaml` in the same folder as `configuration.nix` with the
 following contents:
 ```yaml
 server:
  http_listen_port: 28183
  grpc_listen_port: 0
 positions:
  filename: /tmp/positions.yaml
 clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push
 scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: chrysalis
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
 ```
 Now we can add promtail to your `configuration.nix` by creating a systemd
 service to run it with this snippet:
 ```nix
 # hosts/chrysalis/configuration.nix
  systemd.services.promtail = {
    description = "Promtail service for Loki";
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      ExecStart = ''
        ${pkgs.grafana-loki}/bin/promtail --config.file ${./promtail.yaml}
      '';
    };
  };
 ```
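 One possible variation, which is not what this post deploys: because YAML is a
 superset of JSON, the promtail config can also be generated from a Nix
 expression instead of a hand-written file. That lets the `host` label follow
 `networking.hostName` on each machine. A sketch under that assumption (the
 `promtailConfig` name is made up here):
 ```nix
 # hosts/chrysalis/configuration.nix
 { config, pkgs, ... }:
 let
   # hypothetical helper: the same settings as promtail.yaml, rendered as JSON
   promtailConfig = pkgs.writeText "promtail.json" (builtins.toJSON {
     server = { http_listen_port = 28183; grpc_listen_port = 0; };
     positions.filename = "/tmp/positions.yaml";
     clients = [{ url = "http://127.0.0.1:3100/loki/api/v1/push"; }];
     scrape_configs = [{
       job_name = "journal";
       journal = {
         max_age = "12h";
         labels = {
           job = "systemd-journal";
           # taken from the machine's own config instead of being hard-coded
           host = config.networking.hostName;
         };
       };
       relabel_configs = [{
         source_labels = [ "__journal__systemd_unit" ];
         target_label = "unit";
       }];
     }];
   });
 in {
   systemd.services.promtail = {
     description = "Promtail service for Loki";
     wantedBy = [ "multi-user.target" ];
     serviceConfig.ExecStart =
       "${pkgs.grafana-loki}/bin/promtail --config.file ${promtailConfig}";
   };
 }
 ```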
 Now that you have this all set up, you can push this to your cluster with
 nixops:
 ```console
 $ nixops deploy -d home
 ```
 Once that finishes, open up Grafana and configure a new Loki data source with
 the URL `http://127.0.0.1:3100`:
 ![Loki Data Source
 configuration](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_161610.png)
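 As with other data sources, this one can probably be declared in Nix instead of
 through the UI. A minimal sketch, assuming the `services.grafana.provision`
 option layout from NixOS 20.09 and Loki listening on its port from the URL
 above; if you already have a `datasources` list, add the entry to it rather
 than defining a second list:
 ```nix
 # hosts/chrysalis/configuration.nix
   services.grafana.provision = {
     enable = true;
     datasources = [
       # ... any existing data sources ...
       {
         name = "Loki";
         type = "loki";
         url = "http://127.0.0.1:3100";
       }
     ];
   };
 ```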
 Now that you have Loki set up, let's query it! Open the Explore view in Grafana
 again, choose Loki as the source, and enter in the query `{job="systemd-journal"}`:
 ![Loki
 search](https://cdn.christine.website/file/christine-static/blog/Screenshot_20201120_162043.png)
 [You can also add Loki queries like this to dashboards! Loki also lets you query by
 systemd unit with the `unit` field. If you wanted to search for logs from
 `foo.service`, you would need a query that looks something like
 `{job="systemd-journal", unit="foo.service"}`. You can do many more complicated
 things with Loki. Look <a
 href="https://grafana.com/docs/grafana/latest/datasources/loki/#search-expression">here
 </a> for more information on what you can query. As far as I am aware, at the
 time of writing you cannot create Grafana alerts based on Loki
 queries.](conversation://Mara/hacker)
 ---
 This barely scrapes the surface of what you can accomplish with a setup like
 this. With fancier setups you can alert on how quickly a metric changes. I
 plan to make NixOS modules to make this setup easier in the future. There is
 also a set of options in
 [services.grafana.provision](https://search.nixos.org/options?channel=20.09&from=0&size=30&sort=relevance&query=grafana.provision)
 that can make it easier to automagically set up Grafana with per-host
 dashboards, alerts and all of the data sources that are outlined in this post.
 The setup in this post is quite meager, but it should be enough to get you
 started with whatever you need to monitor. Adding Prometheus metrics to your
 services will go a long way in terms of being able to better monitor things in
 production. Do not be afraid to experiment!
--- a/css/hack.css
+++ b/css/hack.css
@@ -218,7 +218,7 @@ a:hover {
  overflow: hidden;
 }
 .hack h1:after {
-  content: "====================================================================================================";
+  content: "===============================================================================================================================================================";
  position: absolute;
  bottom: 10px;
  left: 0;
@@ -315,7 +315,7 @@ a:hover {
  margin: 20px 0;
 }
 .hack hr:after {
-  content: "----------------------------------------------------------------------------------------------------";
+  content: "---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------";
  position: absolute;
  top: 0;
  left: 0;