---
title: Introducing Lokahi
date: 2018-02-08
github_issue: https://github.com/Xe/lokahi/issues/15
---

# Introducing Lokahi

This week at Heroku, there was a hackweek. I decided to tackle a few problems
at once, and this is the result. The two big things I wanted to tackle were
building a scalable HTTP health checking service and unlocking the "flow" state
of consciousness to make developing, understanding and improving this project a
lot easier.

## lokahi

Lokahi is an HTTP service uptime checking and notification service. Currently
lokahi does very little. Given a URL and a webhook URL, lokahi runs checks
against that URL every minute and ensures it's up. If the URL goes down, or the
health workers have trouble reaching it, the service is flagged as down and a
webhook is sent out.

### Stack

| What      | Role          |
| :-------- | :------------ |
| Postgres  | Database      |
| Go        | Language      |
| [Twirp](https://twitchtv.github.io/twirp/docs/intro.html) | API layer |
| Protobuf  | Serialization |
| Nats      | Message queue |
| Cobra     | CLI           |

### Components

Interrelation graph:

![interrelation graph of lokahi components, see /static/img/lokahi.dot for the graphviz](/static/img/lokahi.png)

#### lokahictl

The command line interface. It currently outputs everything in JSON and has a
few subcommands:

```console
$ ./bin/lokahictl
See https://github.com/Xe/lokahi for more information

Usage:
  lokahictl [command]

Available Commands:
  create      creates a check
  create_load creates a bunch of checks
  delete      deletes a check
  get         dumps information about a check
  help        Help about any command
  list        lists all checks that you have permission to access
  put         puts updates to a check
  run         runs a check
  runstats    gets performance information

Flags:
  -h, --help            help for lokahictl
      --server string   http url of the lokahid instance (default "http://AzureDiamond:hunter2@127.0.0.1:24253")

Use "lokahictl [command] --help" for more information about a command.
```

Each of these subcommands has its own help output, and most of them take
additional flags.

#### lokahid

This is the main API server. It exposes the twirp services defined in [`xe.github.lokahi`](https://github.com/Xe/lokahi/blob/master/rpc/lokahi/lokahi.proto)
and [`xe.github.lokahi.admin`](https://github.com/Xe/lokahi/blob/master/rpc/lokahiadmin/lokahiadmin.proto).
It is configured using environment variables like so:

```shell
# Username and password to use for checking authentication
# http://bash.org/?244321
USERPASS=AzureDiamond:hunter2

# Postgres database URL in heroku-ish format
DATABASE_URL=postgres://postgres:hunter2@127.0.0.1:5432/postgres?sslmode=disable

# Nats queue URL
NATS_URL=nats://127.0.0.1:4222

# TCP port to listen on for HTTP traffic
PORT=9001
```

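For a taste of what the http+twirp edge looks like from the client side, here
is a rough sketch of calling lokahid with a Twirp-generated client. The
`Checks` service, `Get` method and `CheckID` message here are illustrative
guesses, not copied from the proto files; the real definitions live in the
links above.

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/Xe/lokahi/rpc/lokahi"
)

func main() {
	// Twirp generates New<Service>ProtobufClient constructors. The service
	// and method names below are assumptions for illustration only; check
	// lokahi.proto for the real ones.
	client := lokahi.NewChecksProtobufClient(
		"http://AzureDiamond:hunter2@127.0.0.1:24253", &http.Client{})

	check, err := client.Get(context.Background(),
		&lokahi.CheckID{Id: "a5c7179a-0d3a-11e8-b53d-8faa88cfa70c"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("check: %s", check.String())
}
```
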
Every minute, lokahid scans for every check that is set to run minutely and
runs it. Running checks at any interval other than minutely is currently
unsupported.

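Conceptually, that scan loop boils down to something like the sketch below.
The `checks(id, every)` table layout and the bare-ID message body are
assumptions for illustration; the real lokahid exchanges protobuf messages
over nats.

```go
package main

import (
	"log"
	"time"

	"github.com/jmoiron/sqlx"
	nats "github.com/nats-io/go-nats"
)

// scanAndDispatch finds every minutely check and fans it out to the
// healthworkers via the check.run nats queue.
func scanAndDispatch(db *sqlx.DB, nc *nats.Conn) {
	for range time.Tick(time.Minute) {
		var ids []string
		// Hypothetical schema: every is the check interval in seconds.
		if err := db.Select(&ids, "SELECT id FROM checks WHERE every = 60"); err != nil {
			log.Printf("scan failed: %v", err)
			continue
		}
		for _, id := range ids {
			if err := nc.Publish("check.run", []byte(id)); err != nil {
				log.Printf("dispatch failed for %s: %v", id, err)
			}
		}
	}
}
```
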
#### healthworker

healthworker listens on the nats queue `check.run` and returns health
information about the service being checked.

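The heart of a worker like this is small. Here's a stripped-down sketch,
assuming (for illustration only) that the message body is a bare URL; the real
worker exchanges protobuf messages and writes results to postgres.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"time"

	nats "github.com/nats-io/go-nats"
)

func main() {
	nc, err := nats.Connect(os.Getenv("NATS_URL"))
	if err != nil {
		log.Fatal(err)
	}
	// Queue-subscribe so multiple healthworker replicas share the work.
	_, err = nc.QueueSubscribe("check.run", "healthworkers", func(m *nats.Msg) {
		start := time.Now()
		resp, err := http.Get(string(m.Data))
		status := 0
		if err == nil {
			status = resp.StatusCode
			resp.Body.Close()
		}
		latency := time.Since(start)
		log.Printf("%s -> %d in %s", m.Data, status, latency)
		// Reply on the inbox the requester is listening on.
		if m.Reply != "" {
			nc.Publish(m.Reply, []byte(latency.String()))
		}
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever
}
```
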
#### webhookworker

webhookworker listens on the nats queue `webhook.egress` and sends webhooks
based on the input it's given.

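Sending the webhook itself is a plain HTTP POST of the protobuf-encoded check
status, with the headers listed in the Webhooks section below. A sketch:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/Xe/lokahi/rpc/lokahi"
	"github.com/golang/protobuf/proto"
)

// postWebhook delivers a protobuf-encoded CheckStatus to a webhook URL the
// way webhookworker does: a POST with protobuf content-type headers.
func postWebhook(url string, st *lokahi.CheckStatus) error {
	body, err := proto.Marshal(st)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Accept", "application/protobuf")
	req.Header.Set("Content-Type", "application/protobuf")
	req.Header.Set("User-Agent", "lokahi/dev (+https://github.com/Xe/lokahi)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("webhook got status %d", resp.StatusCode)
	}
	return nil
}
```
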
### Challenges Faced During Development

#### ORM Issues

Initially, I implemented this using [gorm](https://github.com/jinzhu/gorm) and
started to run into a lot of problems when using it in anything but small-scale
circumstances. Gorm spun up way too many database connections (as many as a new
one for every operation!) and quickly exhausted postgres' pool of client
connections.

I rewrote this to use [`database/sql`](https://godoc.org/database/sql) and
[`sqlx`](https://godoc.org/github.com/jmoiron/sqlx), and all of the tests
passed the first time I ran them, no joke.

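The nice part about the `database/sql` + `sqlx` combination is that
connections come from one explicitly bounded pool and the queries stay
visible. A minimal sketch, with a made-up `checks` table and struct tags
standing in for the real schema:

```go
package main

import (
	"log"
	"os"

	"github.com/jmoiron/sqlx"
	_ "github.com/lib/pq"
)

// Check mirrors a hypothetical checks table; the real schema lives in the
// lokahi repo's migrations.
type Check struct {
	ID         string `db:"id"`
	URL        string `db:"url"`
	WebhookURL string `db:"webhook_url"`
}

func main() {
	db, err := sqlx.Connect("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	// One shared pool with a hard cap, instead of a new connection for
	// every operation.
	db.SetMaxOpenConns(50)

	var c Check
	if err := db.Get(&c, "SELECT id, url, webhook_url FROM checks WHERE id = $1",
		"a5c7179a-0d3a-11e8-b53d-8faa88cfa70c"); err != nil {
		log.Fatal(err)
	}
	log.Printf("loaded check %s -> %s", c.ID, c.URL)
}
```
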
#### Scaling to 50,000 Checks

This one was actually a lot harder than I thought it would be, and not for the
reasons I expected. The main thing I discovered while trying to scale this was
that I was putting way too much load on the database, way too quickly.

The solution was to use [bundler](https://godoc.org/google.golang.org/api/support/bundler)
to batch-write the most frequently written database items (see [here](https://github.com/Xe/lokahi/blob/7fc03120f731def3a351ddd516430feb635345b4/internal/lokahiadminserver/local_run.go#L245)).
Even then, [limiting the number of open database connections](https://godoc.org/database/sql#DB.SetMaxOpenConns)
was also needed in order to scale to the full 50,000 checks required for this
to exist as more than a proof of concept.

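Here's roughly what the bundler pattern looks like, with a made-up result type
standing in for lokahi's real one: instead of one INSERT per check result,
results accumulate until a count or time threshold trips, and then they get
written in one batch.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/api/support/bundler"
)

// checkResult is a hypothetical stand-in for lokahi's per-check result row.
type checkResult struct {
	CheckID string
	Latency time.Duration
	Code    int
}

func main() {
	// The handler receives a batch (a slice of the example item's type) and
	// can write all rows in a single INSERT instead of one round-trip each.
	b := bundler.NewBundler((*checkResult)(nil), func(items interface{}) {
		batch := items.([]*checkResult)
		log.Printf("flushing %d results in one database write", len(batch))
		// A batched INSERT INTO check_results ... would go here.
	})
	b.DelayThreshold = time.Second // flush at least once a second
	b.BundleCountThreshold = 100   // or whenever 100 results pile up

	for i := 0; i < 250; i++ {
		// Size hint of 1 since only item counts matter here.
		if err := b.Add(&checkResult{CheckID: "demo", Code: 200}, 1); err != nil {
			log.Fatal(err)
		}
	}
	b.Flush() // drain anything still buffered before exiting
}
```
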
The service can now handle 50,000 HTTP checks in a minute. The only part that
currently gets backed up is webhook egress, but that is likely fixable with
further optimization of the HTTP checking and webhook egress paths.

### Basic Usage

To set up an instance of lokahi on a machine with [Docker Compose](https://docs.docker.com/compose/)
installed, create a docker-compose manifest with the following in it:

```yaml
version: "3.1"

services:
  # The postgres database where all lokahi data is stored.
  db:
    image: postgres:alpine
    restart: always
    environment:
      POSTGRES_PASSWORD: hunter2
    command: postgres -c max_connections=1000

  # The message queue for lokahid and its workers.
  nats:
    image: nats:1.0.4

  # The service that runs http healthchecks. This is its own service so it can
  # be scaled independently.
  healthworker:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "db"
      - "nats"
    environment:
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:hunter2@db:5432/postgres?sslmode=disable
    command: healthworker

  # The service that sends out webhooks in response to http healthchecks. This
  # is also its own service so it can be scaled independently.
  webhookworker:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "db"
      - "nats"
    environment:
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:hunter2@db:5432/postgres?sslmode=disable
    command: webhookworker

  # The main API server. This is what you port forward to.
  lokahid:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "db"
      - "nats"
    environment:
      USERPASS: AzureDiamond:hunter2 # want ideas? https://strongpasswordgenerator.com/
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:hunter2@db:5432/postgres?sslmode=disable
      PORT: 24253
    ports:
      - 24253:24253

  # This is a sample webhook server that prints information about incoming
  # webhooks.
  samplehook:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "lokahid"
    environment:
      PORT: 9001
    command: sample_hook

  # Duke is a service that gets approximately 50% uptime by changing between up
  # and down every minute. When it's up, it responds to every HTTP request with
  # 200. When it's down, it responds to every HTTP request with 500.
  duke:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "samplehook"
    environment:
      PORT: 9001
    command: duke-of-york
```

Start this with `docker-compose up -d`.

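(If you're curious, duke-of-york's whole trick fits in a dozen lines. This
sketch is an illustration of the behavior described in the manifest comments
above, not the actual source.)

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Up on even minutes, down on odd ones: roughly 50% uptime.
		if time.Now().Minute()%2 == 0 {
			fmt.Fprintln(w, "OK")
			return
		}
		http.Error(w, "oh no", http.StatusInternalServerError)
	})
	http.ListenAndServe(":9001", nil)
}
```
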
#### Configuration

Open `~/.lokahictl.hcl` and enter the following:

```hcl
server = "http://AzureDiamond:hunter2@127.0.0.1:24253"
```

Save this, and lokahictl is now configured to work with the local copy of
lokahi.

#### Creating a check

To create a check against duke that reports to samplehook:

```console
$ lokahictl create \
    --every 60 \
    --webhook-url http://samplehook:9001/twirp/github.xe.lokahi.Webhook/Handle \
    --url http://duke:9001 \
    --playbook-url https://github.com/Xe/lokahi/wiki/duke-of-york-Playbook
{
  "id": "a5c7179a-0d3a-11e8-b53d-8faa88cfa70c",
  "url": "http://duke:9001",
  "webhook_url": "http://samplehook:9001/twirp/github.xe.lokahi.Webhook/Handle",
  "every": 60,
  "playbook_url": "https://github.com/Xe/lokahi/wiki/duke-of-york-Playbook"
}
```

Now attach to samplehook's logs and wait for it:

```console
$ docker-compose logs -f samplehook
2018/02/09 06:27:15 check id: a5c7179a-0d3a-11e8-b53d-8faa88cfa70c,
state: DOWN, latency: 2.265561ms, status code: 500,
playbook url: https://github.com/Xe/lokahi/wiki/duke-of-york-Playbook
```

### Webhooks

Webhooks get an HTTP POST of a protobuf-encoded [`xe.github.lokahi.CheckStatus`](https://github.com/Xe/lokahi/blob/13bc98ff0665ab13044f08d51ed2141ca0c38647/rpc/lokahi/lokahi.proto#L83)
with the following additional HTTP headers:

| Key            | Value                                        |
| :------------- | :------------------------------------------- |
| `Accept`       | `application/protobuf`                       |
| `Content-Type` | `application/protobuf`                       |
| `User-Agent`   | `lokahi/dev (+https://github.com/Xe/lokahi)` |

Webhook server implementations should probably store check IDs in a database of
some kind and trigger additional logic, such as PagerDuty API calls or similar
things, as in the sketch below. The lokahi standard distribution includes [Discord](https://github.com/Xe/lokahi/tree/master/cmd/discord_hook)
and [Slack](https://github.com/Xe/lokahi/tree/master/cmd/slack_hook) webhook
receivers.

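Here's a sketch of a minimal receiver. Note that the bundled sample uses a
Twirp `Webhook` service, while this is a bare `net/http` handler that does the
same protobuf decode; the logging stands in for whatever paging logic you'd
want.

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"

	"github.com/Xe/lokahi/rpc/lokahi"
	"github.com/golang/protobuf/proto"
)

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		body, err := ioutil.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var cs lokahi.CheckStatus
		if err := proto.Unmarshal(body, &cs); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// This is where you'd look up the check ID and page someone.
		log.Printf("got check status: %s", cs.String())
	})
	log.Fatal(http.ListenAndServe(":9001", nil))
}
```
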
JSON webhook support is not currently implemented, but is being tracked at
[this github issue](https://github.com/Xe/lokahi/issues/4).

### Call for Contributions

Lokahi is pretty great as it is, but to be even better it needs a bunch of
work, experience reports, and people willing to contribute to the project.

If making a better HTTP uptime service sounds like something you want to do
with your free time, please get involved! Ask questions, fix issues, help
newcomers, and help us all work together to make the best HTTP uptime service
we can.

---

Social media links for discussion on this article:

Mastodon:
Reddit:
Hacker News:
Twitter:

The graphviz source for the interrelation graph above (`/static/img/lokahi.dot`):

```dot
digraph G {
  lokahictl -> lokahid [ label="http+twirp" ]

  lokahid -> nats
  lokahid -> postgres

  nats -> webhookworker [ label="webhook.egress" ]
  webhookworker -> your_stack

  healthworker -> nats [ label="replies" ]
  nats -> healthworker [ label="check.run" ]
  healthworker -> postgres
  healthworker -> your_stack
}
```