---
title: Introducing Lokahi
date: 2018-02-08
github_issue: https://github.com/Xe/lokahi/issues/15
---

# Introducing Lokahi

This week at Heroku, there was a hackweek. I decided to tackle a few problems
at once, and this is the result. The two big things I wanted to tackle were
building a scalable HTTP health checking service and unlocking the "flow" state
of consciousness to make developing, understanding and improving this project a
lot easier.

## lokahi

Lokahi is an HTTP service uptime checking and notification service. Currently
lokahi does very little. Given a URL and a webhook URL, lokahi runs checks
against that URL every minute and ensures it's up. If the URL goes down, or the
health workers have trouble reaching it, the service is flagged as down and a
webhook is sent out.

### Stack

| What      | Role          |
| :-------- | :------------ |
| Postgres  | Database      |
| Go        | Language      |
| [Twirp](https://twitchtv.github.io/twirp/docs/intro.html) | API layer |
| Protobuf  | Serialization |
| Nats      | Message queue |
| Cobra     | CLI           |

### Components

Interrelation graph:

![interrelation graph of lokahi components, see /static/img/lokahi.dot for the graphviz](/static/img/lokahi.png)

#### lokahictl

The command line interface. It currently outputs everything in JSON and has a
few subcommands:

```console
$ ./bin/lokahictl
See https://github.com/Xe/lokahi for more information

Usage:
  lokahictl [command]

Available Commands:
  create      creates a check
  create_load creates a bunch of checks
  delete      deletes a check
  get         dumps information about a check
  help        Help about any command
  list        lists all checks that you have permission to access
  put         puts updates to a check
  run         runs a check
  runstats    gets performance information

Flags:
  -h, --help            help for lokahictl
      --server string   http url of the lokahid instance (default "http://AzureDiamond:hunter2@127.0.0.1:24253")

Use "lokahictl [command] --help" for more information about a command.
```

Each of these subcommands has its own help output, and most of them take
additional flags.

#### lokahid

This is the main API server. It exposes the twirp services defined in [`xe.github.lokahi`](https://github.com/Xe/lokahi/blob/master/rpc/lokahi/lokahi.proto)
and [`xe.github.lokahi.admin`](https://github.com/Xe/lokahi/blob/master/rpc/lokahiadmin/lokahiadmin.proto).
It is configured using environment variables like so:

```shell
# Username and password to use for checking authentication
# http://bash.org/?244321
USERPASS=AzureDiamond:hunter2

# Postgres database URL in heroku-ish format
DATABASE_URL=postgres://postgres:hunter2@127.0.0.1:5432/postgres?sslmode=disable

# Nats queue URL
NATS_URL=nats://127.0.0.1:4222

# TCP port to listen on for HTTP traffic
PORT=9001
```

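For a taste of what the http+twirp edge looks like from the client side, here
is a rough sketch of calling lokahid with a Twirp-generated client. The
`Checks` service, `Get` method and `CheckID` message here are illustrative
guesses, not copied from the proto files; the real definitions live in the
links above.

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/Xe/lokahi/rpc/lokahi"
)

func main() {
	// Twirp generates New<Service>ProtobufClient constructors. The service
	// and method names below are assumptions for illustration only; check
	// lokahi.proto for the real ones.
	client := lokahi.NewChecksProtobufClient(
		"http://AzureDiamond:hunter2@127.0.0.1:24253", &http.Client{})

	check, err := client.Get(context.Background(),
		&lokahi.CheckID{Id: "a5c7179a-0d3a-11e8-b53d-8faa88cfa70c"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("check: %s", check.String())
}
```
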
Every minute, lokahid scans for every check that is set to run minutely and
runs it. Running checks at any interval other than minutely is currently
unsupported.

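Conceptually, that scan loop boils down to something like the sketch below.
The `checks(id, every)` table layout and the bare-ID message body are
assumptions for illustration; the real lokahid exchanges protobuf messages
over nats.

```go
package main

import (
	"log"
	"time"

	"github.com/jmoiron/sqlx"
	nats "github.com/nats-io/go-nats"
)

// scanAndDispatch finds every minutely check and fans it out to the
// healthworkers via the check.run nats queue.
func scanAndDispatch(db *sqlx.DB, nc *nats.Conn) {
	for range time.Tick(time.Minute) {
		var ids []string
		// Hypothetical schema: every is the check interval in seconds.
		if err := db.Select(&ids, "SELECT id FROM checks WHERE every = 60"); err != nil {
			log.Printf("scan failed: %v", err)
			continue
		}
		for _, id := range ids {
			if err := nc.Publish("check.run", []byte(id)); err != nil {
				log.Printf("dispatch failed for %s: %v", id, err)
			}
		}
	}
}
```
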
#### healthworker

healthworker listens on the nats queue `check.run` and returns health
information about the service being checked.

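The heart of a worker like this is small. Here's a stripped-down sketch,
assuming (for illustration only) that the message body is a bare URL; the real
worker exchanges protobuf messages and writes results to postgres.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"time"

	nats "github.com/nats-io/go-nats"
)

func main() {
	nc, err := nats.Connect(os.Getenv("NATS_URL"))
	if err != nil {
		log.Fatal(err)
	}
	// Queue-subscribe so multiple healthworker replicas share the work.
	_, err = nc.QueueSubscribe("check.run", "healthworkers", func(m *nats.Msg) {
		start := time.Now()
		resp, err := http.Get(string(m.Data))
		status := 0
		if err == nil {
			status = resp.StatusCode
			resp.Body.Close()
		}
		latency := time.Since(start)
		log.Printf("%s -> %d in %s", m.Data, status, latency)
		// Reply on the inbox the requester is listening on.
		if m.Reply != "" {
			nc.Publish(m.Reply, []byte(latency.String()))
		}
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever
}
```
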
#### webhookworker

webhookworker listens on the nats queue `webhook.egress` and sends webhooks
based on the input it's given.

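Sending the webhook itself is a plain HTTP POST of the protobuf-encoded check
status, with the headers listed in the Webhooks section below. A sketch:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/Xe/lokahi/rpc/lokahi"
	"github.com/golang/protobuf/proto"
)

// postWebhook delivers a protobuf-encoded CheckStatus to a webhook URL the
// way webhookworker does: a POST with protobuf content-type headers.
func postWebhook(url string, st *lokahi.CheckStatus) error {
	body, err := proto.Marshal(st)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Accept", "application/protobuf")
	req.Header.Set("Content-Type", "application/protobuf")
	req.Header.Set("User-Agent", "lokahi/dev (+https://github.com/Xe/lokahi)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("webhook got status %d", resp.StatusCode)
	}
	return nil
}
```
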
### Challenges Faced During Development

#### ORM Issues

Initially, I implemented this using [gorm](https://github.com/jinzhu/gorm) and
started to run into a lot of problems when using it in anything but small-scale
circumstances. Gorm spun up way too many database connections (as many as a new
one for every operation!) and quickly exhausted postgres' pool of client
connections.

I rewrote this to use [`database/sql`](https://godoc.org/database/sql) and
[`sqlx`](https://godoc.org/github.com/jmoiron/sqlx), and all of the tests
passed the first time I ran them, no joke.

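The nice part about the `database/sql` + `sqlx` combination is that
connections come from one explicitly bounded pool and the queries stay
visible. A minimal sketch, with a made-up `checks` table and struct tags
standing in for the real schema:

```go
package main

import (
	"log"
	"os"

	"github.com/jmoiron/sqlx"
	_ "github.com/lib/pq"
)

// Check mirrors a hypothetical checks table; the real schema lives in the
// lokahi repo's migrations.
type Check struct {
	ID         string `db:"id"`
	URL        string `db:"url"`
	WebhookURL string `db:"webhook_url"`
}

func main() {
	db, err := sqlx.Connect("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	// One shared pool with a hard cap, instead of a new connection for
	// every operation.
	db.SetMaxOpenConns(50)

	var c Check
	if err := db.Get(&c, "SELECT id, url, webhook_url FROM checks WHERE id = $1",
		"a5c7179a-0d3a-11e8-b53d-8faa88cfa70c"); err != nil {
		log.Fatal(err)
	}
	log.Printf("loaded check %s -> %s", c.ID, c.URL)
}
```
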
#### Scaling to 50,000 Checks

This one was actually a lot harder than I thought it would be, and not for the
reasons I expected. The main thing I discovered while trying to scale this was
that I was putting way too much load on the database, way too quickly.

The solution was to use [bundler](https://godoc.org/google.golang.org/api/support/bundler)
to batch-write the most frequently written database items (see [here](https://github.com/Xe/lokahi/blob/7fc03120f731def3a351ddd516430feb635345b4/internal/lokahiadminserver/local_run.go#L245)).
Even then, [limiting the number of open database connections](https://godoc.org/database/sql#DB.SetMaxOpenConns)
was also needed in order to scale to the full 50,000 checks required for this
to exist as more than a proof of concept.

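Here's roughly what the bundler pattern looks like, with a made-up result type
standing in for lokahi's real one: instead of one INSERT per check result,
results accumulate until a count or time threshold trips, and then they get
written in one batch.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/api/support/bundler"
)

// checkResult is a hypothetical stand-in for lokahi's per-check result row.
type checkResult struct {
	CheckID string
	Latency time.Duration
	Code    int
}

func main() {
	// The handler receives a batch (a slice of the example item's type) and
	// can write all rows in a single INSERT instead of one round-trip each.
	b := bundler.NewBundler((*checkResult)(nil), func(items interface{}) {
		batch := items.([]*checkResult)
		log.Printf("flushing %d results in one database write", len(batch))
		// A batched INSERT INTO check_results ... would go here.
	})
	b.DelayThreshold = time.Second // flush at least once a second
	b.BundleCountThreshold = 100   // or whenever 100 results pile up

	for i := 0; i < 250; i++ {
		// Size hint of 1 since only item counts matter here.
		if err := b.Add(&checkResult{CheckID: "demo", Code: 200}, 1); err != nil {
			log.Fatal(err)
		}
	}
	b.Flush() // drain anything still buffered before exiting
}
```
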
The service can now handle 50,000 HTTP checks in a minute. The only part that
currently gets backed up is webhook egress, but that is likely fixable with
further optimization of the HTTP checking and webhook egress paths.

### Basic Usage

To set up an instance of lokahi on a machine with [Docker Compose](https://docs.docker.com/compose/)
installed, create a docker-compose manifest with the following in it:

```yaml
version: "3.1"

services:
  # The postgres database where all lokahi data is stored.
  db:
    image: postgres:alpine
    restart: always
    environment:
      POSTGRES_PASSWORD: hunter2
    command: postgres -c max_connections=1000

  # The message queue for lokahid and its workers.
  nats:
    image: nats:1.0.4

  # The service that runs http healthchecks. This is its own service so it can
  # be scaled independently.
  healthworker:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "db"
      - "nats"
    environment:
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:hunter2@db:5432/postgres?sslmode=disable
    command: healthworker

  # The service that sends out webhooks in response to http healthchecks. This
  # is also its own service so it can be scaled independently.
  webhookworker:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "db"
      - "nats"
    environment:
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:hunter2@db:5432/postgres?sslmode=disable
    command: webhookworker

  # The main API server. This is what you port forward to.
  lokahid:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "db"
      - "nats"
    environment:
      USERPASS: AzureDiamond:hunter2 # want ideas? https://strongpasswordgenerator.com/
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:hunter2@db:5432/postgres?sslmode=disable
      PORT: 24253
    ports:
      - 24253:24253

  # This is a sample webhook server that prints information about incoming
  # webhooks.
  samplehook:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "lokahid"
    environment:
      PORT: 9001
    command: sample_hook

  # Duke is a service that gets approximately 50% uptime by changing between up
  # and down every minute. When it's up, it responds to every HTTP request with
  # 200. When it's down, it responds to every HTTP request with 500.
  duke:
    image: xena/lokahi:latest
    restart: always
    depends_on:
      - "samplehook"
    environment:
      PORT: 9001
    command: duke-of-york
```

Start this with `docker-compose up -d`.

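(If you're curious, duke-of-york's whole trick fits in a dozen lines. This
sketch is an illustration of the behavior described in the manifest comments
above, not the actual source.)

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Up on even minutes, down on odd ones: roughly 50% uptime.
		if time.Now().Minute()%2 == 0 {
			fmt.Fprintln(w, "OK")
			return
		}
		http.Error(w, "oh no", http.StatusInternalServerError)
	})
	http.ListenAndServe(":9001", nil)
}
```
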
#### Configuration

Open `~/.lokahictl.hcl` and enter the following:

```hcl
server = "http://AzureDiamond:hunter2@127.0.0.1:24253"
```

Save this, and lokahictl is now configured to work with the local copy of
lokahi.

#### Creating a check

To create a check against duke that reports to samplehook:

```console
$ lokahictl create \
    --every 60 \
    --webhook-url http://samplehook:9001/twirp/github.xe.lokahi.Webhook/Handle \
    --url http://duke:9001 \
    --playbook-url https://github.com/Xe/lokahi/wiki/duke-of-york-Playbook
{
  "id": "a5c7179a-0d3a-11e8-b53d-8faa88cfa70c",
  "url": "http://duke:9001",
  "webhook_url": "http://samplehook:9001/twirp/github.xe.lokahi.Webhook/Handle",
  "every": 60,
  "playbook_url": "https://github.com/Xe/lokahi/wiki/duke-of-york-Playbook"
}
```

Now attach to samplehook's logs and wait for it:

```console
$ docker-compose logs -f samplehook
2018/02/09 06:27:15 check id: a5c7179a-0d3a-11e8-b53d-8faa88cfa70c,
state: DOWN, latency: 2.265561ms, status code: 500,
playbook url: https://github.com/Xe/lokahi/wiki/duke-of-york-Playbook
```

### Webhooks

Webhooks get an HTTP POST of a protobuf-encoded [`xe.github.lokahi.CheckStatus`](https://github.com/Xe/lokahi/blob/13bc98ff0665ab13044f08d51ed2141ca0c38647/rpc/lokahi/lokahi.proto#L83)
with the following additional HTTP headers:

| Key            | Value                                        |
| :------------- | :------------------------------------------- |
| `Accept`       | `application/protobuf`                       |
| `Content-Type` | `application/protobuf`                       |
| `User-Agent`   | `lokahi/dev (+https://github.com/Xe/lokahi)` |

Webhook server implementations should probably store check IDs in a database of
some kind and trigger additional logic, such as PagerDuty API calls or similar
things, as in the sketch below. The lokahi standard distribution includes [Discord](https://github.com/Xe/lokahi/tree/master/cmd/discord_hook)
and [Slack](https://github.com/Xe/lokahi/tree/master/cmd/slack_hook) webhook
receivers.

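Here's a sketch of a minimal receiver. Note that the bundled sample uses a
Twirp `Webhook` service, while this is a bare `net/http` handler that does the
same protobuf decode; the logging stands in for whatever paging logic you'd
want.

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"

	"github.com/Xe/lokahi/rpc/lokahi"
	"github.com/golang/protobuf/proto"
)

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		body, err := ioutil.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var cs lokahi.CheckStatus
		if err := proto.Unmarshal(body, &cs); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// This is where you'd look up the check ID and page someone.
		log.Printf("got check status: %s", cs.String())
	})
	log.Fatal(http.ListenAndServe(":9001", nil))
}
```
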
JSON webhook support is not currently implemented, but is being tracked at
[this github issue](https://github.com/Xe/lokahi/issues/4).

### Call for Contributions

Lokahi is pretty great as it is, but to be even better it needs a bunch of
work, experience reports, and people willing to contribute to the project.

If making a better HTTP uptime service sounds like something you want to do
with your free time, please get involved! Ask questions, fix issues, help
newcomers, and help us all work together to make the best HTTP uptime service
we can.

---

Social media links for discussion on this article:

Mastodon:
Reddit:
Hacker News:
Twitter:

The graphviz source for the interrelation graph above (`/static/img/lokahi.dot`):

```dot
digraph G {
  lokahictl -> lokahid [ label="http+twirp" ]

  lokahid -> nats
  lokahid -> postgres

  nats -> webhookworker [ label="webhook.egress" ]
  webhookworker -> your_stack

  healthworker -> nats [ label="replies" ]
  nats -> healthworker [ label="check.run" ]
  healthworker -> postgres
  healthworker -> your_stack
}
```