---
title: "pa'i Benchmarks"
date: 2020-03-26
series: olin
tags:
  - wasm
  - rust
  - golang
  - pahi
---

# pa'i Benchmarks

In my [last post][pahihelloworld] I mentioned that pa'i was faster than Olin's
cwa binary, which is written in Go, without giving any benchmarks. I've been
working on new ways to gather and visualize these benchmarks, and here they are.

[pahihelloworld]: https://christine.website/blog/pahi-hello-world-2020-02-22

Benchmarking WebAssembly implementations is slightly hard. A lot of existing
benchmark tools simply do not run in WebAssembly as is, not to mention inside
the Olin ABI. However, I have created a few workloads that I feel represent
common tasks that pa'i (and later wasmcloud) will run:

- compressing data with [Snappy][snappy]
- parsing JSON
- parsing YAML
- recursive Fibonacci number calculation
- BLAKE2 hashing

As always, if you don't trust my numbers, you don't have to. Commands will be
given to run these benchmarks on your own hardware. These may not be the most
scientifically accurate benchmarks possible, but they should help to give a
reasonable idea of the speed gains from using Rust instead of Go.

You can run these benchmarks in the Docker image `xena/pahi`. You may need to
replace `./result/` with `/` when running the commands inside Docker.

```console
$ docker run --rm -it xena/pahi bash -l
```

[snappy]: https://en.wikipedia.org/wiki/Snappy_(compression)

## Compressing Data with Snappy

This is implemented as [`cpustrain.wasm`][cpustrain]. Here is the source code
used in the benchmark:

[cpustrain]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/main.rs

```rust
#![no_main]
#![feature(start)]

extern crate olin;

use olin::{entrypoint, Resource};
use std::io::Write;

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    // Compressed output is discarded; only the compression work is measured.
    let fout = Resource::open("null://").expect("opening /dev/null");
    // The input is embedded into the wasm binary at compile time.
    let data = include_bytes!("/proc/cpuinfo");

    let mut writer = snap::write::FrameEncoder::new(fout);

    for _ in 0..256 {
        // compressed data
        writer.write(data)?;
    }

    Ok(())
}
```

This compresses my machine's copy of [/proc/cpuinfo][proccpuinfo] 256 times.
This number was chosen arbitrarily.

[proccpuinfo]: https://clbin.com/rxAOg

Here are the results I got from the following command:

```console
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/cpustrain.wasm' \
    './result/bin/cwa result/wasm/cpustrain.wasm' \
    './result/bin/pahi --no-cache result/wasm/cpustrain.wasm' \
    './result/bin/pahi result/wasm/cpustrain.wasm'
```

| CPU                | cwa           | pahi --no-cache   | pahi              | multiplier                        |
| :----------------- | :------------ | :---------------- | :---------------- | :-------------------------------- |
| Ryzen 5 3600       | 2.392 seconds | 38.6 milliseconds | 17.7 milliseconds | pahi is 135 times faster than cwa |
| Intel Xeon E5-1650 | 7.652 seconds | 99.3 milliseconds | 53.7 milliseconds | pahi is 142 times faster than cwa |

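
As a quick sanity check, the multiplier column can be reproduced from the table itself. This is a minimal sketch that assumes the multiplier is simply the cwa time divided by the cached pahi time; the `speedup` helper is my own name for illustration and is not part of pa'i or hyperfine.

```rust
/// How many times faster the candidate ran than the baseline.
/// Both arguments are wall-clock times in seconds.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

fn main() {
    // (CPU, cwa seconds, cached pahi seconds), copied from the Snappy table.
    let rows = [
        ("Ryzen 5 3600", 2.392, 0.0177),
        ("Intel Xeon E5-1650", 7.652, 0.0537),
    ];

    for (cpu, cwa, pahi) in rows.iter() {
        println!(
            "{}: pahi is {:.0} times faster than cwa",
            cpu,
            speedup(*cwa, *pahi)
        );
    }
}
```
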
## Parsing JSON

This is implemented as [`bigjson.wasm`][bigjson]. Here is the source code of the
benchmark:

[bigjson]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/bigjson.rs

```rust
#![no_main]
#![feature(start)]

extern crate olin;

use olin::entrypoint;
use serde_json::{from_slice, to_string, Value};

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    // The input is embedded into the wasm binary at compile time.
    let input = include_bytes!("./bigjson.json");

    // Decode the document, then encode it again.
    if let Ok(val) = from_slice(input) {
        let v: Value = val;
        if let Err(_why) = to_string(&v) {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "oh no json encoding failed!",
            ));
        }
    } else {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "oh no json parsing failed!",
        ));
    }

    Ok(())
}
```

This decodes and encodes this [rather large JSON file][bigjsonjson]. This is a
very large file (over 64k of JSON) and should represent over 65536 times the
average JSON payload size.

[bigjsonjson]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/bigjson.json

Here are the results I got from the following command:

```console
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/bigjson.wasm' \
    './result/bin/cwa result/wasm/bigjson.wasm' \
    './result/bin/pahi --no-cache result/wasm/bigjson.wasm' \
    './result/bin/pahi result/wasm/bigjson.wasm'
```

| CPU                | cwa                | pahi --no-cache    | pahi               | multiplier                          |
| :----------------- | :----------------- | :----------------- | :----------------- | :---------------------------------- |
| Ryzen 5 3600       | 257 milliseconds   | 49.4 milliseconds  | 20.4 milliseconds  | pahi is 12.62 times faster than cwa |
| Intel Xeon E5-1650 | 935.5 milliseconds | 135.4 milliseconds | 101.4 milliseconds | pahi is 9.22 times faster than cwa  |

## Parsing YAML

This is implemented as [`k8sparse.wasm`][k8sparse]. Here is the source code of
the benchmark:

[k8sparse]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/k8sparse.rs

```rust
#![no_main]
#![feature(start)]

extern crate olin;

use olin::entrypoint;
use serde_yaml::{from_slice, to_string, Value};

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    // The input is embedded into the wasm binary at compile time.
    let input = include_bytes!("./k8sparse.yaml");

    // Decode the document, then encode it again.
    if let Ok(val) = from_slice(input) {
        let v: Value = val;
        if let Err(_why) = to_string(&v) {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "oh no yaml encoding failed!",
            ));
        }
    } else {
        // Reached only when decoding fails, not when encoding succeeds.
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "oh no yaml parsing failed!",
        ));
    }

    Ok(())
}
```

This decodes and encodes this [Kubernetes manifest set from my
cluster][k8sparseyaml]. This is a set of a few normal Kubernetes deployments and
isn't as much of a worst-case scenario as the other tests are.

[k8sparseyaml]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/k8sparse.yaml#L1

Here are the results I got from running the following command:

```console
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/k8sparse.wasm' \
    './result/bin/cwa result/wasm/k8sparse.wasm' \
    './result/bin/pahi --no-cache result/wasm/k8sparse.wasm' \
    './result/bin/pahi result/wasm/k8sparse.wasm'
```

| CPU                | cwa                | pahi --no-cache    | pahi              | multiplier                          |
| :----------------- | :----------------- | :----------------- | :---------------- | :---------------------------------- |
| Ryzen 5 3600       | 211.7 milliseconds | 125.3 milliseconds | 8.5 milliseconds  | pahi is 25.04 times faster than cwa |
| Intel Xeon E5-1650 | 674.1 milliseconds | 342.7 milliseconds | 30.8 milliseconds | pahi is 21.85 times faster than cwa |

## Recursive Fibonacci Number Calculation

This is implemented as [`fibber.wasm`][fibber]. Here is the source code used in
the benchmark:

[fibber]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/fibber.rs

```rust
#![no_main]
#![feature(start)]

extern crate olin;

use olin::{entrypoint, log};

entrypoint!();

// Naive recursion with no caching: every call below the base case
// spawns two more calls.
fn fib(n: u64) -> u64 {
    if n <= 1 {
        return 1;
    }
    fib(n - 1) + fib(n - 2)
}

fn main() -> Result<(), std::io::Error> {
    log::info("starting");
    fib(30);
    log::info("done");
    Ok(())
}
```

Calculating Fibonacci numbers recursively takes exponential time. This is the
worst possible case for this kind of calculation, as it doesn't cache results
from the `fib` function and so recomputes the same values over and over.

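
To put a number on how much work the missing cache costs, here is a small standalone sketch in plain Rust with no Olin dependency. `fib_naive` uses the same recurrence as the benchmark's `fib`, and `fib_memo` is a hypothetical cached variant that is not part of the benchmark; both take an extra counter argument, added here only to count calls.

```rust
use std::collections::HashMap;

/// Same recurrence as the benchmark's `fib`, plus a call counter.
fn fib_naive(n: u64, calls: &mut u64) -> u64 {
    *calls += 1;
    if n <= 1 {
        return 1;
    }
    fib_naive(n - 1, calls) + fib_naive(n - 2, calls)
}

/// Cached variant: each distinct `n` is computed once, then read back
/// from the map on every later request.
fn fib_memo(n: u64, cache: &mut HashMap<u64, u64>, calls: &mut u64) -> u64 {
    *calls += 1;
    if n <= 1 {
        return 1;
    }
    if let Some(&hit) = cache.get(&n) {
        return hit;
    }
    let value = fib_memo(n - 1, cache, calls) + fib_memo(n - 2, cache, calls);
    cache.insert(n, value);
    value
}

fn main() {
    let mut naive_calls = 0;
    let naive = fib_naive(30, &mut naive_calls);

    let mut memo_calls = 0;
    let memo = fib_memo(30, &mut HashMap::new(), &mut memo_calls);

    assert_eq!(naive, memo);
    println!("fib(30) = {}", naive);
    println!("naive calls: {}, cached calls: {}", naive_calls, memo_calls);
}
```

For `fib(30)` the uncached version makes 2,692,537 calls, while the cached one makes 59. That gap is what makes the uncached form useful as a stress test.
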
Here are the results I got from running the following command:

```console
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/fibber.wasm' \
    './result/bin/cwa result/wasm/fibber.wasm' \
    './result/bin/pahi --no-cache result/wasm/fibber.wasm' \
    './result/bin/pahi result/wasm/fibber.wasm'
```

| CPU                | cwa               | pahi --no-cache   | pahi             | multiplier                         |
| :----------------- | :---------------- | :---------------- | :--------------- | :--------------------------------- |
| Ryzen 5 3600       | 13.6 milliseconds | 13.7 milliseconds | 2.7 milliseconds | pahi is 5.13 times faster than cwa |
| Intel Xeon E5-1650 | 41.0 milliseconds | 27.3 milliseconds | 7.2 milliseconds | pahi is 5.70 times faster than cwa |

## BLAKE2 Hashing

This is implemented as [`blake2stress.wasm`][blake2stress]. Here's the source
code for this benchmark:

[blake2stress]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/blake2stress.rs

```rust
#![no_main]
#![feature(start)]

extern crate olin;

use blake2::{Blake2b, Digest};
use olin::{entrypoint, log};

entrypoint!();

fn main() -> Result<(), std::io::Error> {
    // Reuse the inputs from the JSON and YAML benchmarks.
    let json: &'static [u8] = include_bytes!("./bigjson.json");
    let yaml: &'static [u8] = include_bytes!("./k8sparse.yaml");
    for _ in 0..8 {
        // Hash both files with a fresh hasher; the digest is discarded.
        let mut hasher = Blake2b::new();
        hasher.input(json);
        hasher.input(yaml);
        hasher.result();
    }

    Ok(())
}
```

This runs the [BLAKE2b hashing algorithm][blake2b] over the JSON and YAML files
used earlier, eight times in a row. This is supposed to represent a few hundred
thousand invocations of production code.

[blake2b]: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2b_algorithm

Here are the results I got from running the following command:

```console
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/blake2stress.wasm' \
    './result/bin/cwa result/wasm/blake2stress.wasm' \
    './result/bin/pahi --no-cache result/wasm/blake2stress.wasm' \
    './result/bin/pahi result/wasm/blake2stress.wasm'
```

| CPU                | cwa                | pahi --no-cache   | pahi              | multiplier                           |
| :----------------- | :----------------- | :---------------- | :---------------- | :----------------------------------- |
| Ryzen 5 3600       | 358.7 milliseconds | 17.4 milliseconds | 5.0 milliseconds  | pahi is 71.76 times faster than cwa  |
| Intel Xeon E5-1650 | 1.351 seconds      | 35.5 milliseconds | 11.7 milliseconds | pahi is 115.04 times faster than cwa |

## Conclusions

From these tests, we can roughly conclude that pa'i is, on average, about 54
times faster than Olin's cwa tool. A lot of this speed gain is arguably the
result of pa'i using an ahead-of-time compiler (namely cranelift as wrapped by
wasmer). Compilation time also turned out to be a somewhat notable factor when
comparing performance, but that cost only has to be paid once.

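
The arithmetic behind both statements can be checked against the tables. This is a sketch under two assumptions of mine: that the "about 54" figure is the plain arithmetic mean of the ten multipliers, and that the gap between `pahi --no-cache` and `pahi` is a rough proxy for compilation cost. The `mean` helper is a name introduced here for illustration.

```rust
/// Arithmetic mean of a slice of values.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

fn main() {
    // "multiplier" column of the five tables: Ryzen row, then Xeon row.
    let multipliers = [
        135.0, 142.0, // Snappy
        12.62, 9.22, // JSON
        25.04, 21.85, // YAML
        5.13, 5.70, // Fibonacci
        71.76, 115.04, // BLAKE2
    ];
    println!("mean speedup: {:.1}x", mean(&multipliers));

    // (benchmark, `pahi --no-cache` ms, `pahi` ms) on the Ryzen 5 3600.
    let ryzen = [
        ("cpustrain", 38.6, 17.7),
        ("bigjson", 49.4, 20.4),
        ("k8sparse", 125.3, 8.5),
        ("fibber", 13.7, 2.7),
        ("blake2stress", 17.4, 5.0),
    ];
    for (name, no_cache, cached) in ryzen.iter() {
        // Assumption: the difference is mostly one-time compilation work.
        println!("{}: about {:.1} ms saved by the cache", name, no_cache - cached);
    }
}
```

The mean comes out to roughly 54.3, which matches the figure quoted above.
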
Another conclusion I've made is very unsurprising. My old 2013 Mac Pro with an
Intel Xeon E5-1650 is _significantly_ slower in real-world computing tasks than
the new Ryzen 5 3600. Both of these machines were using the same Nix closure for
running the binaries, and both are running NixOS 20.03.

As always, if you have any feedback on what other kinds of benchmarks to run or
on how these benchmarks were collected, I welcome it. Please comment wherever
this article is posted or [contact me](/contact).

Here are the /proc/cpuinfo files for each machine being tested:

- shachi (Ryzen 5 3600) [/proc/cpuinfo](https://clbin.com/Nilnm)
- chrysalis (Intel Xeon E5-1650) [/proc/cpuinfo](https://clbin.com/24HM1)

If you run these benchmarks on your own hardware and get different data, please
let me know and I will be more than happy to add your results to these tables. I
will need the CPU model name and the output of hyperfine for each of the above
commands.