12 KiB
title | date | series | tags | ||||
---|---|---|---|---|---|---|---|
pa'i Benchmarks | 2020-03-26 | olin |
|
pa'i Benchmarks
In my last post I mentioned that pa'i was faster than Olin's cwa binary written in go without giving any benchmarks. I've been working on new ways to gather and visualize these benchmarks, and here they are.
Benchmarking WebAssembly implementations is slightly hard. A lot of existing benchmark tools simply do not run in WebAssembly as is, not to mention inside the Olin ABI. However, I have created a few tasks that I feel represent common tasks that pa'i (and later wasmcloud) will run:
- compressing data with Snappy
- parsing JSON
- parsing yaml
- recursive fibbonacci number calculation
- blake-2 hashing
As always, if you don't trust my numbers, you don't have to. Commands will be given to run these benchmarks on your own hardware. This may not be the most scientifically accurate benchmarks possible, but it should help to give a reasonable idea of the speed gains from using Rust instead of Go.
You can run these benchmarks in the docker image xena/pahi
. You may need to
replace ./result/
with /
for running this inside Docker.
$ docker run --rm -it xena/pahi bash -l
Compressing Data with Snappy
This is implemented as cpustrain.wasm
. Here is the source code
used in the benchmark:
#![no_main]
#![feature(start)]
extern crate olin;
use olin::{entrypoint, Resource};
use std::io::Write;
entrypoint!();
fn main() -> Result<(), std::io::Error> {
let fout = Resource::open("null://").expect("opening /dev/null");
let data = include_bytes!("/proc/cpuinfo");
let mut writer = snap::write::FrameEncoder::new(fout);
for _ in 0..256 {
// compressed data
writer.write(data)?;
}
Ok(())
}
This compresses my machine's copy of /proc/cpuinfo 256 times. This number was chosen arbitrarily.
Here are the results I got from the following command:
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/cpustrain.wasm' \
'./result/bin/cwa result/wasm/cpustrain.wasm' \
'./result/bin/pahi --no-cache result/wasm/cpustrain.wasm' \
'./result/bin/pahi result/wasm/cpustrain.wasm'
CPU | cwa | pahi --no-cache | pahi | multiplier |
---|---|---|---|---|
Ryzen 5 3600 | 2.392 seconds | 38.6 milliseconds | 17.7 milliseconds | pahi is 135 times faster than cwa |
Intel Xeon E5-1650 | 7.652 seconds | 99.3 milliseconds | 53.7 milliseconds | pahi is 142 times faster than cwa |
Parsing JSON
This is implemented as bigjson.wasm
. Here is the source code of the
benchmark:
#![no_main]
#![feature(start)]
extern crate olin;
use olin::entrypoint;
use serde_json::{from_slice, to_string, Value};
entrypoint!();
fn main() -> Result<(), std::io::Error> {
let input = include_bytes!("./bigjson.json");
if let Ok(val) = from_slice(input) {
let v: Value = val;
if let Err(_why) = to_string(&v) {
return Err(std::io::Error::new(
std::io::ErrorKind::Other,
"oh no json encoding failed!",
));
}
} else {
return Err(std::io::Error::new(
std::io::ErrorKind::Other,
"oh no json parsing failed!",
));
}
Ok(())
}
This decodes and encodes this rather large json file. This is a very large file (over 64k of json) and should represent over 65536 times times the average json payload size.
Here are the results I got from the following command:
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/bigjson.wasm' \
'./result/bin/cwa result/wasm/bigjson.wasm' \
'./result/bin/pahi --no-cache result/wasm/bigjson.wasm' \
'./result/bin/pahi result/wasm/bigjson.wasm'
CPU | cwa | pahi --no-cache | pahi | multiplier |
---|---|---|---|---|
Ryzen 5 3600 | 257 milliseconds | 49.4 milliseconds | 20.4 milliseconds | pahi is 12.62 times faster than cwa |
Intel Xeon E5-1650 | 935.5 milliseconds | 135.4 milliseconds | 101.4 milliseconds | pahi is 9.22 times faster than cwa |
Parsing yaml
This is implemented as k8sparse.wasm
. Here is the source code of
the benchmark:
#![no_main]
#![feature(start)]
extern crate olin;
use olin::entrypoint;
use serde_yaml::{from_slice, to_string, Value};
entrypoint!();
fn main() -> Result<(), std::io::Error> {
let input = include_bytes!("./k8sparse.yaml");
if let Ok(val) = from_slice(input) {
let v: Value = val;
if let Err(_why) = to_string(&v) {
return Err(std::io::Error::new(
std::io::ErrorKind::Other,
"oh no yaml encoding failed!",
));
} else {
return Err(std::io::Error::new(
std::io::ErrorKind::Other,
"oh no yaml parsing failed!",
));
}
}
Ok(())
}
This decodes and encodes this kubernetes manifest set from my cluster. This is a set of a few normal kubernetes deployments and isn't as much of a worse-case scenario as it could be with the other tests.
Here are the results I got from running the following command:
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/k8sparse.wasm' \
'./result/bin/cwa result/wasm/k8sparse.wasm' \
'./result/bin/pahi --no-cache result/wasm/k8sparse.wasm' \
'./result/bin/pahi result/wasm/k8sparse.wasm'
CPU | cwa | pahi --no-cache | pahi | multiplier |
---|---|---|---|---|
Ryzen 5 3600 | 211.7 milliseconds | 125.3 milliseconds | 8.5 milliseconds | pahi is 25.04 times faster than cwa |
Intel Xeon E5-1650 | 674.1 milliseconds | 342.7 milliseconds | 30.8 milliseconds | pahi is 21.85 times faster than cwa |
Recursive Fibbonacci Number Calculation
This is implemented as fibber.wasm
. Here is the source code used in
the benchmark:
#![no_main]
#![feature(start)]
extern crate olin;
use olin::{entrypoint, log};
entrypoint!();
fn fib(n: u64) -> u64 {
if n <= 1 {
return 1;
}
fib(n - 1) + fib(n - 2)
}
fn main() -> Result<(), std::io::Error> {
log::info("starting");
fib(30);
log::info("done");
Ok(())
}
Fibbonacci number calculation done recursively is an incredibly time-complicated
ordeal. This is the worst possible case for this kind of calculation, as it
doesn't cache results from the fib
function.
Here are the results I got from running the following command:
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/fibber.wasm' \
'./result/bin/cwa result/wasm/fibber.wasm' \
'./result/bin/pahi --no-cache result/wasm/fibber.wasm' \
'./result/bin/pahi result/wasm/fibber.wasm'
CPU | cwa | pahi --no-cache | pahi | multiplier |
---|---|---|---|---|
Ryzen 5 3600 | 13.6 milliseconds | 13.7 milliseconds | 2.7 milliseconds | pahi is 5.13 times faster than cwa |
Intel Xeon E5-1650 | 41.0 milliseconds | 27.3 milliseconds | 7.2 milliseconds | pahi is 5.70 times faster than cwa |
Blake-2 Hashing
This is implemented as blake2stress.wasm
. Here's the source
code for this benchmark:
#![no_main]
#![feature(start)]
extern crate olin;
use blake2::{Blake2b, Digest};
use olin::{entrypoint, log};
entrypoint!();
fn main() -> Result<(), std::io::Error> {
let json: &'static [u8] = include_bytes!("./bigjson.json");
let yaml: &'static [u8] = include_bytes!("./k8sparse.yaml");
for _ in 0..8 {
let mut hasher = Blake2b::new();
hasher.input(json);
hasher.input(yaml);
hasher.result();
}
Ok(())
}
This runs the blake2b hashing algorithm on the JSON and yaml files used earlier eight times. This is supposed to represent a few hundred thousand invocations of production code.
Here are the results I got from running the following command:
$ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/blake2stress.wasm' \
'./result/bin/cwa result/wasm/blake2stress.wasm' \
'./result/bin/pahi --no-cache result/wasm/blake2stress.wasm' \
'./result/bin/pahi result/wasm/blake2stress.wasm'
CPU | cwa | pahi --no-cache | pahi | multiplier |
---|---|---|---|---|
Ryzen 5 3600 | 358.7 milliseconds | 17.4 milliseconds | 5.0 milliseconds | pahi is 71.76 times faster than cwa |
Intel Xeon E5-1650 | 1.351 seconds | 35.5 milliseconds | 11.7 milliseconds | pahi is 115.04 times faster than cwa |
Conclusions
From these tests, we can roughly conclude that pa'i is about 54 times faster than Olin's cwa tool. A lot of this speed gain is arguably the result of pa'i using an ahead of time compiler (namely cranelift as wrapped by wasmer). The compilation time also became a somewhat notable factor for comparing performance too, however the compilation cost only has to be eaten once.
Another conclusion I've made is very unsurprising. My old 2013 mac pro with an Intel Xeon E5-1650 is significantly slower in real-world computing tasks than the new Ryzen 5 3600. Both of these machines were using the same nix closure for running the binaries and they are running NixOS 20.03.
As always, if you have any feedback for what other kinds of benchmarks to run and how these benchmarks were collected, I welcome it. Please comment wherever this article is posted or contact me.
Here are the /proc/cpuinfo files for each machine being tested:
- shachi (Ryzen 5 3600) /proc/cpuinfo
- chrysalis (Intel Xeon E5-1650) /proc/cpuinfo
If you run these benchmarks on your own hardware and get different data, please let me know and I will be more than happy to add your results to these tables. I will need the CPU model name and the output of hyperfine for each of the above commands.