pa'i benchmarks (#132)

* blog: pa'i benchmarks
* blog/pa'i benchmarks: link fixes
2020-03-26 17:47:36 -04:00 · 2020-03-26 17:47:36 -04:00 · e2f46f264e
parent 7dd56c3190
commit e2f46f264e
2 changed files with 348 additions and 0 deletions
--- a/blog/pahi-benchmarks-2020-03-26.markdown
+++ b/blog/pahi-benchmarks-2020-03-26.markdown
@@ -0,0 +1,345 @@
 ---
 title: "pa'i Benchmarks"
 date: 2020-03-26
 series: olin
 tags:
  - wasm
  - rust
  - golang
  - pahi
 ---
 # pa'i Benchmarks
 In my [last post][pahihelloworld] I mentioned that pa'i was faster than Olin's
 cwa binary, which is written in Go, without giving any benchmarks. I've been
 working on new ways to gather and visualize these benchmarks, and here they are.
 [pahihelloworld]: https://christine.website/blog/pahi-hello-world-2020-02-22
 Benchmarking WebAssembly implementations is slightly hard. A lot of existing
 benchmark tools simply do not run in WebAssembly as is, not to mention inside
 the Olin ABI. However, I have created a few benchmarks that I feel represent
 common tasks that pa'i (and later wasmcloud) will run:
 - compressing data with [Snappy][snappy]
 - parsing JSON
 - parsing YAML
 - recursive Fibonacci number calculation
 - BLAKE2 hashing
 As always, if you don't trust my numbers, you don't have to. Commands will be
 given to run these benchmarks on your own hardware. These may not be the most
 scientifically accurate benchmarks possible, but they should help to give a
 reasonable idea of the speed gains from using Rust instead of Go.
 You can run these benchmarks in the Docker image `xena/pahi`. You may need to
 replace `./result/` with `/` when running them inside Docker.
 ```console
 $ docker run --rm -it xena/pahi bash -l
 ```
 [snappy]: https://en.wikipedia.org/wiki/Snappy_(compression)
 ## Compressing Data with Snappy
 This is implemented as [`cpustrain.wasm`][cpustrain]. Here is the source code
 used in the benchmark:
 [cpustrain]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/main.rs
 ```rust
 #![no_main]
 #![feature(start)]
 extern crate olin;
 use olin::{entrypoint, Resource};
 use std::io::Write;
 entrypoint!();
 fn main() -> Result<(), std::io::Error> {
    let fout = Resource::open("null://").expect("opening /dev/null");
    let data = include_bytes!("/proc/cpuinfo");
    let mut writer = snap::write::FrameEncoder::new(fout);
    for _ in 0..256 {
        // compressed data
        writer.write(data)?;
    }
    Ok(())
 }
 ```
 This compresses my machine's copy of [/proc/cpuinfo][proccpuinfo] 256 times.
 This number was chosen arbitrarily.
 [proccpuinfo]: https://clbin.com/rxAOg
 Here are the results I got from the following command:
 ```console
 $ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/cpustrain.wasm' \
        './result/bin/cwa result/wasm/cpustrain.wasm' \
        './result/bin/pahi --no-cache result/wasm/cpustrain.wasm' \
        './result/bin/pahi result/wasm/cpustrain.wasm'
 ```
 | CPU                | cwa           | pahi --no-cache   | pahi              | multiplier                        |
 | :----------------- | :------------ | :---------------- | :---------------- | :-------------------------------- |
 | Ryzen 5 3600       | 2.392 seconds | 38.6 milliseconds | 17.7 milliseconds | pahi is 135 times faster than cwa |
 | Intel Xeon E5-1650 | 7.652 seconds | 99.3 milliseconds | 53.7 milliseconds | pahi is 142 times faster than cwa |
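The multiplier column is simply the cwa time divided by the cached pahi time. Here is a minimal sketch of that arithmetic using the Snappy numbers above. The function names are made up for illustration, and treating the gap between `pahi --no-cache` and `pahi` as compilation overhead is an assumption, not something hyperfine reports directly.

```rust
/// How many times faster the candidate is than the baseline.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

/// Extra time spent when the module cache is disabled (assumed to be
/// dominated by compiling the WebAssembly module).
fn cache_savings_ms(no_cache_ms: f64, cached_ms: f64) -> f64 {
    no_cache_ms - cached_ms
}

fn main() {
    // (CPU, cwa, pahi --no-cache, pahi), all in milliseconds,
    // copied from the Snappy table above.
    let rows = [
        ("Ryzen 5 3600", 2392.0, 38.6, 17.7),
        ("Intel Xeon E5-1650", 7652.0, 99.3, 53.7),
    ];

    for (cpu, cwa, no_cache, cached) in rows.iter() {
        println!(
            "{}: pahi is {:.0} times faster than cwa, cache saves {:.1} ms",
            cpu,
            speedup(*cwa, *cached),
            cache_savings_ms(*no_cache, *cached)
        );
    }
}
```

The same two functions can be pointed at any of the other tables in this post by swapping in that table's numbers.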
 ## Parsing JSON
 This is implemented as [`bigjson.wasm`][bigjson]. Here is the source code of the
 benchmark:
 [bigjson]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/bigjson.rs
 ```rust
 #![no_main]
 #![feature(start)]
 extern crate olin;
 use olin::entrypoint;
 use serde_json::{from_slice, to_string, Value};
 entrypoint!();
 fn main() -> Result<(), std::io::Error> {
    let input = include_bytes!("./bigjson.json");
    if let Ok(val) = from_slice(input) {
        let v: Value = val;
        if let Err(_why) = to_string(&v) {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "oh no json encoding failed!",
            ));
        }
    } else {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "oh no json parsing failed!",
        ));
    }
    Ok(())
 }
 ```
 This decodes and encodes this [rather large JSON file][bigjsonjson]. This is a
 very large file (over 64k of JSON) and should represent over 65536 times the
 average JSON payload size.
 [bigjsonjson]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/bigjson.json
 Here are the results I got from the following command:
 ```console
 $ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/bigjson.wasm' \
        './result/bin/cwa result/wasm/bigjson.wasm' \
        './result/bin/pahi --no-cache result/wasm/bigjson.wasm' \
        './result/bin/pahi result/wasm/bigjson.wasm'
 ```
 | CPU                | cwa                | pahi --no-cache    | pahi               | multiplier                          |
 | :----------------- | :------------      | :----------------  | :----------------  | :--------------------------------   |
 | Ryzen 5 3600       | 257 milliseconds   | 49.4 milliseconds  | 20.4 milliseconds  | pahi is 12.62 times faster than cwa |
 | Intel Xeon E5-1650 | 935.5 milliseconds | 135.4 milliseconds | 101.4 milliseconds | pahi is 9.22 times faster than cwa  |
 ## Parsing YAML
 This is implemented as [`k8sparse.wasm`][k8sparse]. Here is the source code of
 the benchmark:
 [k8sparse]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/k8sparse.rs
 ```rust
 #![no_main]
 #![feature(start)]
 extern crate olin;
 use olin::entrypoint;
 use serde_yaml::{from_slice, to_string, Value};
 entrypoint!();
 fn main() -> Result<(), std::io::Error> {
    let input = include_bytes!("./k8sparse.yaml");
    if let Ok(val) = from_slice(input) {
        let v: Value = val;
        if let Err(_why) = to_string(&v) {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "oh no yaml encoding failed!",
            ));
        }
    } else {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "oh no yaml parsing failed!",
        ));
    }
    Ok(())
 }
 ```
 This decodes and encodes this [Kubernetes manifest set from my
 cluster][k8sparseyaml]. This is a set of a few normal Kubernetes deployments and
 isn't as much of a worst-case scenario as the other tests.
 [k8sparseyaml]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/k8sparse.yaml#L1
 Here are the results I got from running the following command:
 ```console
 $ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/k8sparse.wasm' \
        './result/bin/cwa result/wasm/k8sparse.wasm' \
        './result/bin/pahi --no-cache result/wasm/k8sparse.wasm' \
        './result/bin/pahi result/wasm/k8sparse.wasm'
 ```
 | CPU                | cwa                | pahi --no-cache    | pahi              | multiplier                          |
 | :----------------- | :------------      | :----------------  | :---------------- | :--------------------------------   |
 | Ryzen 5 3600       | 211.7 milliseconds | 125.3 milliseconds | 8.5 milliseconds  | pahi is 25.04 times faster than cwa |
 | Intel Xeon E5-1650 | 674.1 milliseconds | 342.7 milliseconds | 30.8 milliseconds | pahi is 21.85 times faster than cwa |
 ## Recursive Fibonacci Number Calculation
 This is implemented as [`fibber.wasm`][fibber]. Here is the source code used in
 the benchmark:
 [fibber]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/fibber.rs
 ```rust
 #![no_main]
 #![feature(start)]
 extern crate olin;
 use olin::{entrypoint, log};
 entrypoint!();
 fn fib(n: u64) -> u64 {
    if n <= 1 {
        return 1;
    }
    fib(n - 1) + fib(n - 2)
 }
 fn main() -> Result<(), std::io::Error> {
    log::info("starting");
    fib(30);
    log::info("done");
    Ok(())
 }
 ```
 Calculating Fibonacci numbers recursively takes exponential time. This is the
 worst possible case for this kind of calculation, as it doesn't cache results
 from the `fib` function.
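To give a feel for how much work the missing cache costs, here is a small sketch that counts the calls made by the same recurrence and compares it with a cached variant. Both `fib_counted` and `fib_cached` are hypothetical helpers written for this illustration. They are not part of the benchmark, which deliberately keeps the slow version.

```rust
use std::collections::HashMap;

/// Same recurrence as the benchmark, but also counts every call made.
fn fib_counted(n: u64, calls: &mut u64) -> u64 {
    *calls += 1;
    if n <= 1 {
        return 1;
    }
    fib_counted(n - 1, calls) + fib_counted(n - 2, calls)
}

/// Hypothetical cached variant: each value of `n` is computed only once.
fn fib_cached(n: u64, cache: &mut HashMap<u64, u64>) -> u64 {
    if n <= 1 {
        return 1;
    }
    if let Some(&hit) = cache.get(&n) {
        return hit;
    }
    let value = fib_cached(n - 1, cache) + fib_cached(n - 2, cache);
    cache.insert(n, value);
    value
}

fn main() {
    let mut calls: u64 = 0;
    let naive = fib_counted(30, &mut calls);

    let mut cache = HashMap::new();
    let cached = fib_cached(30, &mut cache);

    // Both variants compute the same number; only the amount of work differs.
    assert_eq!(naive, cached);
    println!(
        "fib(30) = {}, uncached calls = {}, cache entries = {}",
        naive,
        calls,
        cache.len()
    );
}
```

For `fib(30)` the uncached version makes millions of calls, while the cached one stores only a few dozen entries. That gap is what makes the uncached form a useful CPU stress test.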
 Here are the results I got from running the following command:
 ```console
 $ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/fibber.wasm' \
        './result/bin/cwa result/wasm/fibber.wasm' \
        './result/bin/pahi --no-cache result/wasm/fibber.wasm' \
        './result/bin/pahi result/wasm/fibber.wasm'
 ```
 | CPU                | cwa               | pahi --no-cache   | pahi              | multiplier                         |
 | :----------------- | :------------     | :---------------- | :---------------- | :--------------------------------  |
 | Ryzen 5 3600       | 13.6 milliseconds | 13.7 milliseconds | 2.7 milliseconds  | pahi is 5.13 times faster than cwa |
 | Intel Xeon E5-1650 | 41.0 milliseconds | 27.3 milliseconds | 7.2 milliseconds  | pahi is 5.70 times faster than cwa |
 ## BLAKE2 Hashing
 This is implemented as [`blake2stress.wasm`][blake2stress]. Here's the source
 code for this benchmark:
 [blake2stress]: https://github.com/Xe/pahi/blob/96f051d16df35cbceb8bf802e7dd7482b41b7d8a/wasm/cpustrain/src/bin/blake2stress.rs
 ```rust
 #![no_main]
 #![feature(start)]
 extern crate olin;
 use blake2::{Blake2b, Digest};
 use olin::{entrypoint, log};
 entrypoint!();
 fn main() -> Result<(), std::io::Error> {
    let json: &'static [u8] = include_bytes!("./bigjson.json");
    let yaml: &'static [u8] = include_bytes!("./k8sparse.yaml");
    for _ in 0..8 {
        let mut hasher = Blake2b::new();
        hasher.input(json);
        hasher.input(yaml);
        hasher.result();
    }
    Ok(())
 }
 ```
 This runs the [BLAKE2b hashing algorithm][blake2b] eight times on the JSON and
 YAML files used earlier. This is supposed to represent a few hundred thousand
 invocations of production code.
 [blake2b]: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2b_algorithm
 Here are the results I got from running the following command:
 ```console
 $ hyperfine --warmup 3 --prepare './result/bin/pahi result/wasm/blake2stress.wasm' \
        './result/bin/cwa result/wasm/blake2stress.wasm' \
        './result/bin/pahi --no-cache result/wasm/blake2stress.wasm' \
        './result/bin/pahi result/wasm/blake2stress.wasm'
 ```
 | CPU                | cwa                | pahi --no-cache   | pahi              | multiplier                           |
 | :----------------- | :------------      | :---------------- | :---------------- | :--------------------------------    |
 | Ryzen 5 3600       | 358.7 milliseconds | 17.4 milliseconds | 5.0 milliseconds  | pahi is 71.76 times faster than cwa  |
 | Intel Xeon E5-1650 | 1.351 seconds      | 35.5 milliseconds | 11.7 milliseconds | pahi is 115.04 times faster than cwa |
 ## Conclusions
 From these tests, we can roughly conclude that pa'i is on average about 54
 times faster than Olin's cwa tool. A lot of this speed gain is arguably the
 result of pa'i using an ahead-of-time compiler (namely Cranelift as wrapped by
 Wasmer). The compilation time also became a somewhat notable factor when
 comparing performance, but the compilation cost only has to be eaten once.
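The 54 figure lines up with the arithmetic mean of the ten multipliers in the tables above. Here is a quick sketch to check it. The geometric mean is included only as an assumption about what a reader might also want to compare. It is another common way to summarize ratios and is not the method used in this post.

```rust
/// Arithmetic mean of a slice of ratios.
fn arithmetic_mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Geometric mean, computed through logarithms to avoid a huge product.
fn geometric_mean(xs: &[f64]) -> f64 {
    (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
}

/// The multiplier column of the five tables above, in order
/// (Ryzen 5 3600 first, then Intel Xeon E5-1650, for each benchmark).
const MULTIPLIERS: [f64; 10] = [
    135.0, 142.0, // Snappy
    12.62, 9.22, // JSON
    25.04, 21.85, // YAML
    5.13, 5.70, // Fibonacci
    71.76, 115.04, // BLAKE2
];

fn main() {
    println!("arithmetic mean: {:.1}", arithmetic_mean(&MULTIPLIERS));
    println!("geometric mean:  {:.1}", geometric_mean(&MULTIPLIERS));
}
```

The geometric mean comes out lower than the arithmetic one because it is less dominated by the two very large Snappy results. Either way, the per-benchmark tables tell you more than any single summary number.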
 Another conclusion I've made is very unsurprising. My old 2013 Mac Pro with an
 Intel Xeon E5-1650 is _significantly_ slower in real-world computing tasks than
 the new Ryzen 5 3600. Both of these machines were using the same Nix closure for
 running the binaries, and both are running NixOS 20.03.
 As always, if you have any feedback on what other kinds of benchmarks to run
 or on how these benchmarks were collected, I welcome it. Please comment wherever
 this article is posted or [contact me](/contact).
 Here are the /proc/cpuinfo files for each machine being tested:
 - shachi (Ryzen 5 3600) [/proc/cpuinfo](https://clbin.com/Nilnm)
 - chrysalis (Intel Xeon E5-1650) [/proc/cpuinfo](https://clbin.com/24HM1)
 If you run these benchmarks on your own hardware and get different data, please
 let me know and I will be more than happy to add your results to these tables. I
 will need the CPU model name and the output of hyperfine for each of the above
 commands.
--- a/shell.nix
+++ b/shell.nix
@@ -23,5 +23,8 @@ mkShell {
    # dependency manager
    niv
    # tools
    ispell
  ];
 }