A new Fibonacci benchmark demonstrates the improvement by using both
addition and subtraction in each iteration of the loop, like #200.
Before:
test fib2_100 ... bench: 4,558 ns/iter (+/- 3,357)
test fib2_1000 ... bench: 62,575 ns/iter (+/- 5,200)
test fib2_10000 ... bench: 2,898,425 ns/iter (+/- 207,973)
After:
test fib2_100 ... bench: 1,973 ns/iter (+/- 102)
test fib2_1000 ... bench: 41,203 ns/iter (+/- 947)
test fib2_10000 ... bench: 2,544,272 ns/iter (+/- 45,183)
This splits the main arithmetic loops from the final carry/borrow
propagation, and also normalizes the slice lengths before iteration.
This lets the optimizer generate better code for these functions.
If a benchmark takes very long to run, it's harder to iterate on changes
to see their effect. Even reduced to 100, this pow_bench takes around 8
seconds on my machine, and still shows meaningful optimization effects.