A new Fibonacci benchmark demonstrates the improvement by using both
addition and subtraction in each iteration of the loop, like #200.
Before:
test fib2_100 ... bench: 4,558 ns/iter (+/- 3,357)
test fib2_1000 ... bench: 62,575 ns/iter (+/- 5,200)
test fib2_10000 ... bench: 2,898,425 ns/iter (+/- 207,973)
After:
test fib2_100 ... bench: 1,973 ns/iter (+/- 102)
test fib2_1000 ... bench: 41,203 ns/iter (+/- 947)
test fib2_10000 ... bench: 2,544,272 ns/iter (+/- 45,183)
This splits the main arithmetic loops from the final carry/borrow
propagation, and also normalizes the slice lengths before iteration.
This lets the optimizer generate better code for these functions.
When x.len() and y.len() are both equal and odd, we have x1.len() + y1.len() + 1
(the required size to multiply x1 and y1) > y.len() + 1
This fixes#187