This splits the main arithmetic loops from the final carry/borrow
propagation, and normalizes the slice lengths before iterating, which
lets the optimizer generate better code for these functions.
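
As a rough illustration of the shape described above (not the actual
code from this change), here is a minimal sketch of an in-place add.
The name `add_in_place`, the 32-bit little-endian digit layout, and
the `a.len() >= b.len()` precondition are all assumptions made for
the example.

```rust
/// Hypothetical helper: adds `b` into `a` in place and returns the
/// final carry. Digits are assumed to be little-endian `u32` values,
/// and `a` is assumed to be at least as long as `b`.
fn add_in_place(a: &mut [u32], b: &[u32]) -> u32 {
    assert!(a.len() >= b.len());

    // Normalize the lengths up front: `a_lo` has exactly `b.len()`
    // digits, so the main loop runs over two equal-length slices.
    let (a_lo, a_hi) = a.split_at_mut(b.len());

    let mut carry = 0u64;

    // Main arithmetic loop: digit-wise add with carry.
    for (x, &y) in a_lo.iter_mut().zip(b) {
        let sum = u64::from(*x) + u64::from(y) + carry;
        *x = sum as u32; // keep the low 32 bits
        carry = sum >> 32; // 0 or 1
    }

    // Final carry propagation, kept separate from the main loop.
    // It stops as soon as the carry has been absorbed.
    for x in a_hi.iter_mut() {
        if carry == 0 {
            break;
        }
        let sum = u64::from(*x) + carry;
        *x = sum as u32;
        carry = sum >> 32;
    }

    carry as u32
}
```

With equal-length slices in the main loop, the zipped iterators need
no per-digit length checks, and the early exit only appears in the
short propagation loop. Subtraction with a borrow can follow the same
split.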

If a benchmark takes very long to run, it's harder to iterate on changes
and see their effect. Even reduced to 100, this pow_bench takes around 8
seconds on my machine, and still shows meaningful optimization effects.