This removes the need for a 128 bit storage by making use of the fact that
there can be either no over/underflow, either one or both, and each time
the target type suffices to hold the limit for comparison.
The downside is that the code looks a bit more complex.
This test code included in this commit is from @oyvindln 's PR. They also
greatly helped fixing a number of errors I made along the way. Thanks a lot!