Replace the `core::arch` versions of the following with handwritten
assembly. This avoids recursion issues (cg_gcc uses `rint` as a
fallback) as well as problems with `aarch64be`.
* `rint`
* `rintf`
Additionally, add assembly versions of the following:
* `fma`
* `fmaf`
* `sqrt`
* `sqrtf`
If the `fp16` target feature is available, which implies `neon`, also
include the following:
* `rintf16`
* `sqrtf16`
`sqrt` is added to match the implementation for `x86`. `fma` is included
since it is used by many other routines.
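
For illustration, the `f64` routines listed above might look roughly like
the sketch below. This is not the exact code from this change: the `cfg`
gating, the use of `frintn` for `rint`, and the portable fallbacks (present
only so the sketch builds on other targets) are assumptions.

```rust
// Illustrative sketch only. The cfg gating, the instruction chosen for
// `rint`, and the non-aarch64 fallbacks are assumptions, not the code
// from this change.

/// Square root via a single `fsqrt` instruction.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn sqrt(mut x: f64) -> f64 {
    // SAFETY: `fsqrt` only reads and writes the named register.
    unsafe {
        core::arch::asm!(
            "fsqrt {x:d}, {x:d}",
            x = inout(vreg) x,
            options(nomem, nostack, pure),
        );
    }
    x
}

/// Fused multiply-add, `x * y + z` with a single rounding, via `fmadd`.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn fma(mut x: f64, y: f64, z: f64) -> f64 {
    // SAFETY: `fmadd` only reads and writes the named registers.
    unsafe {
        core::arch::asm!(
            "fmadd {x:d}, {x:d}, {y:d}, {z:d}",
            x = inout(vreg) x,
            y = in(vreg) y,
            z = in(vreg) z,
            options(nomem, nostack, pure),
        );
    }
    x
}

/// Round to the nearest integer, ties to even, via `frintn` (assumed here).
/// Because this is plain assembly, no backend can lower it back into a
/// call to `rint`.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn rint(mut x: f64) -> f64 {
    // SAFETY: `frintn` only reads and writes the named register.
    unsafe {
        core::arch::asm!(
            "frintn {x:d}, {x:d}",
            x = inout(vreg) x,
            options(nomem, nostack, pure),
        );
    }
    x
}

// Portable stand-ins so the sketch also builds on other targets.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn sqrt(x: f64) -> f64 {
    x.sqrt()
}

#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn fma(x: f64, y: f64, z: f64) -> f64 {
    x.mul_add(y, z)
}

#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn rint(x: f64) -> f64 {
    // Adding and subtracting 2^52 rounds to an integer (ties to even)
    // under the default rounding mode.
    const TOINT: f64 = 4503599627370496.0; // 2^52
    if x.is_nan() || x.abs() >= TOINT {
        return x;
    }
    ((x.abs() + TOINT) - TOINT).copysign(x)
}
```

The `f32` variants would presumably differ only in using the `s` register
modifier in place of `d`.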
A handful of other operations could also be given assembly
implementations. They are omitted here because basic float math
routines should be available in `core` in the near future, which will
allow us to defer to LLVM for assembly lowering rather than implementing
these ourselves.