rust/library
bors 22bcb81c66 Auto merge of #122770 - iximeow:ixi/int-formatting-optimization, r=workingjubilee
improve codegen of fmt_num to delete unreachable panic

it seems LLVM doesn't realize that `curr` is always decremented at least once in either loop formatting characters of the input string by their appropriate radix, and so the later `&buf[curr..]` generates a check for out-of-bounds access and panic. this is unreachable in reality as even for `x == T::zero()` we'll produce at least the character `Self::digit(T::zero())`, yielding at least one character output, and `curr` will always be at least one below `buf.len()`.

adjust `fmt_int` to make this fact more obvious to the compiler, which fortunately (or unfortunately) results in a measurable performance improvement for workloads heavy on formatting integers.

in the program i'd noticed this in, you can see the `cmp $0x80,%rdi; ja 7c` here, which branches to a slice index fail helper:
<img width="660" alt="before" src="https://github.com/rust-lang/rust/assets/4615790/ac482d54-21f8-494b-9c83-4beadc3ca0ef">

where after this change the function is broadly similar, but smaller, with one fewer registers updated in each pass through the loop in addition the never-taken `cmp/ja` being gone:
<img width="646" alt="after" src="https://github.com/rust-lang/rust/assets/4615790/1bee1d76-b674-43ec-9b21-4587364563aa">

this represents a ~2-3% difference in runtime in my [admittedly comically i32-formatting-bound](https://github.com/athre0z/disas-bench/blob/master/bench/yaxpeax/src/main.rs#L58-L67) use case (printing x86 instructions, including i32 displacements and immediates) as measured on a ryzen 9 3950x.

the impact on `<impl LowerHex for i8>::fmt` is both more dramatic and less impactful: it continues to have a loop that is evaluated at most twice, though the compiler doesn't know that to unroll it. the generated code there is identical to the impl for `i32`. there, the smaller loop body has less effect on runtime, and removing the never-taken slice bounds check is offset by whatever address recalculation is happening with the `lea/add/neg` at the end of the loop. it behaves about the same before and after.

---

i initially measured slightly better outcomes using `unreachable_unchecked()` here instead, but that was hacking on std and rebuilding with `-Z build-std` on an older rustc (nightly 5b377cece, 2023-06-30). it does not yield better outcomes now, so i see no reason to proceed with that approach at all.

<details>
<summary>initial notes about that, seemingly irrelevant on modern rustc</summary>
i went through a few tries at getting llvm to understand the bounds check isn't necessary, but i should mention the _best_ i'd seen here was actually from the existing `fmt_int` with a diff like
```diff
        if x == zero {
            // No more digits left to accumulate.
            break;
        };
    }
}
+
+ if curr >= buf.len() {
+     unsafe { core::hint::unreachable_unchecked(); }
+ }
let buf = &buf[curr..];
```

posting a random PR to `rust-lang/rust` to do that without a really really compelling reason seemed a bit absurd, so i tried to work that into something that seems more palatable at a glance. but if you're interested, that certainly produced better (x86_64) code through LLVM. in that case with `buf.iter_mut().rev()` as the iterator, `<impl LowerHex for i8>::fmt` actually unrolls into something like

```
put_char(x & 0xf);
let mut len = 1;
if x > 0xf {
  put_char((x >> 4) & 0xf);
  len = 2;
}
pad_integral(buf[buf.len() - len..]);
```

it's pretty cool! `<impl LowerHex for i32>::fmt` also was slightly better. that all resulted in closer to an 6% difference in my use case.

</details>

---

i have not looked at formatters other than LowerHex/UpperHex with this change, though i'd be a bit shocked if any were _worse_.

(i have absolutely _no_ idea how you'd regression test this, but that might be just my not knowing what the right tool for that would be in rust-lang/rust. i'm of half a mind that this is small and fiddly enough to not be worth landing lest it quietly regress in the future anyway. but i didn't want to discard the idea without at least offering it upstream here)
2024-11-14 04:17:20 +00:00
..
alloc Rollup merge of #132929 - cuviper:check-alloc_zeroed, r=tgross35 2024-11-12 06:27:20 +01:00
backtrace@230570f2da Try latest backtrace 2024-08-29 12:13:19 -07:00
core Auto merge of #122770 - iximeow:ixi/int-formatting-optimization, r=workingjubilee 2024-11-14 04:17:20 +00:00
panic_abort step cfg(bootstrap) 2024-07-28 14:46:29 -04:00
panic_unwind move strict provenance lints to new feature gate, remove old feature gates 2024-10-21 15:22:17 +01:00
portable-simd move strict provenance lints to new feature gate, remove old feature gates 2024-10-21 15:22:17 +01:00
proc_macro Add a new trait proc_macro::ToTokens 2024-10-27 03:17:05 +01:00
profiler_builtins Run cargo update and update licenses 2024-11-13 12:22:10 +00:00
rtstartup Let InstCombine remove Clone shims inside Clone shims 2024-07-25 15:14:42 -04:00
rustc-std-workspace-alloc Replace libstd, libcore, liballoc in line comments. 2022-12-30 14:00:42 +01:00
rustc-std-workspace-core update rustc-std-workspace crates 2024-11-04 07:45:15 +01:00
rustc-std-workspace-std Switch all libraries to the 2021 edition 2021-12-23 19:03:47 +08:00
std Rollup merge of #132869 - lolbinarycat:library-fix-too_long_first_doc_paragraph, r=tgross35 2024-11-12 06:27:19 +01:00
stdarch@ff9a444503 bump stdarch 2024-10-28 09:42:42 +01:00
sysroot Make profiler_builtins an optional dependency of sysroot, not std 2024-10-17 22:08:36 +11:00
test Change a few &Option<T> into Option<&T> 2024-10-08 23:26:29 -04:00
unwind move strict provenance lints to new feature gate, remove old feature gates 2024-10-21 15:22:17 +01:00
windows_targets Win: Add dbghelp to the list of import libraries 2024-09-06 21:21:49 +00:00
Cargo.lock Run cargo update and update licenses 2024-11-13 12:22:10 +00:00
Cargo.toml Auto merge of #129063 - the8472:cold-opt-size, r=Amanieu 2024-09-02 00:58:50 +00:00