slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization
- Refactor the current functionality into a helper function
- Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
- Add a codegen test checking for vectorization and no panicking
- Add benches for `eq_ignore_ascii_case`
---
The optimized function is initially only enabled for x86_64 which has `sse2` as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation.
Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.
Benchmarks - Cases below 16 bytes are unaffected, cases above all show sizeable improvements.
```
before:
str::eq_ignore_ascii_case::bench_large_str_eq 4942.30ns/iter +/- 48.20
str::eq_ignore_ascii_case::bench_medium_str_eq 632.01ns/iter +/- 16.87
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 16.28ns/iter +/- 0.45
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 35.23ns/iter +/- 2.28
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.56ns/iter +/- 0.22
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.64ns/iter +/- 0.06
after:
str::eq_ignore_ascii_case::bench_large_str_eq 611.63ns/iter +/- 28.29
str::eq_ignore_ascii_case::bench_medium_str_eq 77.10ns/iter +/- 19.76
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 3.49ns/iter +/- 0.39
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 3.50ns/iter +/- 0.27
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.27ns/iter +/- 0.09
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.60ns/iter +/- 0.05
```
Refactor the current functionality into a helper function
Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
Add a codegen test
Add benches for `eq_ignore_ascii_case`
The optimized function is initially only enabled for x86_64 which has `sse2` as
part of its baseline, but none of the code is platform specific. Other
platforms with SIMD instructions may also benefit from this implementation.
Performance improvements only manifest for slices of 16 bytes or longer, so the
optimized path is gated behind a length check for greater than or equal to 16.
Use `std::mem::{size_of, size_of_val, align_of, align_of_val}` from the
prelude instead of importing or qualifying them.
These functions were added to all preludes in Rust 1.80.