rust/src/libcore
Josh Stone c70cdc0ed4
Rollup merge of #59283 - SimonSapin:branchless-ascii-case, r=joshtriplett
Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup_table` has code (effectively) identical to what `libcore` used before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` to what it uses after this PR.

~~See [code comments](https://github.com/rust-lang/rust/pull/59283/commits/ce933f77c865a15670855ac5941fe200752b739f#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~

**Update:** the algorithm was simplified while keeping the performance. See the `branchless` vs. `mask_shifted_bool_match_range` benchmarks.
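
As a rough illustration of what the `mask_shifted_bool_match_range` name suggests, here is a minimal sketch: a range match yields a `bool`, which is cast to `u8`, shifted into the case bit (`0x20`) and OR-ed into the byte. The function names `to_lower_branchless` and `make_lower_branchless` are made up for this example and are not the identifiers used in libcore or in the benchmarks.

```rust
/// Hypothetical helper (name invented for this sketch): lowercases one byte
/// without a data-dependent branch.
fn to_lower_branchless(byte: u8) -> u8 {
    // "match range": true only for b'A'..=b'Z'.
    let is_upper = matches!(byte, b'A'..=b'Z');
    // "shifted bool": 1 << 5 == 0x20, the bit that separates 'A' from 'a'.
    // "mask": OR-ing it in sets that bit only for uppercase letters.
    byte | ((is_upper as u8) << 5)
}

/// Applies the per-byte conversion to a whole slice. The loop body is
/// uniform across iterations, which is the shape an optimizer can vectorize.
fn make_lower_branchless(bytes: &mut [u8]) {
    for b in bytes.iter_mut() {
        *b = to_lower_branchless(*b);
    }
}

fn main() {
    // Non-letters and the non-ASCII byte 0xC9 must pass through unchanged.
    let mut s = *b"Hello, WORLD! [\\]^_ \xC9";
    make_lower_branchless(&mut s);
    assert_eq!(&s, b"hello, world! [\\]^_ \xC9");
}
```

The uppercase direction would be the mirror image: clear the same bit with `byte & !((is_lower as u8) << 5)`.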

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmark implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).
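
To make the “fake SIMD” idea and its carry problem concrete, here is a hedged sketch. It is not the code of the `fake_simd_u32` benchmark; the function names and the exact constants are assumptions chosen for illustration. Each byte lane is tested with two additions that push the lane's high bit to 1 when the byte is `>= b'A'` or `> b'Z'`, which only works if no lane overflows into its neighbour.

```rust
/// Hypothetical sketch: lowercases four bytes packed in a `u32`.
/// Only correct if every byte is ASCII (< 0x80); see `main` for a failure.
fn lower_word_ascii_only(w: u32) -> u32 {
    // 0x3F = 0x80 - b'A': lane high bit is set iff the byte >= b'A'.
    let ge_a = w.wrapping_add(0x3F3F_3F3F);
    // 0x25 = 0x80 - (b'Z' + 1): lane high bit is set iff the byte > b'Z'.
    let gt_z = w.wrapping_add(0x2525_2525);
    // Keep the high bit only in lanes holding b'A'..=b'Z'.
    let is_upper = ge_a & !gt_z & 0x8080_8080;
    // 0x80 >> 2 == 0x20, the ASCII case bit.
    w | (is_upper >> 2)
}

/// Lowercases an ASCII-only slice, four bytes at a time where alignment allows.
fn make_lower_fake_simd(bytes: &mut [u8]) {
    debug_assert!(bytes.is_ascii(), "sketch assumes ASCII-only input");
    // SAFETY: every bit pattern is a valid `u32`, so reinterpreting the
    // aligned middle part of a `[u8]` as `[u32]` is sound.
    let (before, aligned, after) = unsafe { bytes.align_to_mut::<u32>() };
    // Unaligned head and tail fall back to the one-byte-at-a-time method.
    for b in before.iter_mut().chain(after.iter_mut()) {
        b.make_ascii_lowercase();
    }
    for w in aligned.iter_mut() {
        *w = lower_word_ascii_only(*w);
    }
}

fn main() {
    let mut buf = b"Fake SIMD on ASCII: @AZ[ `az{ 0189".to_vec();
    let expected = buf.to_ascii_lowercase();
    make_lower_fake_simd(&mut buf);
    assert_eq!(buf, expected);

    // The carry bug: 0xC1 + 0x3F overflows its lane, and the carry makes the
    // neighbouring b'@' (0x40) look like an uppercase letter.
    let bad = lower_word_ascii_only(u32::from_le_bytes([0xC1, b'@', 0, 0]));
    assert_eq!(bad.to_le_bytes(), [0xC1, b'`', 0, 0]);
}
```

The last assertion shows the failure mode described above: a non-ASCII byte corrupts the byte next to it, turning `@` into a backtick.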

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this however because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with Intel i7-7700K (updated from https://github.com/rust-lang/rust/pull/59283#issuecomment-474146863):

```text
6830 bytes string:

alloc_only                          ... bench:    112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench:  1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench:  1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:    417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:    401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:    365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:    367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:    361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:    361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench:  6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench:  4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:    339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:    339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:

alloc_only                          ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench:     29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench:     24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench:     35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:

alloc_only                          ... bench:     14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench:     18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench:     21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench:     19 ns/iter (+/- 0) = 368 MB/s
```
2019-03-27 18:15:25 -07:00
..
benches ASCII uppercase: add "subtract multiplied bool" benchmark 2019-03-19 13:41:59 +01:00
char Rollup merge of #58778 - xfix:exact_size_case_mapping_iter, r=SimonSapin 2019-03-19 15:16:49 +01:00
fmt avoid unnecessary use of MaybeUninit::get_ref, and expand comment on the others 2019-02-22 23:05:58 +01:00
future Put Future trait into spotlight 2019-02-20 22:06:30 +01:00
hash libs: doc comments 2019-02-10 23:57:25 +00:00
iter Replaced self-reflective explicit types with clearer Self or Self::… in stdlib docs 2019-03-18 13:57:51 +01:00
num Rollup merge of #59283 - SimonSapin:branchless-ascii-case, r=joshtriplett 2019-03-27 18:15:25 -07:00
ops Rollup merge of #59190 - greg-kargin:master, r=sanxiyn 2019-03-22 19:31:21 +01:00
prelude Remove licenses 2018-12-25 21:08:33 -07:00
slice Rollup merge of #59328 - koalatux:iter-nth-back, r=scottmcm 2019-03-24 19:00:10 +08:00
str Improvements to comments in libstd, libcore, liballoc. 2019-03-11 02:25:44 +00:00
sync Bootstrap compiler update for 1.35 release 2019-03-02 09:05:34 -07:00
task Improvements to comments in libstd, libcore, liballoc. 2019-03-11 02:25:44 +00:00
tests Rollup merge of #59328 - koalatux:iter-nth-back, r=scottmcm 2019-03-24 19:00:10 +08:00
unicode Remove licenses 2018-12-25 21:08:33 -07:00
alloc.rs heading # Unsafety => # Safety in stdlib docs. 2019-02-25 08:01:35 +01:00
any.rs tests: doc comments 2019-02-10 23:42:32 +00:00
array.rs Use lifetime contravariance to elide more lifetimes in core+alloc+std 2019-03-09 19:10:28 -08:00
ascii.rs Remove licenses 2018-12-25 21:08:33 -07:00
borrow.rs Remove licenses 2018-12-25 21:08:33 -07:00
Cargo.toml std: Depend directly on crates.io crates 2018-12-11 21:08:22 -08:00
cell.rs Stabilize refcell_map_split feature 2019-03-18 15:06:34 -07:00
clone.rs Auto merge of #57125 - doitian:inconsistent-clone-doc, r=bluss 2019-01-01 20:50:13 +00:00
cmp.rs Clarify {Ord,f32,f64}::clamp docs a little 2019-03-25 12:52:42 +01:00
convert.rs Rollup merge of #59268 - estebank:from-string, r=QuietMisdreavus 2019-03-27 18:15:24 -07:00
default.rs libs: doc comments 2019-02-10 23:57:25 +00:00
ffi.rs core: ensure VaList passes improper_ctypes lint 2019-03-05 13:43:48 +00:00
hint.rs Fix undefined behavior in hint::spin_loop for x86 targets without SSE2 2019-03-21 14:23:29 +01:00
internal_macros.rs Use lifetime contravariance to elide more lifetimes in core+alloc+std 2019-03-09 19:10:28 -08:00
intrinsics.rs Bootstrap compiler update for 1.35 release 2019-03-02 09:05:34 -07:00
iter_private.rs Remove licenses 2018-12-25 21:08:33 -07:00
lib.rs Stabilize unrestricted_attribute_tokens 2019-02-25 23:21:54 +03:00
macros.rs Add todo!() macro 2019-03-18 19:27:31 +03:00
marker.rs Bootstrap compiler update for 1.35 release 2019-03-02 09:05:34 -07:00
mem.rs Improvements to comments in libstd, libcore, liballoc. 2019-03-11 02:25:44 +00:00
option.rs Update src/libcore/option.rs 2019-03-25 11:48:08 +01:00
panic.rs Remove licenses 2018-12-25 21:08:33 -07:00
panicking.rs Remove licenses 2018-12-25 21:08:33 -07:00
pin.rs Rollup merge of #58939 - taeguk:fix-doc-about-pin, r=rkruppe 2019-03-19 15:16:53 +01:00
ptr.rs Rollup merge of #59427 - czipperz:non_null_doc_links, r=Mark-Simulacrum 2019-03-26 22:26:45 +01:00
raw.rs Remove licenses 2018-12-25 21:08:33 -07:00
result.rs add missing braces 2019-03-25 11:50:11 +01:00
time.rs fix typo 2019-03-12 16:42:18 +03:00
tuple.rs Remove licenses 2018-12-25 21:08:33 -07:00
unit.rs Remove licenses 2018-12-25 21:08:33 -07:00