rust/library/core/src at ec44d48ae396e596f07ecc496f95da2b5ec36223 - user0/rust

Dylan DPC 7fb55b4c3a Rollup merge of #94212 - scottmcm:swapper, r=dtolnay

Stop manually SIMDing in `swap_nonoverlapping`

Like I previously did for `reverse` (#90821), this leaves it to LLVM to pick how to vectorize it, since it can know better the chunk size to use, compared to the "32 bytes always" approach we currently have.

A variety of codegen tests are included to confirm that the various cases are still being vectorized.

It does still need logic to type-erase in some cases, though, as while LLVM is now smart enough to vectorize over slices of things like `[u8; 4]`, it fails to do so over slices of `[u8; 3]`.

As a bonus, this change also means one no longer gets the spurious `memcpy`(s?) at the end when swapping a slice of `__m256`s: <https://rust.godbolt.org/z/joofr4v8Y>

<details>
<summary>ASM for this example</summary>

## Before (from godbolt)

note the `push`/`pop`s and `memcpy`

```x86
swap_m256_slice:
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 32
        cmp     rsi, rcx
        jne     .LBB0_6
        mov     r14, rsi
        shl     r14, 5
        je      .LBB0_6
        mov     r15, rdx
        mov     rbx, rdi
        xor     eax, eax
.LBB0_3:
        mov     rcx, rax
        vmovaps ymm0, ymmword ptr [rbx + rax]
        vmovaps ymm1, ymmword ptr [r15 + rax]
        vmovaps ymmword ptr [rbx + rax], ymm1
        vmovaps ymmword ptr [r15 + rax], ymm0
        add     rax, 32
        add     rcx, 64
        cmp     rcx, r14
        jbe     .LBB0_3
        sub     r14, rax
        jbe     .LBB0_6
        add     rbx, rax
        add     r15, rax
        mov     r12, rsp
        mov     r13, qword ptr [rip + memcpy@GOTPCREL]
        mov     rdi, r12
        mov     rsi, rbx
        mov     rdx, r14
        vzeroupper
        call    r13
        mov     rdi, rbx
        mov     rsi, r15
        mov     rdx, r14
        call    r13
        mov     rdi, r15
        mov     rsi, r12
        mov     rdx, r14
        call    r13
.LBB0_6:
        add     rsp, 32
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        vzeroupper
        ret
```

## After (from my machine)

Note no `rsp` manipulation, sorry for different ASM syntax

```x86
swap_m256_slice:
        cmpq    %r9, %rdx
        jne     .LBB1_6
        testq   %rdx, %rdx
        je      .LBB1_6
        cmpq    $1, %rdx
        jne     .LBB1_7
        xorl    %r10d, %r10d
        jmp     .LBB1_4
.LBB1_7:
        movq    %rdx, %r9
        andq    $-2, %r9
        movl    $32, %eax
        xorl    %r10d, %r10d
        .p2align 4, 0x90
.LBB1_8:
        vmovaps -32(%rcx,%rax), %ymm0
        vmovaps -32(%r8,%rax), %ymm1
        vmovaps %ymm1, -32(%rcx,%rax)
        vmovaps %ymm0, -32(%r8,%rax)
        vmovaps (%rcx,%rax), %ymm0
        vmovaps (%r8,%rax), %ymm1
        vmovaps %ymm1, (%rcx,%rax)
        vmovaps %ymm0, (%r8,%rax)
        addq    $2, %r10
        addq    $64, %rax
        cmpq    %r10, %r9
        jne     .LBB1_8
.LBB1_4:
        testb   $1, %dl
        je      .LBB1_6
        shlq    $5, %r10
        vmovaps (%rcx,%r10), %ymm0
        vmovaps (%r8,%r10), %ymm1
        vmovaps %ymm1, (%rcx,%r10)
        vmovaps %ymm0, (%r8,%r10)
.LBB1_6:
        vzeroupper
        retq
```

</details>

This does all its copying operations as either the original type or as `MaybeUninit`s, so as far as I know there should be no potential abstract machine issues with reading padding bytes as integers.

<details>
<summary>Perf is essentially unchanged</summary>

Though perhaps with more target features this would help more, if it could pick bigger chunks

## Before

```
running 10 tests
test slice::swap_with_slice_4x_usize_30   ... bench:         894 ns/iter (+/- 11)
test slice::swap_with_slice_4x_usize_3000 ... bench:      99,476 ns/iter (+/- 2,784)
test slice::swap_with_slice_5x_usize_30   ... bench:       1,257 ns/iter (+/- 7)
test slice::swap_with_slice_5x_usize_3000 ... bench:     139,922 ns/iter (+/- 959)
test slice::swap_with_slice_rgb_30        ... bench:         328 ns/iter (+/- 27)
test slice::swap_with_slice_rgb_3000      ... bench:      16,215 ns/iter (+/- 176)
test slice::swap_with_slice_u8_30         ... bench:         312 ns/iter (+/- 9)
test slice::swap_with_slice_u8_3000       ... bench:       5,401 ns/iter (+/- 123)
test slice::swap_with_slice_usize_30      ... bench:         368 ns/iter (+/- 3)
test slice::swap_with_slice_usize_3000    ... bench:      28,472 ns/iter (+/- 3,913)
```

## After

```
running 10 tests
test slice::swap_with_slice_4x_usize_30   ... bench:         868 ns/iter (+/- 36)
test slice::swap_with_slice_4x_usize_3000 ... bench:      99,642 ns/iter (+/- 1,507)
test slice::swap_with_slice_5x_usize_30   ... bench:       1,194 ns/iter (+/- 11)
test slice::swap_with_slice_5x_usize_3000 ... bench:     139,761 ns/iter (+/- 5,018)
test slice::swap_with_slice_rgb_30        ... bench:         324 ns/iter (+/- 6)
test slice::swap_with_slice_rgb_3000      ... bench:      15,962 ns/iter (+/- 287)
test slice::swap_with_slice_u8_30         ... bench:         281 ns/iter (+/- 5)
test slice::swap_with_slice_u8_3000       ... bench:       5,324 ns/iter (+/- 40)
test slice::swap_with_slice_usize_30      ... bench:         275 ns/iter (+/- 5)
test slice::swap_with_slice_usize_3000    ... bench:      28,277 ns/iter (+/- 277)
```

</details>

2022-02-24 21:42:14 +01:00
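The commit message above describes swapping slices without hand-written SIMD, with copies done either as the original type or as `MaybeUninit`s. The following is a minimal sketch of that idea, not the code from the PR: the helper names `swap_slices_simple` and `swap_slices_maybe_uninit` are hypothetical and exist only for illustration, and whether the loops actually vectorize depends on the compiler version and target features.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical helper (not the standard library implementation): swaps two
/// equal-length slices with a plain element-wise loop and leaves any
/// vectorization to the optimizer.
pub fn swap_slices_simple<T>(a: &mut [T], b: &mut [T]) {
    assert_eq!(a.len(), b.len(), "slices must have the same length");
    for (x, y) in a.iter_mut().zip(b.iter_mut()) {
        std::mem::swap(x, y);
    }
}

/// Hypothetical helper: the same swap, but each element is moved through a
/// `MaybeUninit<T>` temporary, so the bytes (padding included) are never
/// read as integers.
pub fn swap_slices_maybe_uninit<T>(a: &mut [T], b: &mut [T]) {
    assert_eq!(a.len(), b.len(), "slices must have the same length");
    let len = a.len();
    // `MaybeUninit<T>` has the same size and alignment as `T`.
    let pa = a.as_mut_ptr().cast::<MaybeUninit<T>>();
    let pb = b.as_mut_ptr().cast::<MaybeUninit<T>>();
    for i in 0..len {
        // SAFETY: `i < len` for both slices, and two `&mut` slices cannot
        // overlap, so every read and write stays in bounds and is disjoint.
        unsafe {
            let tmp: MaybeUninit<T> = ptr::read(pa.add(i));
            ptr::copy_nonoverlapping(pb.add(i), pa.add(i), 1);
            ptr::write(pb.add(i), tmp);
        }
    }
}

/// The stable standard library entry point for the same operation.
pub fn swap_slices_std<T>(a: &mut [T], b: &mut [T]) {
    a.swap_with_slice(b);
}
```

In ordinary code, `slice::swap_with_slice` (or `ptr::swap_nonoverlapping` for raw pointers) is the call to reach for; the hand-rolled helpers only make the "copy as `MaybeUninit`" step visible.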
..
alloc	Fix a bunch of typos	2021-12-14 16:40:43 +01:00
array	Fix a typo in documentation of `array::IntoIter::new_unchecked`	2022-02-23 21:10:04 +03:00
async_iter	Move `{core,std}::stream::Stream` to `{core,std}::async_iter::AsyncIterator`.	2022-02-03 21:03:06 +08:00
char	fix	2022-02-17 22:14:54 -08:00
convert	Rollup merge of #89869 - kpreid:from-doc, r=yaahc	2022-02-17 06:29:57 +01:00
fmt	Suggest calling .display() on PathBuf too	2022-02-21 16:58:12 -08:00
future	Rollup merge of #91192 - r00ster91:futuredocs, r=GuillaumeGomez	2022-02-21 19:36:46 +01:00
hash	change PhantomData type for BuildHasherDefault	2022-01-07 00:39:48 +01:00
iter	Add a `try_collect()` helper method to `Iterator`	2022-02-16 14:26:39 -08:00
macros	add link to format_args! when being mentioned in doc	2022-02-12 12:35:30 +08:00
mem	Stop manually SIMDing in swap_nonoverlapping	2022-02-21 00:54:02 -08:00
num	Stabilise inherent_ascii_escape (FCP in #77174)	2022-02-12 13:21:59 -05:00
ops	Rollup merge of #94283 - hellow554:stable_flow_control, r=Dylan-DPC	2022-02-24 07:48:08 +01:00
panic	Rollup merge of #93613 - crlf0710:rename_to_async_iter, r=yaahc	2022-02-18 16:23:32 +01:00
prelude	update cfg(bootstrap)s	2022-01-28 15:01:07 +01:00
ptr	Stop manually SIMDing in swap_nonoverlapping	2022-02-21 00:54:02 -08:00
slice	Rollup merge of #93686 - dbrgn:trim-on-byte-slices, r=joshtriplett	2022-02-20 00:37:23 +01:00
str	Add {floor,ceil}_char_boundary methods to str	2022-02-07 13:34:08 -05:00
sync	Rollup merge of #89869 - kpreid:from-doc, r=yaahc	2022-02-17 06:29:57 +01:00
task	Rollup merge of #89869 - kpreid:from-doc, r=yaahc	2022-02-17 06:29:57 +01:00
unicode	Regenerate tables for Unicode 14.0.0	2021-10-06 17:49:33 -07:00
any.rs	Reverts #92135 because perf regression	2021-12-26 16:02:33 +03:00
ascii.rs	Add #[must_use] to remaining core functions	2021-10-30 18:21:29 -04:00
bool.rs	Constify `bool::then{,_some}`	2021-12-15 00:11:23 +08:00
borrow.rs	Make `Borrow` and `BorrowMut` impls `const`	2021-12-04 21:57:39 +09:00
cell.rs	Rollup merge of #89869 - kpreid:from-doc, r=yaahc	2022-02-17 06:29:57 +01:00
clone.rs	Update Copy/Clone documentation WRT arrays	2021-11-08 13:11:59 -05:00
cmp.rs	Edit docs introduction for `std::cmp::PartialOrd`	2022-01-28 00:46:04 -06:00
default.rs	Add #[must_use] to remaining core functions	2021-10-30 18:21:29 -04:00
ffi.rs	Use `target_family = "wasm"`	2021-11-10 08:35:42 -08:00
hint.rs	Add is_riscv_feature_detected!; modify impl of hint::spin_loop	2022-01-05 15:44:52 +08:00
internal_macros.rs	Added docs to internal_macro const	2021-10-22 10:07:35 +13:00
intrinsics.rs	Document about some behaviors of `const_(de)allocate` and add some tests.	2022-01-29 19:13:23 +09:00
lazy.rs	Rollup merge of #89869 - kpreid:from-doc, r=yaahc	2022-02-17 06:29:57 +01:00
lib.rs	Rollup merge of #93613 - crlf0710:rename_to_async_iter, r=yaahc	2022-02-18 16:23:32 +01:00
marker.rs	Update Copy/Clone documentation WRT arrays	2021-11-08 13:11:59 -05:00
option.rs	`Option::and_then` basic example: show failure	2022-02-12 12:23:38 +08:00
panic.rs	Fix invalid special casing of the unreachable! macro	2022-01-31 17:09:31 +01:00
panicking.rs	Guard against unwinding in cleanup code	2022-02-13 03:10:09 +00:00
pin.rs	Rollup merge of #94128 - mqy:master, r=Dylan-DPC	2022-02-23 12:26:40 +01:00
primitive.rs	mv std libs to library/	2020-07-27 19:51:13 -05:00
primitive_docs.rs	Fix annotation of code blocks	2022-02-01 21:44:53 +00:00
result.rs	Add note on Windows path behaviour	2022-02-12 12:52:42 +08:00
time.rs	Improve Duration::try_from_secs_f32/64 accuracy by directly processing exponent and mantissa	2022-01-26 18:14:25 +03:00
tuple.rs	mv std libs to library/	2020-07-27 19:51:13 -05:00
unit.rs	mv std libs to library/	2020-07-27 19:51:13 -05:00