rust/library/core/src
Dylan DPC 7fb55b4c3a
Rollup merge of #94212 - scottmcm:swapper, r=dtolnay
Stop manually SIMDing in `swap_nonoverlapping`

Like I previously did for `reverse` (#90821), this leaves it to LLVM to pick how to vectorize it, since LLVM knows better which chunk size to use than the "32 bytes always" approach we currently have.

A variety of codegen tests are included to confirm that the various cases are still being vectorized.

It does still need logic to type-erase in some cases, though: while LLVM is now smart enough to vectorize over slices of things like `[u8; 4]`, it fails to do so over slices of `[u8; 3]`.
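
To make the type-erasure idea concrete, here is a minimal sketch, not the code from this PR. The function name `swap_rgb_slices` is hypothetical. It assumes the simplest possible erasure: viewing each `[u8; 3]` slice as a flat run of bytes and swapping byte by byte, so the optimizer sees a plain `u8` loop instead of a loop over 3-byte elements.

```rust
use std::slice;

/// Hypothetical helper (not part of `core`): swaps two equal-length slices of
/// `[u8; 3]` by viewing both as flat byte slices first.
pub fn swap_rgb_slices(a: &mut [[u8; 3]], b: &mut [[u8; 3]]) {
    assert_eq!(a.len(), b.len());

    // SAFETY: `[u8; 3]` has size 3, alignment 1 and no padding, so a slice of
    // `n` such arrays is exactly `n * 3` initialized bytes. The two `&mut`
    // borrows guarantee the regions do not overlap.
    let a_bytes = unsafe { slice::from_raw_parts_mut(a.as_mut_ptr().cast::<u8>(), a.len() * 3) };
    let b_bytes = unsafe { slice::from_raw_parts_mut(b.as_mut_ptr().cast::<u8>(), b.len() * 3) };

    // A plain element-wise loop; choosing a vector width is left to the compiler.
    for (x, y) in a_bytes.iter_mut().zip(b_bytes.iter_mut()) {
        std::mem::swap(x, y);
    }
}
```

Whether this particular loop is vectorized depends on the compiler version and target features, which is why the PR relies on codegen tests rather than on assumptions.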

As a bonus, this change also means one no longer gets the spurious `memcpy`(s?) at the end of swapping a slice of `__m256`s: <https://rust.godbolt.org/z/joofr4v8Y>

<details>

<summary>ASM for this example</summary>

## Before (from godbolt)

Note the `push`/`pop`s and the `memcpy` calls

```x86
swap_m256_slice:
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 32
        cmp     rsi, rcx
        jne     .LBB0_6
        mov     r14, rsi
        shl     r14, 5
        je      .LBB0_6
        mov     r15, rdx
        mov     rbx, rdi
        xor     eax, eax
.LBB0_3:
        mov     rcx, rax
        vmovaps ymm0, ymmword ptr [rbx + rax]
        vmovaps ymm1, ymmword ptr [r15 + rax]
        vmovaps ymmword ptr [rbx + rax], ymm1
        vmovaps ymmword ptr [r15 + rax], ymm0
        add     rax, 32
        add     rcx, 64
        cmp     rcx, r14
        jbe     .LBB0_3
        sub     r14, rax
        jbe     .LBB0_6
        add     rbx, rax
        add     r15, rax
        mov     r12, rsp
        mov     r13, qword ptr [rip + memcpy@GOTPCREL]
        mov     rdi, r12
        mov     rsi, rbx
        mov     rdx, r14
        vzeroupper
        call    r13
        mov     rdi, rbx
        mov     rsi, r15
        mov     rdx, r14
        call    r13
        mov     rdi, r15
        mov     rsi, r12
        mov     rdx, r14
        call    r13
.LBB0_6:
        add     rsp, 32
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        vzeroupper
        ret
```

## After (from my machine)

Note that there is no `rsp` manipulation; sorry for the different ASM syntax

```x86
swap_m256_slice:
	cmpq	%r9, %rdx
	jne	.LBB1_6
	testq	%rdx, %rdx
	je	.LBB1_6
	cmpq	$1, %rdx
	jne	.LBB1_7
	xorl	%r10d, %r10d
	jmp	.LBB1_4
.LBB1_7:
	movq	%rdx, %r9
	andq	$-2, %r9
	movl	$32, %eax
	xorl	%r10d, %r10d
	.p2align	4, 0x90
.LBB1_8:
	vmovaps	-32(%rcx,%rax), %ymm0
	vmovaps	-32(%r8,%rax), %ymm1
	vmovaps	%ymm1, -32(%rcx,%rax)
	vmovaps	%ymm0, -32(%r8,%rax)
	vmovaps	(%rcx,%rax), %ymm0
	vmovaps	(%r8,%rax), %ymm1
	vmovaps	%ymm1, (%rcx,%rax)
	vmovaps	%ymm0, (%r8,%rax)
	addq	$2, %r10
	addq	$64, %rax
	cmpq	%r10, %r9
	jne	.LBB1_8
.LBB1_4:
	testb	$1, %dl
	je	.LBB1_6
	shlq	$5, %r10
	vmovaps	(%rcx,%r10), %ymm0
	vmovaps	(%r8,%r10), %ymm1
	vmovaps	%ymm1, (%rcx,%r10)
	vmovaps	%ymm0, (%r8,%r10)
.LBB1_6:
	vzeroupper
	retq
```

</details>

This does all its copying operations as either the original type or as `MaybeUninit`s, so as far as I know there should be no potential abstract machine issues with reading padding bytes as integers.
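
As an illustration of why `MaybeUninit` matters here, the following is a sketch under stated assumptions rather than the PR's implementation. The `Padded` type and the `swap_padded` function are hypothetical. With `#[repr(C)]`, `Padded` has three padding bytes after `tag`; moving its memory around as `MaybeUninit<u32>` chunks never asserts that those bytes hold initialized integers.

```rust
use std::mem::{align_of, size_of, MaybeUninit};

/// Hypothetical type with padding: 1 byte of data, 3 bytes of padding, then a `u32`.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Padded {
    pub tag: u8,
    pub value: u32,
}

/// Hypothetical helper: swaps two equal-length slices chunk by chunk, where each
/// chunk is a `MaybeUninit<u32>` so that padding bytes may legally be copied.
pub fn swap_padded(a: &mut [Padded], b: &mut [Padded]) {
    type Chunk = MaybeUninit<u32>;

    assert_eq!(a.len(), b.len());
    // The element must be a whole number of suitably aligned chunks.
    assert_eq!(size_of::<Padded>() % size_of::<Chunk>(), 0);
    assert!(align_of::<Padded>() >= align_of::<Chunk>());

    let chunks = a.len() * (size_of::<Padded>() / size_of::<Chunk>());
    let pa = a.as_mut_ptr().cast::<Chunk>();
    let pb = b.as_mut_ptr().cast::<Chunk>();

    for i in 0..chunks {
        // SAFETY: both pointers are valid and aligned for `chunks` chunks (checked
        // above), and the two `&mut` borrows cannot overlap. Reading a
        // `MaybeUninit<u32>` is fine even if some of its bytes are padding.
        unsafe {
            let x = pa.add(i).read();
            let y = pb.add(i).read();
            pa.add(i).write(y);
            pb.add(i).write(x);
        }
    }
}
```

Reading the same memory as plain `u32` values instead would treat padding as initialized integer data, which is the abstract machine concern mentioned above.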

<details>

<summary>Perf is essentially unchanged</summary>

Though perhaps with more target features enabled this would help more, if LLVM could then pick bigger chunks.

## Before

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         894 ns/iter (+/- 11)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,476 ns/iter (+/- 2,784)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,257 ns/iter (+/- 7)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,922 ns/iter (+/- 959)
test slice::swap_with_slice_rgb_30                                 ... bench:         328 ns/iter (+/- 27)
test slice::swap_with_slice_rgb_3000                               ... bench:      16,215 ns/iter (+/- 176)
test slice::swap_with_slice_u8_30                                  ... bench:         312 ns/iter (+/- 9)
test slice::swap_with_slice_u8_3000                                ... bench:       5,401 ns/iter (+/- 123)
test slice::swap_with_slice_usize_30                               ... bench:         368 ns/iter (+/- 3)
test slice::swap_with_slice_usize_3000                             ... bench:      28,472 ns/iter (+/- 3,913)
```

## After

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         868 ns/iter (+/- 36)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,642 ns/iter (+/- 1,507)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,194 ns/iter (+/- 11)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,761 ns/iter (+/- 5,018)
test slice::swap_with_slice_rgb_30                                 ... bench:         324 ns/iter (+/- 6)
test slice::swap_with_slice_rgb_3000                               ... bench:      15,962 ns/iter (+/- 287)
test slice::swap_with_slice_u8_30                                  ... bench:         281 ns/iter (+/- 5)
test slice::swap_with_slice_u8_3000                                ... bench:       5,324 ns/iter (+/- 40)
test slice::swap_with_slice_usize_30                               ... bench:         275 ns/iter (+/- 5)
test slice::swap_with_slice_usize_3000                             ... bench:      28,277 ns/iter (+/- 277)
```

</details>
2022-02-24 21:42:14 +01:00
alloc Fix a bunch of typos 2021-12-14 16:40:43 +01:00
array Fix a typo in documentation of array::IntoIter::new_unchecked 2022-02-23 21:10:04 +03:00
async_iter Move {core,std}::stream::Stream to {core,std}::async_iter::AsyncIterator. 2022-02-03 21:03:06 +08:00
char fix 2022-02-17 22:14:54 -08:00
convert Rollup merge of #89869 - kpreid:from-doc, r=yaahc 2022-02-17 06:29:57 +01:00
fmt Suggest calling .display() on PathBuf too 2022-02-21 16:58:12 -08:00
future Rollup merge of #91192 - r00ster91:futuredocs, r=GuillaumeGomez 2022-02-21 19:36:46 +01:00
hash change PhantomData type for BuildHasherDefault 2022-01-07 00:39:48 +01:00
iter Add a try_collect() helper method to Iterator 2022-02-16 14:26:39 -08:00
macros add link to format_args! when being mentioned in doc 2022-02-12 12:35:30 +08:00
mem Stop manually SIMDing in swap_nonoverlapping 2022-02-21 00:54:02 -08:00
num Stabilise inherent_ascii_escape (FCP in #77174) 2022-02-12 13:21:59 -05:00
ops Rollup merge of #94283 - hellow554:stable_flow_control, r=Dylan-DPC 2022-02-24 07:48:08 +01:00
panic Rollup merge of #93613 - crlf0710:rename_to_async_iter, r=yaahc 2022-02-18 16:23:32 +01:00
prelude update cfg(bootstrap)s 2022-01-28 15:01:07 +01:00
ptr Stop manually SIMDing in swap_nonoverlapping 2022-02-21 00:54:02 -08:00
slice Rollup merge of #93686 - dbrgn:trim-on-byte-slices, r=joshtriplett 2022-02-20 00:37:23 +01:00
str Add {floor,ceil}_char_boundary methods to str 2022-02-07 13:34:08 -05:00
sync Rollup merge of #89869 - kpreid:from-doc, r=yaahc 2022-02-17 06:29:57 +01:00
task Rollup merge of #89869 - kpreid:from-doc, r=yaahc 2022-02-17 06:29:57 +01:00
unicode Regenerate tables for Unicode 14.0.0 2021-10-06 17:49:33 -07:00
any.rs Reverts #92135 because perf regression 2021-12-26 16:02:33 +03:00
ascii.rs Add #[must_use] to remaining core functions 2021-10-30 18:21:29 -04:00
bool.rs Constify bool::then{,_some} 2021-12-15 00:11:23 +08:00
borrow.rs Make Borrow and BorrowMut impls const 2021-12-04 21:57:39 +09:00
cell.rs Rollup merge of #89869 - kpreid:from-doc, r=yaahc 2022-02-17 06:29:57 +01:00
clone.rs Update Copy/Clone documentation WRT arrays 2021-11-08 13:11:59 -05:00
cmp.rs Edit docs introduction for std::cmp::PartialOrd 2022-01-28 00:46:04 -06:00
default.rs Add #[must_use] to remaining core functions 2021-10-30 18:21:29 -04:00
ffi.rs Use target_family = "wasm" 2021-11-10 08:35:42 -08:00
hint.rs Add is_riscv_feature_detected!; modify impl of hint::spin_loop 2022-01-05 15:44:52 +08:00
internal_macros.rs Added docs to internal_macro const 2021-10-22 10:07:35 +13:00
intrinsics.rs Document about some behaviors of const_(de)allocate and add some tests. 2022-01-29 19:13:23 +09:00
lazy.rs Rollup merge of #89869 - kpreid:from-doc, r=yaahc 2022-02-17 06:29:57 +01:00
lib.rs Rollup merge of #93613 - crlf0710:rename_to_async_iter, r=yaahc 2022-02-18 16:23:32 +01:00
marker.rs Update Copy/Clone documentation WRT arrays 2021-11-08 13:11:59 -05:00
option.rs Option::and_then basic example: show failure 2022-02-12 12:23:38 +08:00
panic.rs Fix invalid special casing of the unreachable! macro 2022-01-31 17:09:31 +01:00
panicking.rs Guard against unwinding in cleanup code 2022-02-13 03:10:09 +00:00
pin.rs Rollup merge of #94128 - mqy:master, r=Dylan-DPC 2022-02-23 12:26:40 +01:00
primitive.rs mv std libs to library/ 2020-07-27 19:51:13 -05:00
primitive_docs.rs Fix annotation of code blocks 2022-02-01 21:44:53 +00:00
result.rs Add note on Windows path behaviour 2022-02-12 12:52:42 +08:00
time.rs Improve Duration::try_from_secs_f32/64 accuracy by directly processing exponent and mantissa 2022-01-26 18:14:25 +03:00
tuple.rs mv std libs to library/ 2020-07-27 19:51:13 -05:00
unit.rs mv std libs to library/ 2020-07-27 19:51:13 -05:00