tests: adapt align-offset.rs for InstCombine improvements in LLVM 23
Upstream [has improved InstCombine](8d2078332c) so that it can shrink added constants using known zeroes, which caused a small change in this test's output. As far as I can tell either output is fine, so we just accept both.
@rustbot label: +llvm-main
These tests will check:
- correct assembly emission for the soft-float target (see the sketch below)
- that enabling incompatible target features emits warnings/errors
- that crates built for incompatible target triples will not link
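For the first point, a hypothetical sketch of what such an assembly test could look like (illustrative target triple and patterns; the real test may differ):

```rust
//@ assembly-output: emit-asm
//@ compile-flags: --target s390x-unknown-none
#![crate_type = "lib"]
#![no_std]

// Under soft-float, f64 arithmetic must lower to a compiler-builtins
// libcall instead of touching floating-point (and thus vector) registers.
#[no_mangle]
pub extern "C" fn add(a: f64, b: f64) -> f64 {
    // CHECK-LABEL: add:
    // CHECK: __adddf3
    a + b
}
```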
This target is intended to be used for kernel development. Because the float and vector registers overlap on s390x, we have to disable the vector extension. The default s390x-unknown-linux-gnu target does not allow the use of soft-float.
Co-authored-by: Jubilee <workingjubilee@gmail.com>
The existing `aarch64-unknown-none` target assumes Armv8.0-A as a baseline. However, Arm recently released the Arm Cortex-R82 processor, which is the first to implement the AArch64 mode of the Armv8-R architecture. This architecture is similar to Armv8-A AArch64, but it has a different set of mandatory features and is based on Armv8.4. It is largely unrelated to the existing Armv8-R architecture target (`armv8r-none-eabihf`), which only operates in AArch32 mode.
The second target, `aarch64v8r-unknown-none-softfloat`, allows for possible Armv8-R AArch64 CPUs with no FPU, or for use cases where FPU register stacking is not desired. As with the existing `aarch64-unknown-none` target, we have coupled FPU support and Neon support together: there is no 'has FPU but does not have NEON' target proposed, even though the architecture technically allows for it.
This PR was developed by Ferrous Systems on behalf of Arm. Arm is the owner of these changes.
The SSE2 helper function is not inlined across crate boundaries,
so we cannot verify the codegen in an assembly test. The fix is
still verified by the absence of performance regression.
Fix cstring-merging test for Hexagon target
The Hexagon assembler uses the `.string` directive instead of `.asciz` for null-terminated strings. Both are equivalent, but the test was only checking for `.asciz`.
Update the CHECK patterns to accept both directives using the `.{{asciz|string}}` regex pattern.
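For illustration, a CHECK line in the updated test has roughly this shape (hypothetical string contents):

```rust
// CHECK: .{{asciz|string}} "hello"
```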
Three targets, covering A32 and T32 instructions, and soft-float and hard-float ABIs. Hard-float is not available in Thumb mode. Atomics in Thumb mode require the `__sync*` functions from compiler-builtins.
Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native
## Summary
This PR fixes a severe performance regression in `slice::is_ascii` on AVX-512 CPUs when compiling with `-C target-cpu=native`.
On affected systems, the current implementation achieves only ~3 GB/s for large inputs, compared to ~60–70 GB/s previously (≈20–24× regression). This PR restores the original performance characteristics.
This change is intended as a **temporary workaround** for poor upstream LLVM codegen. Once the underlying LLVM issue is fixed and Rust is able to consume that fix, this workaround should be reverted.
## Problem
When `is_ascii` is compiled with AVX-512 enabled, LLVM's auto-vectorization generates ~31 `kshiftrd` instructions to extract mask bits one-by-one, instead of using the efficient `pmovmskb`
instruction. This causes a **~22x performance regression**.
Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings, affecting anyone using `-C target-cpu=native` on AVX-512 CPUs.
## Root cause (upstream)
The underlying issue appears to be an LLVM vectorizer/backend bug affecting certain AVX-512 patterns.
An upstream issue has been filed by @folkertdev to track the root cause: llvm/llvm-project#176906
Until this is resolved in LLVM and picked up by rustc, this PR avoids triggering the problematic codegen pattern.
## Solution
Replace the counting loop with explicit SSE2 intrinsics (`_mm_movemask_epi8`) that force `pmovmskb` codegen regardless of CPU features.
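A minimal sketch of the idea for one 16-byte chunk (hypothetical helper; the real implementation in `core` also handles tails and combines chunks with `_mm_or_si128`):

```rust
#[cfg(target_arch = "x86_64")]
fn chunk_is_ascii(chunk: &[u8; 16]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
    // SAFETY: SSE2 is part of the x86_64 baseline, so these intrinsics
    // are always available on this target.
    unsafe {
        let v = _mm_loadu_si128(chunk.as_ptr().cast::<__m128i>());
        // `pmovmskb` gathers the high bit of each byte into an i32 mask;
        // any set bit means a byte >= 0x80, i.e. non-ASCII.
        _mm_movemask_epi8(v) == 0
    }
}
```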
## Godbolt Links (Rust 1.92)
| Pattern | Target | Link | Result |
|---------|--------|------|--------|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | `pmovmskb` |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x `kshiftrd` (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | `pmovmskb` |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | `vpmovmskb` (fixed) |
## Benchmark Results
**CPU:** AMD Ryzen 5 7500F (Zen 4 with AVX-512)
### Default Target (SSE2) — Mixed
| Size | Before | After | Change |
|------|--------|-------|--------|
| 4 B | 1.8 GB/s | 2.0 GB/s | **+11%** |
| 8 B | 3.2 GB/s | 5.8 GB/s | **+81%** |
| 16 B | 5.3 GB/s | 8.5 GB/s | **+60%** |
| 32 B | 17.7 GB/s | 15.8 GB/s | -11% |
| 64 B | 28.6 GB/s | 25.1 GB/s | -12% |
| 256 B | 51.5 GB/s | 48.6 GB/s | ~same |
| 1 KB | 64.9 GB/s | 60.7 GB/s | ~same |
| 4 KB+ | ~68-70 GB/s | ~68-72 GB/s | ~same |
### Native Target (AVX-512) — Up to 24x Faster
| Size | Before | After | Speedup |
|------|--------|-------|---------|
| 4 B | 1.2 GB/s | 2.0 GB/s | **1.7x** |
| 8 B | 1.6 GB/s | 5.0 GB/s | **3.3x** |
| 16 B | ~7 GB/s | ~7 GB/s | ~same |
| 32 B | 2.9 GB/s | 14.2 GB/s | **4.9x** |
| 64 B | 2.9 GB/s | 23.2 GB/s | **8x** |
| 256 B | 2.9 GB/s | 47.2 GB/s | **16x** |
| 1 KB | 2.8 GB/s | 60.0 GB/s | **21x** |
| 4 KB+ | 2.9 GB/s | ~68-70 GB/s | **23-24x** |
### Summary
- **SSE2 (default):** Small inputs (4-16 B) 11-81% faster; 32-64 B ~11% slower; large inputs unchanged
- **AVX-512 (native):** 21-24x faster for inputs ≥1 KB, peak ~70 GB/s (was ~3 GB/s)
Note: these numbers are for the pure-ASCII path, but the story is similar for the other paths.
See the linked bench project.
## Test Plan
- [x] Assembly test (`slice-is-ascii-avx512.rs`) verifies no `kshiftrd` with AVX-512
- [x] Existing codegen test updated to `loongarch64`-only (auto-vectorization still used there)
- [x] Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
- [x] Benchmarks confirm performance improvement
- [x] Tidy checks pass
## Reproduction / Test Projects
Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation
- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer
## Related Issues
- Upstream issue opened by @folkertdev: llvm/llvm-project#176906
- Regression introduced in https://github.com/rust-lang/rust/pull/130733
Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R
We currently have targets for bare-metal Armv7-R, Armv7-A and Armv8-R, but only in Arm mode. This PR adds five new targets enabling bare-metal support on these architectures in Thumb mode.
This has been tested using https://github.com/rust-embedded/aarch32/compare/main...thejpster:aarch32:support-thumb-mode-v7-v8?expand=1 and they all seem to work as expected.
However, I wasn't sure what to do with the maintainer lists: these are five new targets, but they share the docs page with the existing Arm versions. I can ask the Embedded Devices WG Arm Team about taking these on too, but whether Arm themselves want to take them on is, I guess, a bigger question.
Add -Z large-data-threshold
This flag allows specifying the threshold size for placing static data in large data sections when using the medium code model on x86-64.
When using -Ccode-model=medium, data smaller than this threshold uses RIP-relative addressing (32-bit offsets), while larger data uses absolute 64-bit addressing. This allows the compiler to generate more efficient code for smaller data while still supporting data larger than 2GB.
This mirrors the -mlarge-data-threshold flag available in GCC and Clang. The default threshold is 65536 bytes (64KB) if not specified, matching LLVM's default behavior.
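A small illustration of the intended effect, assuming x86-64 ELF with the medium code model (the invocation and section names are illustrative):

```rust
// Compile with, e.g.:
//   rustc -Ccode-model=medium -Zlarge-data-threshold=65536 example.rs

// Below the threshold: reachable with RIP-relative (32-bit) addressing,
// placed in the ordinary data sections.
static SMALL: [u8; 4096] = [1; 4096];

// At or above the threshold: addressed with 64-bit absolute addressing,
// placed in the large data sections (e.g. .lrodata/.ldata).
static LARGE: [u8; 1 << 20] = [1; 1 << 20];

fn main() {
    // Reference the statics so they are actually emitted.
    println!("{} {}", SMALL[0], LARGE[0]);
}
```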
Combine the x86_64 and loongarch64 is_ascii tests into a single file
using compiletest revisions. Both now test assembly output:
- X86_64: Verifies no broken kshiftrd/kshiftrq instructions (AVX-512 fix)
- LA64: Verifies vmskltz.b instruction is used (auto-vectorization)
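A sketch of the combined test's shape (illustrative directives and flags; the real file differs in details):

```rust
//@ assembly-output: emit-asm
//@ revisions: X86_64 LA64
//@[X86_64] compile-flags: --target x86_64-unknown-linux-gnu -C target-cpu=x86-64-v4
//@[LA64] compile-flags: --target loongarch64-unknown-linux-gnu
#![crate_type = "lib"]

#[no_mangle]
pub fn all_ascii(bytes: &[u8]) -> bool {
    // X86_64-NOT: kshiftrd
    // X86_64-NOT: kshiftrq
    // LA64: vmskltz.b
    bytes.is_ascii()
}
```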
As suggested, in order to distribute sanitizer-instrumented standard
libraries without introducing new rustc flags, this adds a new dedicated
target. With this target, we can distribute the instrumented standard
libraries through a separate rustup component.
When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on
AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is
marked `#[inline]`, it gets inlined and recompiled with the user's
target settings. The previous implementation used a counting loop that
LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled,
LLVM uses k-registers and extracts bits individually with ~31
`kshiftrd` instructions.
This fix replaces the counting loop with explicit SSE2 intrinsics
(`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64.
`_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen
regardless of CPU features.
Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):
- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With -C target-cpu=native: ~3 GB/s → ~67 GB/s (22x improvement)
The loongarch64 implementation retains the original counting loop
since it doesn't have this issue.
Regression from: https://github.com/rust-lang/rust/pull/130733
Add `f16` inline ASM support for s390x
tracking issue: https://github.com/rust-lang/rust/issues/116909
cc https://github.com/rust-lang/rust/issues/125398
Support the `f16x8` type in inline assembly. Only with the `nnp-assist` target feature are there any instructions that make use of this type. Based on the riscv implementation, I now cast to `i16x8` when that feature is not enabled.
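A nightly-only sketch of what this enables (it assumes the unstable `f16`, `repr_simd`, and experimental asm-register features plus a target with the vector facility; details may differ from the final implementation):

```rust
#![feature(f16, repr_simd, asm_experimental_reg)]
use core::arch::asm;

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct F16x8([f16; 8]);

pub fn copy_via_vreg(x: F16x8) -> F16x8 {
    let y: F16x8;
    // `vlr` is a plain vector-register copy, available with the base
    // vector facility; no nnp-assist instructions are needed for this.
    unsafe { asm!("vlr {y}, {x}", x = in(vreg) x, y = out(vreg) y) };
    y
}
```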
As far as I'm aware there are no instructions operating on `f16` scalar values. Should we still add support for using them in inline assembly?
r? @tgross35
cc @uweigand
Add `Ordering` enum to minicore.rs and import minicore in the "tests/assembly-llvm/rust-abi-arg-attr.rs" test file
This adds the `Ordering` enum to `minicore.rs`.
Consequently, this updates `tests/assembly-llvm/rust-abi-arg-attr.rs` to import `minicore` directly. Previously, this test file contained its own copies of traits like `Copy`, `Clone`, and `PointeeSized`, which were giving a duplicate lang item error; those are replaced by importing `minicore` completely.
Add regression test for #120189
This PR adds regression tests for rust-lang/rust#120189.
I added tests to verify vectorization of loops inside closures.
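A sketch of the shape of such a test (hypothetical function and patterns; the actual tests differ):

```rust
//@ compile-flags: -Copt-level=3
#![crate_type = "lib"]

// The loop lives inside a closure; the test asserts that LLVM still
// auto-vectorizes it into vector loads.
#[no_mangle]
pub fn sum(v: &[i32]) -> i32 {
    let inner = || {
        let mut acc = 0;
        for &x in v {
            acc += x;
        }
        acc
    };
    // CHECK-LABEL: @sum
    // CHECK: load <{{[0-9]+}} x i32>
    inner()
}
```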
Set -Cpanic=abort in windows-msvc stack protector tests
I ran into a test failure with the 32-bit windows test on https://github.com/rust-lang/rust/pull/117192, one of the tests has been incorrectly passing (until my change!) because it is picking up the stack protector from another function. I've tried to prevent that happening again by adding CHECK-DAGs for the start and end of each function.
I've also done my best to correct the comments; some were based on the fact that we used to run these tests with unwinding panics, but LLVM doesn't add protectors to functions with SEH funclets, so it's much more straightforward for these tests to use `-Cpanic=abort`.
Test the coexistence of 'stack-protector' and 'safe-stack'
This is a test for the coexistence of 'stack-protector' and 'safe-stack', supplementing PR rust-lang/rust#147115. After the solution to issue rust-lang/rust#149340, I rewrote a version using minicore to circumvent the 'abi_mismatch' error.
r? `@SparrowLii` (Do you have time to review it?)
Rust's current mangling scheme depends on compiler internals; loses
information about generic parameters (and other things) which makes for
a worse experience when using external tools that need to interact with
Rust symbol names; is inconsistent; and can contain `.` characters
which aren't universally supported. Therefore, Rust has defined its own
symbol mangling scheme which is defined in terms of the Rust language,
not the compiler implementation; encodes information about generic
parameters in a reversible way; has a consistent definition; and
generates symbols that only use the characters `A-Z`, `a-z`, `0-9`, and
`_`.
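As a rough illustration (hand-written symbols, not actual compiler output; real symbols also carry crate disambiguators or hashes):

```rust
// fn foo<T>() {} in crate `mycrate`, instantiated as foo::<i32>:
//
// legacy: _ZN7mycrate3foo17h0123456789abcdefE
//         (generic arguments are lost; an opaque hash is appended)
// v0:     _RINvC7mycrate3foolE
//         (`l` encodes i32, so the symbol demangles back to
//          mycrate::foo::<i32>; only A-Z, a-z, 0-9 and _ are used)
```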
Support for the new Rust symbol mangling scheme has been added to
upstream tools that will need to interact with Rust symbols (e.g.
debuggers).
This commit changes the default symbol mangling scheme from the legacy
scheme to the new Rust mangling scheme.
Signed-off-by: David Wood <david.wood@huawei.com>
Add alignment parameter to `simd_masked_{load,store}`
This PR adds an alignment parameter to `simd_masked_load` and `simd_masked_store`, in the form of a const-generic enum `core::intrinsics::simd::SimdAlign`. This represents the alignment of the `ptr` argument in these intrinsics as follows:
- `SimdAlign::Unaligned` - `ptr` is unaligned/1-byte aligned
- `SimdAlign::Element` - `ptr` is aligned to the element type of the SIMD vector (default behavior in the old signature)
- `SimdAlign::Vector` - `ptr` is aligned to the SIMD vector type
The main motivation for this is stdarch: most vector loads are either fully aligned (to the vector size) or unaligned (byte-aligned), so the previous signature doesn't cut it.
Now, stdarch will mostly use `SimdAlign::Unaligned` and `SimdAlign::Vector`, whereas portable-simd will use `SimdAlign::Element`.
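A runnable model of the design, assuming nightly's `adt_const_params` (these are not the real intrinsic signatures, just the const-generic-enum idea):

```rust
#![feature(adt_const_params)]
use std::marker::ConstParamTy;
use std::mem::{align_of, size_of};

#[derive(PartialEq, Eq, ConstParamTy)]
pub enum SimdAlign {
    Unaligned, // `ptr` may be only 1-byte aligned
    Element,   // `ptr` is aligned to the element type (old behavior)
    Vector,    // `ptr` is aligned to the whole vector
}

/// The alignment, in bytes, a backend may assume for a load of `LANES`
/// elements of type `T` (vector case simplified to the vector's size).
const fn assumed_align<T, const LANES: usize, const ALIGN: SimdAlign>() -> usize {
    match ALIGN {
        SimdAlign::Unaligned => 1,
        SimdAlign::Element => align_of::<T>(),
        SimdAlign::Vector => size_of::<T>() * LANES,
    }
}

fn main() {
    assert_eq!(assumed_align::<u32, 4, { SimdAlign::Unaligned }>(), 1);
    assert_eq!(assumed_align::<u32, 4, { SimdAlign::Element }>(), 4);
    assert_eq!(assumed_align::<u32, 4, { SimdAlign::Vector }>(), 16);
}
```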
- [x] `cg_llvm`
- [x] `cg_clif`
- [x] `miri`/`const_eval`
## Alternatives
Using a const-generic/"const" `u32` parameter as the alignment (erroring during codegen if the argument is not a power of two). This, although more flexible than the enum, has a few drawbacks:
- If we use a const-generic argument, then portable-simd somehow needs to pass `align_of::<T>()` as the alignment, which isn't possible without GCE
- "const" function parameters are just an ugly hack, and a pain to deal with in non-LLVM backends
We can remedy the problem with the const-generic `u32` parameter by adding a special rule for the element-alignment case (e.g. `0` can mean "use the alignment of the element type"), but I feel this is not as expressive as the enum approach, although I am open to suggestions.
cc `@workingjubilee` `@RalfJung` `@BoxyUwU`