Move from #[inline(always)] to #[inline] (#306)

* Move from #[inline(always)] to #[inline]

This commit blanket-changes all `#[inline(always)]` annotations to `#[inline]`.
Fear not, though: this should not be a regression! The change is made for
correctness, to ensure that we don't hit stray LLVM errors.

Most of the LLVM intrinsics and LLVM functions that we actually lower down to
only work correctly when invoked from a function with an appropriate
target-feature set. For example, if we were to invoke an AVX intrinsic out of
the blue we would get a [codegen error][avx-error], because the surrounding
function doesn't enable the AVX feature. In general we don't have a lot of
control over how this crate is consumed by downstream crates, and it would be
a pretty bad state of affairs if every misuse showed up as a scary,
un-debuggable codegen error in LLVM!

On the other side of this issue *we* as the invokers of these intrinsics are
"doing the right thing". All our functions in this crate are tagged
appropriately with target features to be codegen'd correctly. Indeed we have
plenty of tests asserting that we can codegen everything across multiple
platforms!

The error comes about here precisely because of the `#[inline(always)]`
attribute. Typically LLVM *won't* inline functions across target-feature sets.
For example, if a normal function calls a function that enables AVX2, then the
callee, no matter how small, won't be inlined into the caller. This is done for
correctness (register preservation and all that), but it is also how these
codegen errors are prevented in practice.
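As a rough sketch of that boundary (assuming a current x86_64 toolchain; `sum_avx2` and `sum` are hypothetical functions, not part of this crate): a plain function can call a `#[target_feature]` function behind runtime detection, and LLVM will keep it as a real call rather than inlining across the feature boundary.

```rust
// Hypothetical example: a #[target_feature] callee invoked from a plain
// function. LLVM won't inline `sum_avx2` into `sum` because the caller
// doesn't enable AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[i32]) -> i32 {
    // Codegen'd with AVX2 enabled; only sound to call when AVX2 is present.
    xs.iter().sum()
}

fn sum(xs: &[i32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Unsafe because we must guarantee AVX2 is actually available.
            return unsafe { sum_avx2(xs) };
        }
    }
    xs.iter().sum()
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
}
```

The runtime check is what keeps the call sound even though the caller itself is compiled without AVX2.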

In stdsimd, however, we are currently tagging every function with "always
inline this, no matter what". That apparently bypasses LLVM's "is it even
legal to inline this?" check. In turn we start inlining things like AVX
intrinsics into functions that can't actually call AVX intrinsics, creating
codegen errors at compile time.

So with all that motivation, this commit switches these functions to the
normal inline hint, just `#[inline]`, instead of `#[inline(always)]`. For the
stdsimd crate it is absolutely critical that all functions are inlined for
good performance, but using `#[inline]` shouldn't hamper that!

The compiler will recognize the `#[inline]` attribute and make sure that each
of these functions is a *candidate* for inlining into any and all downstream
codegen units (without `#[inline]`, LLVM usually wouldn't even see the
definition to inline). After that, though, we're relying on LLVM to inline
these functions naturally rather than forcing it to do so.
Typically, however, these intrinsics are one-liners and are trivially
inlineable, so I'd imagine that LLVM will go ahead and inline everything all
over the place.
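As a rough illustration (plain functions, nothing crate-specific, names made up for this sketch): both attributes leave behavior unchanged; only the inlining strategy differs.

```rust
// Illustration only: `#[inline]` exposes the definition to other codegen
// units as an inlining *candidate*; `#[inline(always)]` asks LLVM to force
// the inline. Semantics are identical either way.
#[inline]
pub fn wrapping_add_hint(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

#[inline(always)]
pub fn wrapping_add_forced(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

fn main() {
    // Results are identical; only the generated code may differ.
    assert_eq!(wrapping_add_hint(u32::MAX, 1), 0);
    assert_eq!(wrapping_add_forced(2, 3), 5);
}
```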

All in all, this change is brought about by #253, which noticed various
codegen errors. I originally thought they were due to ABI issues, but that
turned out to be wrong (although there was an ABI bug too, which has since
been fixed). In any case, after this change I was able to get the example in
#253 to execute in both release and debug mode.

Closes #253

[avx-error]: https://play.rust-lang.org/?gist=50cb08f1e2242e22109a6d69318bd112&version=nightly

* Add inline(always) on eflags intrinsics

Their ABI actually relies on it!

* Leave #[inline(always)] on portable types

They're causing test failures on ARM, let's investigate later.
This commit is contained in:
Alex Crichton 2018-01-28 23:40:39 -06:00 committed by GitHub
parent a2403de290
commit 82acb0c953
41 changed files with 1046 additions and 1046 deletions


@ -8,7 +8,7 @@ use simd_llvm::simd_add;
use v128::f64x2;
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(fadd))]
pub unsafe fn vadd_f64(a: f64, b: f64) -> f64 {
@ -16,7 +16,7 @@ pub unsafe fn vadd_f64(a: f64, b: f64) -> f64 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(fadd))]
pub unsafe fn vaddq_f64(a: f64x2, b: f64x2) -> f64x2 {
@ -24,7 +24,7 @@ pub unsafe fn vaddq_f64(a: f64x2, b: f64x2) -> f64x2 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddd_s64(a: i64, b: i64) -> i64 {
@ -32,7 +32,7 @@ pub unsafe fn vaddd_s64(a: i64, b: i64) -> i64 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddd_u64(a: u64, b: u64) -> u64 {


@ -9,14 +9,14 @@
use stdsimd_test::assert_instr;
/// Reverse the order of the bytes.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rev))]
pub unsafe fn _rev_u64(x: u64) -> u64 {
x.swap_bytes() as u64
}
/// Count Leading Zeros.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(clz))]
pub unsafe fn _clz_u64(x: u64) -> u64 {
x.leading_zeros() as u64
@ -29,7 +29,7 @@ extern "C" {
}
/// Reverse the bit order.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rbit))]
pub unsafe fn _rbit_u64(x: u64) -> u64 {
rbit_u64(x as i64) as u64
@ -39,7 +39,7 @@ pub unsafe fn _rbit_u64(x: u64) -> u64 {
///
/// When all bits of the operand are set it returns the size of the operand in
/// bits.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(cls))]
pub unsafe fn _cls_u32(x: u32) -> u32 {
u32::leading_zeros((((((x as i32) >> 31) as u32) ^ x) << 1) | 1) as u32
@ -49,7 +49,7 @@ pub unsafe fn _cls_u32(x: u32) -> u32 {
///
/// When all bits of the operand are set it returns the size of the operand in
/// bits.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(cls))]
pub unsafe fn _cls_u64(x: u64) -> u64 {
u64::leading_zeros((((((x as i64) >> 63) as u64) ^ x) << 1) | 1) as u64


@ -9,7 +9,7 @@ use v64::{f32x2, i16x4, i32x2, i8x8, u16x4, u32x2, u8x8};
use v128::{f32x4, i16x8, i32x4, i64x2, i8x16, u16x8, u32x4, u64x2, u8x16};
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vadd_s8(a: i8x8, b: i8x8) -> i8x8 {
@ -17,7 +17,7 @@ pub unsafe fn vadd_s8(a: i8x8, b: i8x8) -> i8x8 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_s8(a: i8x16, b: i8x16) -> i8x16 {
@ -25,7 +25,7 @@ pub unsafe fn vaddq_s8(a: i8x16, b: i8x16) -> i8x16 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vadd_s16(a: i16x4, b: i16x4) -> i16x4 {
@ -33,7 +33,7 @@ pub unsafe fn vadd_s16(a: i16x4, b: i16x4) -> i16x4 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_s16(a: i16x8, b: i16x8) -> i16x8 {
@ -41,7 +41,7 @@ pub unsafe fn vaddq_s16(a: i16x8, b: i16x8) -> i16x8 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vadd_s32(a: i32x2, b: i32x2) -> i32x2 {
@ -49,7 +49,7 @@ pub unsafe fn vadd_s32(a: i32x2, b: i32x2) -> i32x2 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_s32(a: i32x4, b: i32x4) -> i32x4 {
@ -57,7 +57,7 @@ pub unsafe fn vaddq_s32(a: i32x4, b: i32x4) -> i32x4 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_s64(a: i64x2, b: i64x2) -> i64x2 {
@ -65,7 +65,7 @@ pub unsafe fn vaddq_s64(a: i64x2, b: i64x2) -> i64x2 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vadd_u8(a: u8x8, b: u8x8) -> u8x8 {
@ -73,7 +73,7 @@ pub unsafe fn vadd_u8(a: u8x8, b: u8x8) -> u8x8 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_u8(a: u8x16, b: u8x16) -> u8x16 {
@ -81,7 +81,7 @@ pub unsafe fn vaddq_u8(a: u8x16, b: u8x16) -> u8x16 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vadd_u16(a: u16x4, b: u16x4) -> u16x4 {
@ -89,7 +89,7 @@ pub unsafe fn vadd_u16(a: u16x4, b: u16x4) -> u16x4 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_u16(a: u16x8, b: u16x8) -> u16x8 {
@ -97,7 +97,7 @@ pub unsafe fn vaddq_u16(a: u16x8, b: u16x8) -> u16x8 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vadd_u32(a: u32x2, b: u32x2) -> u32x2 {
@ -105,7 +105,7 @@ pub unsafe fn vadd_u32(a: u32x2, b: u32x2) -> u32x2 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_u32(a: u32x4, b: u32x4) -> u32x4 {
@ -113,7 +113,7 @@ pub unsafe fn vaddq_u32(a: u32x4, b: u32x4) -> u32x4 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(add))]
pub unsafe fn vaddq_u64(a: u64x2, b: u64x2) -> u64x2 {
@ -121,7 +121,7 @@ pub unsafe fn vaddq_u64(a: u64x2, b: u64x2) -> u64x2 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(fadd))]
pub unsafe fn vadd_f32(a: f32x2, b: f32x2) -> f32x2 {
@ -129,7 +129,7 @@ pub unsafe fn vadd_f32(a: f32x2, b: f32x2) -> f32x2 {
}
/// Vector add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(fadd))]
pub unsafe fn vaddq_f32(a: f32x4, b: f32x4) -> f32x4 {
@ -137,7 +137,7 @@ pub unsafe fn vaddq_f32(a: f32x4, b: f32x4) -> f32x4 {
}
/// Vector long add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(saddl))]
pub unsafe fn vaddl_s8(a: i8x8, b: i8x8) -> i16x8 {
@ -147,7 +147,7 @@ pub unsafe fn vaddl_s8(a: i8x8, b: i8x8) -> i16x8 {
}
/// Vector long add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(saddl))]
pub unsafe fn vaddl_s16(a: i16x4, b: i16x4) -> i32x4 {
@ -157,7 +157,7 @@ pub unsafe fn vaddl_s16(a: i16x4, b: i16x4) -> i32x4 {
}
/// Vector long add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(saddl))]
pub unsafe fn vaddl_s32(a: i32x2, b: i32x2) -> i64x2 {
@ -167,7 +167,7 @@ pub unsafe fn vaddl_s32(a: i32x2, b: i32x2) -> i64x2 {
}
/// Vector long add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(uaddl))]
pub unsafe fn vaddl_u8(a: u8x8, b: u8x8) -> u16x8 {
@ -177,7 +177,7 @@ pub unsafe fn vaddl_u8(a: u8x8, b: u8x8) -> u16x8 {
}
/// Vector long add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(uaddl))]
pub unsafe fn vaddl_u16(a: u16x4, b: u16x4) -> u32x4 {
@ -187,7 +187,7 @@ pub unsafe fn vaddl_u16(a: u16x4, b: u16x4) -> u32x4 {
}
/// Vector long add.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(uaddl))]
pub unsafe fn vaddl_u32(a: u32x2, b: u32x2) -> u64x2 {
@ -205,7 +205,7 @@ extern "C" {
}
/// Reciprocal square-root estimate.
#[inline(always)]
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(frsqrte))]
pub unsafe fn vrsqrte_f32(a: f32x2) -> f32x2 {


@ -10,14 +10,14 @@
use stdsimd_test::assert_instr;
/// Reverse the order of the bytes.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rev))]
pub unsafe fn _rev_u16(x: u16) -> u16 {
x.swap_bytes() as u16
}
/// Reverse the order of the bytes.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rev))]
pub unsafe fn _rev_u32(x: u32) -> u32 {
x.swap_bytes() as u32


@ -13,28 +13,28 @@ pub use super::v6::*;
use stdsimd_test::assert_instr;
/// Count Leading Zeros.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(clz))]
pub unsafe fn _clz_u8(x: u8) -> u8 {
x.leading_zeros() as u8
}
/// Count Leading Zeros.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(clz))]
pub unsafe fn _clz_u16(x: u16) -> u16 {
x.leading_zeros() as u16
}
/// Count Leading Zeros.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(clz))]
pub unsafe fn _clz_u32(x: u32) -> u32 {
x.leading_zeros() as u32
}
/// Reverse the bit order.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rbit))]
#[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))]
#[cfg(dont_compile_me)] // FIXME need to add `v7` upstream in rustc


@ -42,79 +42,79 @@ extern "C" {
}
/// Synchronizes all threads in the block.
#[inline(always)]
#[inline]
pub unsafe fn _syncthreads() -> () {
syncthreads()
}
/// x-th thread-block dimension.
#[inline(always)]
#[inline]
pub unsafe fn _block_dim_x() -> i32 {
block_dim_x()
}
/// y-th thread-block dimension.
#[inline(always)]
#[inline]
pub unsafe fn _block_dim_y() -> i32 {
block_dim_y()
}
/// z-th thread-block dimension.
#[inline(always)]
#[inline]
pub unsafe fn _block_dim_z() -> i32 {
block_dim_z()
}
/// x-th thread-block index.
#[inline(always)]
#[inline]
pub unsafe fn _block_idx_x() -> i32 {
block_idx_x()
}
/// y-th thread-block index.
#[inline(always)]
#[inline]
pub unsafe fn _block_idx_y() -> i32 {
block_idx_y()
}
/// z-th thread-block index.
#[inline(always)]
#[inline]
pub unsafe fn _block_idx_z() -> i32 {
block_idx_z()
}
/// x-th block-grid dimension.
#[inline(always)]
#[inline]
pub unsafe fn _grid_dim_x() -> i32 {
grid_dim_x()
}
/// y-th block-grid dimension.
#[inline(always)]
#[inline]
pub unsafe fn _grid_dim_y() -> i32 {
grid_dim_y()
}
/// z-th block-grid dimension.
#[inline(always)]
#[inline]
pub unsafe fn _grid_dim_z() -> i32 {
grid_dim_z()
}
/// x-th thread index.
#[inline(always)]
#[inline]
pub unsafe fn _thread_idx_x() -> i32 {
thread_idx_x()
}
/// y-th thread index.
#[inline(always)]
#[inline]
pub unsafe fn _thread_idx_y() -> i32 {
thread_idx_y()
}
/// z-th thread index.
#[inline(always)]
#[inline]
pub unsafe fn _thread_idx_z() -> i32 {
thread_idx_z()
}


@ -21,7 +21,7 @@ extern "C" {
///
/// [fxsave]: http://www.felixcloutier.com/x86/FXSAVE.html
/// [fxrstor]: http://www.felixcloutier.com/x86/FXRSTOR.html
#[inline(always)]
#[inline]
#[target_feature(enable = "fxsr")]
#[cfg_attr(test, assert_instr(fxsave))]
pub unsafe fn _fxsave(mem_addr: *mut u8) {
@ -42,7 +42,7 @@ pub unsafe fn _fxsave(mem_addr: *mut u8) {
///
/// [fxsave]: http://www.felixcloutier.com/x86/FXSAVE.html
/// [fxrstor]: http://www.felixcloutier.com/x86/FXRSTOR.html
#[inline(always)]
#[inline]
#[target_feature(enable = "fxsr")]
#[cfg_attr(test, assert_instr(fxrstor))]
pub unsafe fn _fxrstor(mem_addr: *const u8) {


@ -23,7 +23,7 @@ use stdsimd_test::assert_instr;
/// Counts the leading most significant zero bits.
///
/// When the operand is zero, it returns its size in bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "lzcnt")]
#[cfg_attr(test, assert_instr(lzcnt))]
pub unsafe fn _lzcnt_u32(x: u32) -> u32 {
@ -31,7 +31,7 @@ pub unsafe fn _lzcnt_u32(x: u32) -> u32 {
}
/// Counts the bits that are set.
#[inline(always)]
#[inline]
#[target_feature(enable = "popcnt")]
#[cfg_attr(test, assert_instr(popcnt))]
pub unsafe fn _popcnt32(x: i32) -> i32 {

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -14,7 +14,7 @@ use stdsimd_test::assert_instr;
/// Extracts bits in range [`start`, `start` + `length`) from `a` into
/// the least significant bits of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(bextr))]
pub unsafe fn _bextr_u32(a: u32, start: u32, len: u32) -> u32 {
@ -26,7 +26,7 @@ pub unsafe fn _bextr_u32(a: u32, start: u32, len: u32) -> u32 {
///
/// Bits [7,0] of `control` specify the index to the first bit in the range to
/// be extracted, and bits [15,8] specify the length of the range.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(bextr))]
pub unsafe fn _bextr2_u32(a: u32, control: u32) -> u32 {
@ -34,7 +34,7 @@ pub unsafe fn _bextr2_u32(a: u32, control: u32) -> u32 {
}
/// Bitwise logical `AND` of inverted `a` with `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(andn))]
pub unsafe fn _andn_u32(a: u32, b: u32) -> u32 {
@ -42,7 +42,7 @@ pub unsafe fn _andn_u32(a: u32, b: u32) -> u32 {
}
/// Extract lowest set isolated bit.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(blsi))]
pub unsafe fn _blsi_u32(x: u32) -> u32 {
@ -50,7 +50,7 @@ pub unsafe fn _blsi_u32(x: u32) -> u32 {
}
/// Get mask up to lowest set bit.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(blsmsk))]
pub unsafe fn _blsmsk_u32(x: u32) -> u32 {
@ -60,7 +60,7 @@ pub unsafe fn _blsmsk_u32(x: u32) -> u32 {
/// Resets the lowest set bit of `x`.
///
/// If `x` is sets CF.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(blsr))]
pub unsafe fn _blsr_u32(x: u32) -> u32 {
@ -70,7 +70,7 @@ pub unsafe fn _blsr_u32(x: u32) -> u32 {
/// Counts the number of trailing least significant zero bits.
///
/// When the source operand is 0, it returns its size in bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(tzcnt))]
pub unsafe fn _tzcnt_u32(x: u32) -> u32 {
@ -80,7 +80,7 @@ pub unsafe fn _tzcnt_u32(x: u32) -> u32 {
/// Counts the number of trailing least significant zero bits.
///
/// When the source operand is 0, it returns its size in bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(tzcnt))]
pub unsafe fn _mm_tzcnt_32(x: u32) -> i32 {


@ -17,7 +17,7 @@ use stdsimd_test::assert_instr;
///
/// Unsigned multiplication of `a` with `b` returning a pair `(lo, hi)` with
/// the low half and the high half of the result.
#[inline(always)]
#[inline]
// LLVM BUG (should be mulxl): https://bugs.llvm.org/show_bug.cgi?id=34232
#[cfg_attr(all(test, target_arch = "x86_64"), assert_instr(imul))]
#[cfg_attr(all(test, target_arch = "x86"), assert_instr(mulx))]
@ -29,7 +29,7 @@ pub unsafe fn _mulx_u32(a: u32, b: u32, hi: &mut u32) -> u32 {
}
/// Zero higher bits of `a` >= `index`.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi2")]
#[cfg_attr(test, assert_instr(bzhi))]
pub unsafe fn _bzhi_u32(a: u32, index: u32) -> u32 {
@ -38,7 +38,7 @@ pub unsafe fn _bzhi_u32(a: u32, index: u32) -> u32 {
/// Scatter contiguous low order bits of `a` to the result at the positions
/// specified by the `mask`.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi2")]
#[cfg_attr(test, assert_instr(pdep))]
pub unsafe fn _pdep_u32(a: u32, mask: u32) -> u32 {
@ -47,7 +47,7 @@ pub unsafe fn _pdep_u32(a: u32, mask: u32) -> u32 {
/// Gathers the bits of `x` specified by the `mask` into the contiguous low
/// order bit positions of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi2")]
#[cfg_attr(test, assert_instr(pext))]
pub unsafe fn _pext_u32(a: u32, mask: u32) -> u32 {


@ -6,14 +6,14 @@
use stdsimd_test::assert_instr;
/// Return an integer with the reversed byte order of x
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(bswap))]
pub unsafe fn _bswap(x: i32) -> i32 {
bswap_i32(x)
}
/// Return an integer with the reversed byte order of x
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(bswap))]
pub unsafe fn _bswap64(x: i64) -> i64 {
bswap_i64(x)


@ -42,7 +42,7 @@ pub struct CpuidResult {
/// [wiki_cpuid]: https://en.wikipedia.org/wiki/CPUID
/// [intel64_ref]: http://www.intel.de/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
/// [amd64_ref]: http://support.amd.com/TechDocs/24594.pdf
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(cpuid))]
pub unsafe fn __cpuid_count(leaf: u32, sub_leaf: u32) -> CpuidResult {
let mut r = ::core::mem::uninitialized::<CpuidResult>();
@ -62,14 +62,14 @@ pub unsafe fn __cpuid_count(leaf: u32, sub_leaf: u32) -> CpuidResult {
}
/// See [`__cpuid_count`](fn.__cpuid_count.html).
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(cpuid))]
pub unsafe fn __cpuid(leaf: u32) -> CpuidResult {
__cpuid_count(leaf, 0)
}
/// Does the host support the `cpuid` instruction?
#[inline(always)]
#[inline]
pub fn has_cpuid() -> bool {
#[cfg(target_arch = "x86_64")]
{
@ -111,7 +111,7 @@ pub fn has_cpuid() -> bool {
///
/// See also [`__cpuid`](fn.__cpuid.html) and
/// [`__cpuid_count`](fn.__cpuid_count.html).
#[inline(always)]
#[inline]
pub unsafe fn __get_cpuid_max(leaf: u32) -> (u32, u32) {
let CpuidResult { eax, ebx, .. } = __cpuid(leaf);
(eax, ebx)


@ -15,7 +15,7 @@ use stdsimd_test::assert_instr;
///
/// On processors that support the Intel 64 architecture, the
/// high-order 32 bits of each of RAX and RDX are cleared.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rdtsc))]
pub unsafe fn _rdtsc() -> u64 {
rdtsc()
@ -35,7 +35,7 @@ pub unsafe fn _rdtsc() -> u64 {
///
/// On processors that support the Intel 64 architecture, the
/// high-order 32 bits of each of RAX, RDX, and RCX are cleared.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(rdtscp))]
pub unsafe fn _rdtscp(aux: *mut u32) -> u64 {
rdtscp(aux as *mut _)


@ -13,7 +13,7 @@ use stdsimd_test::assert_instr;
/// Adds the first component of `a` and `b`, the other components are copied
/// from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(addss))]
pub unsafe fn _mm_add_ss(a: __m128, b: __m128) -> __m128 {
@ -21,7 +21,7 @@ pub unsafe fn _mm_add_ss(a: __m128, b: __m128) -> __m128 {
}
/// Adds __m128 vectors.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(addps))]
pub unsafe fn _mm_add_ps(a: __m128, b: __m128) -> __m128 {
@ -30,7 +30,7 @@ pub unsafe fn _mm_add_ps(a: __m128, b: __m128) -> __m128 {
/// Subtracts the first component of `b` from `a`, the other components are
/// copied from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(subss))]
pub unsafe fn _mm_sub_ss(a: __m128, b: __m128) -> __m128 {
@ -38,7 +38,7 @@ pub unsafe fn _mm_sub_ss(a: __m128, b: __m128) -> __m128 {
}
/// Subtracts __m128 vectors.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(subps))]
pub unsafe fn _mm_sub_ps(a: __m128, b: __m128) -> __m128 {
@ -47,7 +47,7 @@ pub unsafe fn _mm_sub_ps(a: __m128, b: __m128) -> __m128 {
/// Multiplies the first component of `a` and `b`, the other components are
/// copied from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(mulss))]
pub unsafe fn _mm_mul_ss(a: __m128, b: __m128) -> __m128 {
@ -55,7 +55,7 @@ pub unsafe fn _mm_mul_ss(a: __m128, b: __m128) -> __m128 {
}
/// Multiplies __m128 vectors.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(mulps))]
pub unsafe fn _mm_mul_ps(a: __m128, b: __m128) -> __m128 {
@ -64,7 +64,7 @@ pub unsafe fn _mm_mul_ps(a: __m128, b: __m128) -> __m128 {
/// Divides the first component of `b` by `a`, the other components are
/// copied from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(divss))]
pub unsafe fn _mm_div_ss(a: __m128, b: __m128) -> __m128 {
@ -72,7 +72,7 @@ pub unsafe fn _mm_div_ss(a: __m128, b: __m128) -> __m128 {
}
/// Divides __m128 vectors.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(divps))]
pub unsafe fn _mm_div_ps(a: __m128, b: __m128) -> __m128 {
@ -81,7 +81,7 @@ pub unsafe fn _mm_div_ps(a: __m128, b: __m128) -> __m128 {
/// Return the square root of the first single-precision (32-bit)
/// floating-point element in `a`, the other elements are unchanged.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(sqrtss))]
pub unsafe fn _mm_sqrt_ss(a: __m128) -> __m128 {
@ -90,7 +90,7 @@ pub unsafe fn _mm_sqrt_ss(a: __m128) -> __m128 {
/// Return the square root of packed single-precision (32-bit) floating-point
/// elements in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(sqrtps))]
pub unsafe fn _mm_sqrt_ps(a: __m128) -> __m128 {
@ -99,7 +99,7 @@ pub unsafe fn _mm_sqrt_ps(a: __m128) -> __m128 {
/// Return the approximate reciprocal of the first single-precision
/// (32-bit) floating-point element in `a`, the other elements are unchanged.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(rcpss))]
pub unsafe fn _mm_rcp_ss(a: __m128) -> __m128 {
@ -108,7 +108,7 @@ pub unsafe fn _mm_rcp_ss(a: __m128) -> __m128 {
/// Return the approximate reciprocal of packed single-precision (32-bit)
/// floating-point elements in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(rcpps))]
pub unsafe fn _mm_rcp_ps(a: __m128) -> __m128 {
@ -117,7 +117,7 @@ pub unsafe fn _mm_rcp_ps(a: __m128) -> __m128 {
/// Return the approximate reciprocal square root of the fist single-precision
/// (32-bit) floating-point elements in `a`, the other elements are unchanged.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(rsqrtss))]
pub unsafe fn _mm_rsqrt_ss(a: __m128) -> __m128 {
@ -126,7 +126,7 @@ pub unsafe fn _mm_rsqrt_ss(a: __m128) -> __m128 {
/// Return the approximate reciprocal square root of packed single-precision
/// (32-bit) floating-point elements in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(rsqrtps))]
pub unsafe fn _mm_rsqrt_ps(a: __m128) -> __m128 {
@ -136,7 +136,7 @@ pub unsafe fn _mm_rsqrt_ps(a: __m128) -> __m128 {
/// Compare the first single-precision (32-bit) floating-point element of `a`
/// and `b`, and return the minimum value in the first element of the return
/// value, the other elements are copied from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(minss))]
pub unsafe fn _mm_min_ss(a: __m128, b: __m128) -> __m128 {
@ -145,7 +145,7 @@ pub unsafe fn _mm_min_ss(a: __m128, b: __m128) -> __m128 {
/// Compare packed single-precision (32-bit) floating-point elements in `a` and
/// `b`, and return the corresponding minimum values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(minps))]
pub unsafe fn _mm_min_ps(a: __m128, b: __m128) -> __m128 {
@ -155,7 +155,7 @@ pub unsafe fn _mm_min_ps(a: __m128, b: __m128) -> __m128 {
/// Compare the first single-precision (32-bit) floating-point element of `a`
/// and `b`, and return the maximum value in the first element of the return
/// value, the other elements are copied from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(maxss))]
pub unsafe fn _mm_max_ss(a: __m128, b: __m128) -> __m128 {
@ -164,7 +164,7 @@ pub unsafe fn _mm_max_ss(a: __m128, b: __m128) -> __m128 {
/// Compare packed single-precision (32-bit) floating-point elements in `a` and
/// `b`, and return the corresponding maximum values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(maxps))]
pub unsafe fn _mm_max_ps(a: __m128, b: __m128) -> __m128 {
@ -172,7 +172,7 @@ pub unsafe fn _mm_max_ps(a: __m128, b: __m128) -> __m128 {
}
/// Bitwise AND of packed single-precision (32-bit) floating-point elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// i586 only seems to generate plain `and` instructions, so ignore it.
#[cfg_attr(all(test, any(target_arch = "x86_64", target_feature = "sse2")),
@ -187,7 +187,7 @@ pub unsafe fn _mm_and_ps(a: __m128, b: __m128) -> __m128 {
/// elements.
///
/// Computes `!a & b` for each bit in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// i586 only seems to generate plain `not` and `and` instructions, so ignore
// it.
@ -201,7 +201,7 @@ pub unsafe fn _mm_andnot_ps(a: __m128, b: __m128) -> __m128 {
}
/// Bitwise OR of packed single-precision (32-bit) floating-point elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// i586 only seems to generate plain `or` instructions, so we ignore it.
#[cfg_attr(all(test, any(target_arch = "x86_64", target_feature = "sse2")),
@ -214,7 +214,7 @@ pub unsafe fn _mm_or_ps(a: __m128, b: __m128) -> __m128 {
/// Bitwise exclusive OR of packed single-precision (32-bit) floating-point
/// elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// i586 only seems to generate plain `xor` instructions, so we ignore it.
#[cfg_attr(all(test, any(target_arch = "x86_64", target_feature = "sse2")),
@ -228,7 +228,7 @@ pub unsafe fn _mm_xor_ps(a: __m128, b: __m128) -> __m128 {
/// Compare the lowest `f32` of both inputs for equality. The lowest 32 bits of
/// the result will be `0xffffffff` if the two inputs are equal, or `0`
/// otherwise. The upper 96 bits of the result are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpeqss))]
pub unsafe fn _mm_cmpeq_ss(a: __m128, b: __m128) -> __m128 {
@ -239,7 +239,7 @@ pub unsafe fn _mm_cmpeq_ss(a: __m128, b: __m128) -> __m128 {
/// of the result will be `0xffffffff` if `a.extract(0)` is less than
/// `b.extract(0)`, or `0` otherwise. The upper 96 bits of the result are the
/// upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpltss))]
pub unsafe fn _mm_cmplt_ss(a: __m128, b: __m128) -> __m128 {
@ -250,7 +250,7 @@ pub unsafe fn _mm_cmplt_ss(a: __m128, b: __m128) -> __m128 {
/// 32 bits of the result will be `0xffffffff` if `a.extract(0)` is less than
/// or equal `b.extract(0)`, or `0` otherwise. The upper 96 bits of the result
/// are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpless))]
pub unsafe fn _mm_cmple_ss(a: __m128, b: __m128) -> __m128 {
@ -261,7 +261,7 @@ pub unsafe fn _mm_cmple_ss(a: __m128, b: __m128) -> __m128 {
/// bits of the result will be `0xffffffff` if `a.extract(0)` is greater
/// than `b.extract(0)`, or `0` otherwise. The upper 96 bits of the result
/// are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpltss))]
pub unsafe fn _mm_cmpgt_ss(a: __m128, b: __m128) -> __m128 {
@ -272,7 +272,7 @@ pub unsafe fn _mm_cmpgt_ss(a: __m128, b: __m128) -> __m128 {
/// lowest 32 bits of the result will be `0xffffffff` if `a.extract(0)` is
/// greater than or equal `b.extract(0)`, or `0` otherwise. The upper 96 bits
/// of the result are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpless))]
pub unsafe fn _mm_cmpge_ss(a: __m128, b: __m128) -> __m128 {
@ -283,7 +283,7 @@ pub unsafe fn _mm_cmpge_ss(a: __m128, b: __m128) -> __m128 {
/// of the result will be `0xffffffff` if `a.extract(0)` is not equal to
/// `b.extract(0)`, or `0` otherwise. The upper 96 bits of the result are the
/// upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpneqss))]
pub unsafe fn _mm_cmpneq_ss(a: __m128, b: __m128) -> __m128 {
@ -294,7 +294,7 @@ pub unsafe fn _mm_cmpneq_ss(a: __m128, b: __m128) -> __m128 {
/// bits of the result will be `0xffffffff` if `a.extract(0)` is not less than
/// `b.extract(0)`, or `0` otherwise. The upper 96 bits of the result are the
/// upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnltss))]
pub unsafe fn _mm_cmpnlt_ss(a: __m128, b: __m128) -> __m128 {
@ -305,7 +305,7 @@ pub unsafe fn _mm_cmpnlt_ss(a: __m128, b: __m128) -> __m128 {
/// lowest 32 bits of the result will be `0xffffffff` if `a.extract(0)` is not
/// less than or equal to `b.extract(0)`, or `0` otherwise. The upper 96 bits
/// of the result are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnless))]
pub unsafe fn _mm_cmpnle_ss(a: __m128, b: __m128) -> __m128 {
@ -316,7 +316,7 @@ pub unsafe fn _mm_cmpnle_ss(a: __m128, b: __m128) -> __m128 {
/// bits of the result will be `0xffffffff` if `a.extract(0)` is not greater
/// than `b.extract(0)`, or `0` otherwise. The upper 96 bits of the result are
/// the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnltss))]
pub unsafe fn _mm_cmpngt_ss(a: __m128, b: __m128) -> __m128 {
@ -327,7 +327,7 @@ pub unsafe fn _mm_cmpngt_ss(a: __m128, b: __m128) -> __m128 {
/// lowest 32 bits of the result will be `0xffffffff` if `a.extract(0)` is not
/// greater than or equal to `b.extract(0)`, or `0` otherwise. The upper 96
/// bits of the result are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnless))]
pub unsafe fn _mm_cmpnge_ss(a: __m128, b: __m128) -> __m128 {
@ -338,7 +338,7 @@ pub unsafe fn _mm_cmpnge_ss(a: __m128, b: __m128) -> __m128 {
/// the result will be `0xffffffff` if neither of `a.extract(0)` or
/// `b.extract(0)` is a NaN, or `0` otherwise. The upper 96 bits of the result
/// are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpordss))]
pub unsafe fn _mm_cmpord_ss(a: __m128, b: __m128) -> __m128 {
@ -349,7 +349,7 @@ pub unsafe fn _mm_cmpord_ss(a: __m128, b: __m128) -> __m128 {
/// of the result will be `0xffffffff` if any of `a.extract(0)` or
/// `b.extract(0)` is a NaN, or `0` otherwise. The upper 96 bits of the result
/// are the upper 96 bits of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpunordss))]
pub unsafe fn _mm_cmpunord_ss(a: __m128, b: __m128) -> __m128 {
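All of the single-lane `_ss` compares above share one contract: lane 0 of the result is an all-ones/all-zeros mask and lanes 1-3 pass through from `a`. A minimal scalar sketch of that contract (hypothetical helper, not the real intrinsic):

```rust
// Hypothetical scalar model of the `_ss` compares, using `<` as the
// representative predicate: lane 0 becomes an all-ones/all-zeros mask,
// and the upper three lanes are taken from `a` unchanged.
fn cmplt_ss_model(a: [f32; 4], b: [f32; 4]) -> [u32; 4] {
    let mask = if a[0] < b[0] { 0xffff_ffff } else { 0 };
    [mask, a[1].to_bits(), a[2].to_bits(), a[3].to_bits()]
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 0.0, 0.0, 0.0];
    let r = cmplt_ss_model(a, b);
    assert_eq!(r[0], 0xffff_ffff); // 1.0 < 5.0
    assert_eq!(f32::from_bits(r[1]), 2.0); // upper lanes come from `a`
}
```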
@ -359,7 +359,7 @@ pub unsafe fn _mm_cmpunord_ss(a: __m128, b: __m128) -> __m128 {
/// Compare each of the four floats in `a` to the corresponding element in `b`.
/// The result in the output vector will be `0xffffffff` if the input elements
/// were equal, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpeqps))]
pub unsafe fn _mm_cmpeq_ps(a: __m128, b: __m128) -> __m128 {
@ -369,7 +369,7 @@ pub unsafe fn _mm_cmpeq_ps(a: __m128, b: __m128) -> __m128 {
/// Compare each of the four floats in `a` to the corresponding element in `b`.
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is less than the corresponding element in `b`, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpltps))]
pub unsafe fn _mm_cmplt_ps(a: __m128, b: __m128) -> __m128 {
@ -380,7 +380,7 @@ pub unsafe fn _mm_cmplt_ps(a: __m128, b: __m128) -> __m128 {
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is less than or equal to the corresponding element in `b`, or `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpleps))]
pub unsafe fn _mm_cmple_ps(a: __m128, b: __m128) -> __m128 {
@ -390,7 +390,7 @@ pub unsafe fn _mm_cmple_ps(a: __m128, b: __m128) -> __m128 {
/// Compare each of the four floats in `a` to the corresponding element in `b`.
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is greater than the corresponding element in `b`, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpltps))]
pub unsafe fn _mm_cmpgt_ps(a: __m128, b: __m128) -> __m128 {
@ -401,7 +401,7 @@ pub unsafe fn _mm_cmpgt_ps(a: __m128, b: __m128) -> __m128 {
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is greater than or equal to the corresponding element in `b`, or `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpleps))]
pub unsafe fn _mm_cmpge_ps(a: __m128, b: __m128) -> __m128 {
@ -411,7 +411,7 @@ pub unsafe fn _mm_cmpge_ps(a: __m128, b: __m128) -> __m128 {
/// Compare each of the four floats in `a` to the corresponding element in `b`.
/// The result in the output vector will be `0xffffffff` if the input elements
/// are *not* equal, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpneqps))]
pub unsafe fn _mm_cmpneq_ps(a: __m128, b: __m128) -> __m128 {
@ -422,7 +422,7 @@ pub unsafe fn _mm_cmpneq_ps(a: __m128, b: __m128) -> __m128 {
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is *not* less than the corresponding element in `b`, or `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnltps))]
pub unsafe fn _mm_cmpnlt_ps(a: __m128, b: __m128) -> __m128 {
@ -433,7 +433,7 @@ pub unsafe fn _mm_cmpnlt_ps(a: __m128, b: __m128) -> __m128 {
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is *not* less than or equal to the corresponding element in `b`, or
/// `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnleps))]
pub unsafe fn _mm_cmpnle_ps(a: __m128, b: __m128) -> __m128 {
@ -444,7 +444,7 @@ pub unsafe fn _mm_cmpnle_ps(a: __m128, b: __m128) -> __m128 {
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is *not* greater than the corresponding element in `b`, or `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnltps))]
pub unsafe fn _mm_cmpngt_ps(a: __m128, b: __m128) -> __m128 {
@ -455,7 +455,7 @@ pub unsafe fn _mm_cmpngt_ps(a: __m128, b: __m128) -> __m128 {
/// The result in the output vector will be `0xffffffff` if the input element
/// in `a` is *not* greater than or equal to the corresponding element in `b`,
/// or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpnleps))]
pub unsafe fn _mm_cmpnge_ps(a: __m128, b: __m128) -> __m128 {
@ -466,7 +466,7 @@ pub unsafe fn _mm_cmpnge_ps(a: __m128, b: __m128) -> __m128 {
/// Returns four floats that have one of two possible bit patterns. The element
/// in the output vector will be `0xffffffff` if the input elements in `a` and
/// `b` are ordered (i.e., neither of them is a NaN), or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpordps))]
pub unsafe fn _mm_cmpord_ps(a: __m128, b: __m128) -> __m128 {
@ -477,7 +477,7 @@ pub unsafe fn _mm_cmpord_ps(a: __m128, b: __m128) -> __m128 {
/// Returns four floats that have one of two possible bit patterns. The element
/// in the output vector will be `0xffffffff` if the input elements in `a` and
/// `b` are unordered (i.e., at least one of them is a NaN), or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cmpunordps))]
pub unsafe fn _mm_cmpunord_ps(a: __m128, b: __m128) -> __m128 {
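The packed NaN tests can be sketched in scalar code; the helper below is a hypothetical model of the `cmpordps` lane semantics, not the real intrinsic, and `cmpunordps` is its per-lane complement:

```rust
// Hypothetical scalar model of `_mm_cmpord_ps`: a lane is all-ones
// exactly when neither input lane is a NaN.
fn cmpord_ps_model(a: [f32; 4], b: [f32; 4]) -> [u32; 4] {
    let mut r = [0u32; 4];
    for i in 0..4 {
        if !a[i].is_nan() && !b[i].is_nan() {
            r[i] = 0xffff_ffff;
        }
    }
    r
}

fn main() {
    let a = [1.0, f32::NAN, 3.0, 4.0];
    let b = [1.0, 1.0, f32::NAN, 4.0];
    assert_eq!(cmpord_ps_model(a, b), [0xffff_ffff, 0, 0, 0xffff_ffff]);
}
```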
@ -486,7 +486,7 @@ pub unsafe fn _mm_cmpunord_ps(a: __m128, b: __m128) -> __m128 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if they are equal, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(comiss))]
pub unsafe fn _mm_comieq_ss(a: __m128, b: __m128) -> i32 {
@ -495,7 +495,7 @@ pub unsafe fn _mm_comieq_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if the value from `a` is less than the one from `b`, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(comiss))]
pub unsafe fn _mm_comilt_ss(a: __m128, b: __m128) -> i32 {
@ -505,7 +505,7 @@ pub unsafe fn _mm_comilt_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if the value from `a` is less than or equal to the one from `b`, or `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(comiss))]
pub unsafe fn _mm_comile_ss(a: __m128, b: __m128) -> i32 {
@ -515,7 +515,7 @@ pub unsafe fn _mm_comile_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if the value from `a` is greater than the one from `b`, or `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(comiss))]
pub unsafe fn _mm_comigt_ss(a: __m128, b: __m128) -> i32 {
@ -525,7 +525,7 @@ pub unsafe fn _mm_comigt_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if the value from `a` is greater than or equal to the one from `b`, or
/// `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(comiss))]
pub unsafe fn _mm_comige_ss(a: __m128, b: __m128) -> i32 {
@ -534,7 +534,7 @@ pub unsafe fn _mm_comige_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if they are *not* equal, or `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(comiss))]
pub unsafe fn _mm_comineq_ss(a: __m128, b: __m128) -> i32 {
@ -544,7 +544,7 @@ pub unsafe fn _mm_comineq_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if they are equal, or `0` otherwise. This instruction will not signal
/// an exception if either argument is a quiet NaN.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ucomiss))]
pub unsafe fn _mm_ucomieq_ss(a: __m128, b: __m128) -> i32 {
@ -555,7 +555,7 @@ pub unsafe fn _mm_ucomieq_ss(a: __m128, b: __m128) -> i32 {
/// `1` if the value from `a` is less than the one from `b`, or `0` otherwise.
/// This instruction will not signal an exception if either argument is a quiet
/// NaN.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ucomiss))]
pub unsafe fn _mm_ucomilt_ss(a: __m128, b: __m128) -> i32 {
@ -566,7 +566,7 @@ pub unsafe fn _mm_ucomilt_ss(a: __m128, b: __m128) -> i32 {
/// `1` if the value from `a` is less than or equal to the one from `b`, or `0`
/// otherwise. This instruction will not signal an exception if either argument
/// is a quiet NaN.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ucomiss))]
pub unsafe fn _mm_ucomile_ss(a: __m128, b: __m128) -> i32 {
@ -577,7 +577,7 @@ pub unsafe fn _mm_ucomile_ss(a: __m128, b: __m128) -> i32 {
/// `1` if the value from `a` is greater than the one from `b`, or `0`
/// otherwise. This instruction will not signal an exception if either argument
/// is a quiet NaN.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ucomiss))]
pub unsafe fn _mm_ucomigt_ss(a: __m128, b: __m128) -> i32 {
@ -588,7 +588,7 @@ pub unsafe fn _mm_ucomigt_ss(a: __m128, b: __m128) -> i32 {
/// `1` if the value from `a` is greater than or equal to the one from `b`, or
/// `0` otherwise. This instruction will not signal an exception if either
/// argument is a quiet NaN.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ucomiss))]
pub unsafe fn _mm_ucomige_ss(a: __m128, b: __m128) -> i32 {
@ -598,7 +598,7 @@ pub unsafe fn _mm_ucomige_ss(a: __m128, b: __m128) -> i32 {
/// Compare two 32-bit floats from the low-order bits of `a` and `b`. Returns
/// `1` if they are *not* equal, or `0` otherwise. This instruction will not
/// signal an exception if either argument is a quiet NaN.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ucomiss))]
pub unsafe fn _mm_ucomineq_ss(a: __m128, b: __m128) -> i32 {
@ -613,7 +613,7 @@ pub unsafe fn _mm_ucomineq_ss(a: __m128, b: __m128) -> i32 {
/// unmasked (see [`_mm_setcsr`](fn._mm_setcsr.html)).
///
/// This corresponds to the `CVTSS2SI` instruction (with 32 bit output).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvtss2si))]
pub unsafe fn _mm_cvtss_si32(a: __m128) -> i32 {
@ -621,7 +621,7 @@ pub unsafe fn _mm_cvtss_si32(a: __m128) -> i32 {
}
/// Alias for [`_mm_cvtss_si32`](fn._mm_cvtss_si32.html).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvtss2si))]
pub unsafe fn _mm_cvt_ss2si(a: __m128) -> i32 {
@ -638,7 +638,7 @@ pub unsafe fn _mm_cvt_ss2si(a: __m128) -> i32 {
/// exception if unmasked (see [`_mm_setcsr`](fn._mm_setcsr.html)).
///
/// This corresponds to the `CVTTSS2SI` instruction (with 32 bit output).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvttss2si))]
pub unsafe fn _mm_cvttss_si32(a: __m128) -> i32 {
@ -646,7 +646,7 @@ pub unsafe fn _mm_cvttss_si32(a: __m128) -> i32 {
}
/// Alias for [`_mm_cvttss_si32`](fn._mm_cvttss_si32.html).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvttss2si))]
pub unsafe fn _mm_cvtt_ss2si(a: __m128) -> i32 {
@ -654,7 +654,7 @@ pub unsafe fn _mm_cvtt_ss2si(a: __m128) -> i32 {
}
/// Extract the lowest 32 bit float from the input vector.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// No point in using assert_instrs. In Unix x86_64 calling convention this is a
// no-op, and on Windows it's just a `mov`.
@ -667,7 +667,7 @@ pub unsafe fn _mm_cvtss_f32(a: __m128) -> f32 {
///
/// This intrinsic corresponds to the `CVTSI2SS` instruction (with 32 bit
/// input).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvtsi2ss))]
pub unsafe fn _mm_cvtsi32_ss(a: __m128, b: i32) -> __m128 {
@ -675,7 +675,7 @@ pub unsafe fn _mm_cvtsi32_ss(a: __m128, b: i32) -> __m128 {
}
/// Alias for [`_mm_cvtsi32_ss`](fn._mm_cvtsi32_ss.html).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvtsi2ss))]
pub unsafe fn _mm_cvt_si2ss(a: __m128, b: i32) -> __m128 {
@ -684,7 +684,7 @@ pub unsafe fn _mm_cvt_si2ss(a: __m128, b: i32) -> __m128 {
/// Construct a `__m128` with the lowest element set to `a` and the rest set to
/// zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movss))]
pub unsafe fn _mm_set_ss(a: f32) -> __m128 {
@ -692,7 +692,7 @@ pub unsafe fn _mm_set_ss(a: f32) -> __m128 {
}
/// Construct a `__m128` with all elements set to `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(shufps))]
pub unsafe fn _mm_set1_ps(a: f32) -> __m128 {
@ -700,7 +700,7 @@ pub unsafe fn _mm_set1_ps(a: f32) -> __m128 {
}
/// Alias for [`_mm_set1_ps`](fn._mm_set1_ps.html)
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(shufps))]
pub unsafe fn _mm_set_ps1(a: f32) -> __m128 {
@ -724,7 +724,7 @@ pub unsafe fn _mm_set_ps1(a: f32) -> __m128 {
/// ```text
/// let v = _mm_set_ps(d, c, b, a);
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(unpcklps))]
pub unsafe fn _mm_set_ps(a: f32, b: f32, c: f32, d: f32) -> __m128 {
@ -739,7 +739,7 @@ pub unsafe fn _mm_set_ps(a: f32, b: f32, c: f32, d: f32) -> __m128 {
/// ```text
/// assert_eq!(__m128::new(a, b, c, d), _mm_setr_ps(a, b, c, d));
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(all(test, target_arch = "x86_64"), assert_instr(unpcklps))]
// On a 32-bit architecture it just copies the operands from the stack.
@ -749,7 +749,7 @@ pub unsafe fn _mm_setr_ps(a: f32, b: f32, c: f32, d: f32) -> __m128 {
}
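The ordering difference between `_mm_set_ps` and `_mm_setr_ps` is easy to get backwards; a scalar sketch of the lane contract (hypothetical model functions, assuming arrays indexed by lane number):

```rust
// Hypothetical scalar models of the argument ordering: `_mm_set_ps`
// places its first argument in the highest lane, while `_mm_setr_ps`
// stores its arguments in memory (lane) order.
fn set_ps_model(a: f32, b: f32, c: f32, d: f32) -> [f32; 4] {
    [d, c, b, a] // lane 0 gets `d`
}

fn setr_ps_model(a: f32, b: f32, c: f32, d: f32) -> [f32; 4] {
    [a, b, c, d] // lane 0 gets `a`
}

fn main() {
    // Reversing the arguments to one gives the other.
    assert_eq!(
        set_ps_model(4.0, 3.0, 2.0, 1.0),
        setr_ps_model(1.0, 2.0, 3.0, 4.0)
    );
}
```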
/// Construct a `__m128` with all elements initialized to zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(xorps))]
pub unsafe fn _mm_setzero_ps() -> __m128 {
@ -761,7 +761,7 @@ pub unsafe fn _mm_setzero_ps() -> __m128 {
///
/// The lower half of result takes values from `a` and the higher half from
/// `b`. The mask is split into four 2-bit controls, each indexing an element
/// from the inputs.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(shufps, mask = 3))]
pub unsafe fn _mm_shuffle_ps(a: __m128, b: __m128, mask: u32) -> __m128 {
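How the 8-bit mask decodes can be sketched in scalar code; this is a hypothetical model of the `shufps` selection rule (two bits per output lane, low half selecting from `a`, high half from `b`), not the real intrinsic:

```rust
// Hypothetical scalar model of the `_mm_shuffle_ps` mask decoding.
fn shuffle_ps_model(a: [f32; 4], b: [f32; 4], mask: u32) -> [f32; 4] {
    [
        a[(mask & 0b11) as usize],        // bits 0-1 pick lane 0 from `a`
        a[((mask >> 2) & 0b11) as usize], // bits 2-3 pick lane 1 from `a`
        b[((mask >> 4) & 0b11) as usize], // bits 4-5 pick lane 2 from `b`
        b[((mask >> 6) & 0b11) as usize], // bits 6-7 pick lane 3 from `b`
    ]
}

fn main() {
    let a = [10.0, 11.0, 12.0, 13.0];
    let b = [20.0, 21.0, 22.0, 23.0];
    // mask 0b00_01_10_11: lane0 <- a[3], lane1 <- a[2], lane2 <- b[1], lane3 <- b[0]
    assert_eq!(shuffle_ps_model(a, b, 0b00_01_10_11), [13.0, 12.0, 21.0, 20.0]);
}
```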
@ -812,7 +812,7 @@ pub unsafe fn _mm_shuffle_ps(a: __m128, b: __m128, mask: u32) -> __m128 {
/// Unpack and interleave single-precision (32-bit) floating-point elements
/// from the higher half of `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(unpckhps))]
pub unsafe fn _mm_unpackhi_ps(a: __m128, b: __m128) -> __m128 {
@ -821,7 +821,7 @@ pub unsafe fn _mm_unpackhi_ps(a: __m128, b: __m128) -> __m128 {
/// Unpack and interleave single-precision (32-bit) floating-point elements
/// from the lower half of `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(unpcklps))]
pub unsafe fn _mm_unpacklo_ps(a: __m128, b: __m128) -> __m128 {
@ -830,7 +830,7 @@ pub unsafe fn _mm_unpacklo_ps(a: __m128, b: __m128) -> __m128 {
/// Combine the higher halves of `a` and `b`. The higher half of `b` occupies
/// the lower half of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(all(test, not(windows)), assert_instr(movhlps))]
#[cfg_attr(all(test, windows), assert_instr(unpckhpd))]
@ -841,7 +841,7 @@ pub unsafe fn _mm_movehl_ps(a: __m128, b: __m128) -> __m128 {
/// Combine the lower halves of `a` and `b`. The lower half of `b` occupies
/// the higher half of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(all(test, target_feature = "sse2"), assert_instr(unpcklpd))]
#[cfg_attr(all(test, not(target_feature = "sse2")), assert_instr(movlhps))]
@ -853,7 +853,7 @@ pub unsafe fn _mm_movelh_ps(a: __m128, b: __m128) -> __m128 {
///
/// The mask is stored in the 4 least significant bits of the return value.
/// All other bits are set to `0`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movmskps))]
pub unsafe fn _mm_movemask_ps(a: __m128) -> i32 {
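The mask extraction can be sketched as gathering one sign bit per lane; a hypothetical scalar model of `movmskps` (not the real intrinsic):

```rust
// Hypothetical scalar model of `_mm_movemask_ps`: the sign bit of each
// lane lands in the corresponding low bit of the result.
fn movemask_ps_model(a: [f32; 4]) -> i32 {
    let mut m = 0;
    for i in 0..4 {
        m |= ((a[i].to_bits() >> 31) as i32) << i;
    }
    m
}

fn main() {
    // Lanes 0 and 2 are negative, so bits 0 and 2 are set.
    assert_eq!(movemask_ps_model([-1.0, 2.0, -3.0, 4.0]), 0b0101);
}
```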
@ -892,7 +892,7 @@ pub unsafe fn _mm_movemask_ps(a: __m128) -> i32 {
/// # }
/// # }
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// TODO: generates MOVHPD if the CPU supports SSE2.
// #[cfg_attr(test, assert_instr(movhps))]
@ -943,7 +943,7 @@ pub unsafe fn _mm_loadh_pi(a: __m128, p: *const __m64) -> __m128 {
/// # }
/// # }
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// TODO: generates MOVLPD if the CPU supports SSE2.
// #[cfg_attr(test, assert_instr(movlps))]
@ -966,7 +966,7 @@ pub unsafe fn _mm_loadl_pi(a: __m128, p: *const __m64) -> __m128 {
/// elements set to zero.
///
/// This corresponds to instructions `VMOVSS` / `MOVSS`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movss))]
pub unsafe fn _mm_load_ss(p: *const f32) -> __m128 {
@ -978,7 +978,7 @@ pub unsafe fn _mm_load_ss(p: *const f32) -> __m128 {
///
/// This corresponds to instructions `VMOVSS` / `MOVSS` followed by some
/// shuffling.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movss))]
pub unsafe fn _mm_load1_ps(p: *const f32) -> __m128 {
@ -987,7 +987,7 @@ pub unsafe fn _mm_load1_ps(p: *const f32) -> __m128 {
}
/// Alias for [`_mm_load1_ps`](fn._mm_load1_ps.html)
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movss))]
pub unsafe fn _mm_load_ps1(p: *const f32) -> __m128 {
@ -1002,7 +1002,7 @@ pub unsafe fn _mm_load_ps1(p: *const f32) -> __m128 {
/// memory.
///
/// This corresponds to instructions `VMOVAPS` / `MOVAPS`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movaps))]
pub unsafe fn _mm_load_ps(p: *const f32) -> __m128 {
@ -1016,7 +1016,7 @@ pub unsafe fn _mm_load_ps(p: *const f32) -> __m128 {
/// may be faster.
///
/// This corresponds to instructions `VMOVUPS` / `MOVUPS`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movups))]
pub unsafe fn _mm_loadu_ps(p: *const f32) -> __m128 {
@ -1049,7 +1049,7 @@ pub unsafe fn _mm_loadu_ps(p: *const f32) -> __m128 {
///
/// This corresponds to instructions `VMOVAPS` / `MOVAPS` followed by some
/// shuffling.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movaps))]
pub unsafe fn _mm_loadr_ps(p: *const f32) -> __m128 {
@ -1061,7 +1061,7 @@ pub unsafe fn _mm_loadr_ps(p: *const f32) -> __m128 {
///
/// This intrinsic corresponds to the `MOVHPS` instruction. The compiler may
/// choose to generate an equivalent sequence of other instructions.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// On i686 and up LLVM actually generates MOVHPD instead of MOVHPS; that's
// fine.
@ -1091,7 +1091,7 @@ pub unsafe fn _mm_storeh_pi(p: *mut __m64, a: __m128) {
///
/// This intrinsic corresponds to the `MOVQ` instruction. The compiler may
/// choose to generate an equivalent sequence of other instructions.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
// On i586 the codegen just generates plain MOVs. No need to test for that.
#[cfg_attr(all(test, any(target_arch = "x86_64", target_feature = "sse2"),
@ -1121,7 +1121,7 @@ pub unsafe fn _mm_storel_pi(p: *mut __m64, a: __m128) {
/// Store the lowest 32 bit float of `a` into memory.
///
/// This intrinsic corresponds to the `MOVSS` instruction.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movss))]
pub unsafe fn _mm_store_ss(p: *mut f32, a: __m128) {
@ -1144,7 +1144,7 @@ pub unsafe fn _mm_store_ss(p: *mut f32, a: __m128) {
/// *p.offset(2) = x;
/// *p.offset(3) = x;
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movaps))]
pub unsafe fn _mm_store1_ps(p: *mut f32, a: __m128) {
@ -1153,7 +1153,7 @@ pub unsafe fn _mm_store1_ps(p: *mut f32, a: __m128) {
}
/// Alias for [`_mm_store1_ps`](fn._mm_store1_ps.html)
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movaps))]
pub unsafe fn _mm_store_ps1(p: *mut f32, a: __m128) {
@ -1169,7 +1169,7 @@ pub unsafe fn _mm_store_ps1(p: *mut f32, a: __m128) {
/// memory.
///
/// This corresponds to instructions `VMOVAPS` / `MOVAPS`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movaps))]
pub unsafe fn _mm_store_ps(p: *mut f32, a: __m128) {
@ -1181,7 +1181,7 @@ pub unsafe fn _mm_store_ps(p: *mut f32, a: __m128) {
/// faster.
///
/// This corresponds to instructions `VMOVUPS` / `MOVUPS`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movups))]
pub unsafe fn _mm_storeu_ps(p: *mut f32, a: __m128) {
@ -1206,7 +1206,7 @@ pub unsafe fn _mm_storeu_ps(p: *mut f32, a: __m128) {
/// *p.offset(2) = a.extract(1);
/// *p.offset(3) = a.extract(0);
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movaps))]
pub unsafe fn _mm_storer_ps(p: *mut f32, a: __m128) {
@ -1221,7 +1221,7 @@ pub unsafe fn _mm_storer_ps(p: *mut f32, a: __m128) {
/// ```text
/// _mm_move_ss(a, b) == a.replace(0, b.extract(0))
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movss))]
pub unsafe fn _mm_move_ss(a: __m128, b: __m128) -> __m128 {
@ -1234,7 +1234,7 @@ pub unsafe fn _mm_move_ss(a: __m128, b: __m128) -> __m128 {
/// Guarantees that every store instruction that precedes, in program order, is
/// globally visible before any store instruction which follows the fence in
/// program order.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(sfence))]
pub unsafe fn _mm_sfence() {
@ -1244,7 +1244,7 @@ pub unsafe fn _mm_sfence() {
/// Get the unsigned 32-bit value of the MXCSR control and status register.
///
/// For more info see [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(stmxcsr))]
pub unsafe fn _mm_getcsr() -> u32 {
@ -1378,7 +1378,7 @@ pub unsafe fn _mm_getcsr() -> u32 {
/// _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // turn on
/// ```
///
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(ldmxcsr))]
pub unsafe fn _mm_setcsr(val: u32) {
@ -1435,7 +1435,7 @@ pub const _MM_FLUSH_ZERO_ON: u32 = 0x8000;
pub const _MM_FLUSH_ZERO_OFF: u32 = 0x0000;
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_GET_EXCEPTION_MASK() -> u32 {
@ -1443,7 +1443,7 @@ pub unsafe fn _MM_GET_EXCEPTION_MASK() -> u32 {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_GET_EXCEPTION_STATE() -> u32 {
@ -1451,7 +1451,7 @@ pub unsafe fn _MM_GET_EXCEPTION_STATE() -> u32 {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_GET_FLUSH_ZERO_MODE() -> u32 {
@ -1459,7 +1459,7 @@ pub unsafe fn _MM_GET_FLUSH_ZERO_MODE() -> u32 {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_GET_ROUNDING_MODE() -> u32 {
@ -1467,7 +1467,7 @@ pub unsafe fn _MM_GET_ROUNDING_MODE() -> u32 {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_SET_EXCEPTION_MASK(x: u32) {
@ -1475,7 +1475,7 @@ pub unsafe fn _MM_SET_EXCEPTION_MASK(x: u32) {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_SET_EXCEPTION_STATE(x: u32) {
@ -1483,7 +1483,7 @@ pub unsafe fn _MM_SET_EXCEPTION_STATE(x: u32) {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_SET_FLUSH_ZERO_MODE(x: u32) {
@ -1493,7 +1493,7 @@ pub unsafe fn _MM_SET_FLUSH_ZERO_MODE(x: u32) {
}
/// See [`_mm_setcsr`](fn._mm_setcsr.html)
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_SET_ROUNDING_MODE(x: u32) {
@ -1548,7 +1548,7 @@ pub const _MM_HINT_NTA: i8 = 0;
/// * Prefetching may also fail if there are not enough memory-subsystem
/// resources (e.g., request buffers).
///
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(prefetcht0, strategy = _MM_HINT_T0))]
#[cfg_attr(test, assert_instr(prefetcht1, strategy = _MM_HINT_T1))]
@ -1573,7 +1573,7 @@ pub unsafe fn _mm_prefetch(p: *const u8, strategy: i8) {
}
/// Return a vector of type `__m128` with undefined elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
pub unsafe fn _mm_undefined_ps() -> __m128 {
__m128(
@ -1585,7 +1585,7 @@ pub unsafe fn _mm_undefined_ps() -> __m128 {
}
/// Transpose the 4x4 matrix formed by the 4 rows of `__m128` in place.
#[inline(always)]
#[inline]
#[allow(non_snake_case)]
#[target_feature(enable = "sse")]
pub unsafe fn _MM_TRANSPOSE4_PS(
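The end result of the macro's shuffle sequence is an ordinary in-place 4x4 transpose; a scalar sketch of that effect (hypothetical helper, not the real shuffle-based implementation):

```rust
// Hypothetical scalar model of the in-place 4x4 transpose that
// `_MM_TRANSPOSE4_PS` performs on its four row vectors.
fn transpose4_model(m: &mut [[f32; 4]; 4]) {
    for i in 0..4 {
        for j in (i + 1)..4 {
            let t = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
    }
}

fn main() {
    let mut m = [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0],
    ];
    transpose4_model(&mut m);
    assert_eq!(m[0], [1.0, 5.0, 9.0, 13.0]); // first row is now the old first column
    assert_eq!(m[3], [4.0, 8.0, 12.0, 16.0]);
}
```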
@ -1684,7 +1684,7 @@ extern "C" {
///
/// `mem_addr` must be aligned on a 16-byte boundary or a general-protection
/// exception _may_ be generated.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(movntps))]
pub unsafe fn _mm_stream_ps(mem_addr: *mut f32, a: __m128) {
@ -1693,7 +1693,7 @@ pub unsafe fn _mm_stream_ps(mem_addr: *mut f32, a: __m128) {
/// Store 64 bits of integer data from `a` into memory using a non-temporal
/// memory hint.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse,mmx")]
#[cfg_attr(test, assert_instr(movntq))]
pub unsafe fn _mm_stream_pi(mem_addr: *mut __m64, a: __m64) {

File diff suppressed because it is too large

@ -9,7 +9,7 @@ use stdsimd_test::assert_instr;
/// Alternatively add and subtract packed single-precision (32-bit)
/// floating-point elements in `a` to/from packed elements in `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(addsubps))]
pub unsafe fn _mm_addsub_ps(a: __m128, b: __m128) -> __m128 {
@ -18,7 +18,7 @@ pub unsafe fn _mm_addsub_ps(a: __m128, b: __m128) -> __m128 {
/// Alternatively add and subtract packed double-precision (64-bit)
/// floating-point elements in `a` to/from packed elements in `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(addsubpd))]
pub unsafe fn _mm_addsub_pd(a: __m128d, b: __m128d) -> __m128d {
@ -27,7 +27,7 @@ pub unsafe fn _mm_addsub_pd(a: __m128d, b: __m128d) -> __m128d {
/// Horizontally add adjacent pairs of double-precision (64-bit)
/// floating-point elements in `a` and `b`, and pack the results.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(haddpd))]
pub unsafe fn _mm_hadd_pd(a: __m128d, b: __m128d) -> __m128d {
@ -36,7 +36,7 @@ pub unsafe fn _mm_hadd_pd(a: __m128d, b: __m128d) -> __m128d {
/// Horizontally add adjacent pairs of single-precision (32-bit)
/// floating-point elements in `a` and `b`, and pack the results.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(haddps))]
pub unsafe fn _mm_hadd_ps(a: __m128, b: __m128) -> __m128 {
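The lane placement of the horizontal add is the part worth pinning down: pair sums of `a` fill the low half of the result and pair sums of `b` the high half. A hypothetical scalar model of `haddps` (not the real intrinsic):

```rust
// Hypothetical scalar model of `_mm_hadd_ps` lane placement.
fn hadd_ps_model(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]
}

fn main() {
    let r = hadd_ps_model([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(r, [3.0, 7.0, 30.0, 70.0]);
}
```

The `hsub` variants follow the same placement with subtraction within each pair.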
@ -45,7 +45,7 @@ pub unsafe fn _mm_hadd_ps(a: __m128, b: __m128) -> __m128 {
/// Horizontally subtract adjacent pairs of double-precision (64-bit)
/// floating-point elements in `a` and `b`, and pack the results.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(hsubpd))]
pub unsafe fn _mm_hsub_pd(a: __m128d, b: __m128d) -> __m128d {
@ -54,7 +54,7 @@ pub unsafe fn _mm_hsub_pd(a: __m128d, b: __m128d) -> __m128d {
/// Horizontally subtract adjacent pairs of single-precision (32-bit)
/// floating-point elements in `a` and `b`, and pack the results.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(hsubps))]
pub unsafe fn _mm_hsub_ps(a: __m128, b: __m128) -> __m128 {
@ -64,7 +64,7 @@ pub unsafe fn _mm_hsub_ps(a: __m128, b: __m128) -> __m128 {
/// Load 128-bits of integer data from unaligned memory.
/// This intrinsic may perform better than `_mm_loadu_si128`
/// when the data crosses a cache line boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(lddqu))]
pub unsafe fn _mm_lddqu_si128(mem_addr: *const __m128i) -> __m128i {
@ -73,7 +73,7 @@ pub unsafe fn _mm_lddqu_si128(mem_addr: *const __m128i) -> __m128i {
/// Duplicate the low double-precision (64-bit) floating-point element
/// from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(movddup))]
pub unsafe fn _mm_movedup_pd(a: __m128d) -> __m128d {
@ -82,7 +82,7 @@ pub unsafe fn _mm_movedup_pd(a: __m128d) -> __m128d {
/// Load a double-precision (64-bit) floating-point element from memory
/// into both elements of return vector.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(movddup))]
pub unsafe fn _mm_loaddup_pd(mem_addr: *const f64) -> __m128d {
@ -91,7 +91,7 @@ pub unsafe fn _mm_loaddup_pd(mem_addr: *const f64) -> __m128d {
/// Duplicate odd-indexed single-precision (32-bit) floating-point elements
/// from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(movshdup))]
pub unsafe fn _mm_movehdup_ps(a: __m128) -> __m128 {
@ -100,7 +100,7 @@ pub unsafe fn _mm_movehdup_ps(a: __m128) -> __m128 {
/// Duplicate even-indexed single-precision (32-bit) floating-point elements
/// from `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse3")]
#[cfg_attr(test, assert_instr(movsldup))]
pub unsafe fn _mm_moveldup_ps(a: __m128) -> __m128 {
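The sse3 hunks above all swap the forced-inline hint for the plain one on functions like `_mm_hadd_ps`. A hypothetical standalone sketch of the same `#[inline]` + `#[target_feature]` pattern, assuming an x86_64 host (the wrapper name `hadd_lanes` is illustrative, not from the crate):

```rust
use std::arch::x86_64::*;

// A plain `#[inline]` hint lets LLVM decline to inline across mismatched
// target-feature sets, whereas `#[inline(always)]` could force an invalid
// cross-feature inline and trigger a codegen error.
#[inline]
#[target_feature(enable = "sse3")]
unsafe fn hadd_lanes(a: __m128, b: __m128) -> [f32; 4] {
    std::mem::transmute(_mm_hadd_ps(a, b))
}

fn hadd_demo() -> [f32; 4] {
    assert!(is_x86_feature_detected!("sse3"));
    unsafe {
        let a = _mm_set_ps(4.0, 3.0, 2.0, 1.0); // lanes [1, 2, 3, 4]
        let b = _mm_set_ps(8.0, 7.0, 6.0, 5.0); // lanes [5, 6, 7, 8]
        hadd_lanes(a, b) // pairwise sums: [1+2, 3+4, 5+6, 7+8]
    }
}

fn main() {
    assert_eq!(hadd_demo(), [3.0, 7.0, 11.0, 15.0]);
}
```

Runtime detection via `is_x86_feature_detected!` keeps the call sound even though the caller itself does not enable SSE3.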

@ -47,7 +47,7 @@ pub const _MM_FROUND_NEARBYINT: i32 =
/// The high bit of each corresponding mask byte determines the selection.
/// If the high bit is set the element of `a` is selected. The element
/// of `b` is selected otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pblendvb))]
pub unsafe fn _mm_blendv_epi8(a: __m128i, b: __m128i, mask: __m128i) -> __m128i {
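The `pblendvb` selection rule can be modeled in scalar Rust (an illustrative sketch, not the intrinsic itself; per Intel's documented semantics a set high bit in a mask byte selects the lane from `b`):

```rust
// Scalar model of pblendvb: for each of the 16 byte lanes, a set high bit
// in the mask byte picks the lane from `b`, otherwise from `a`.
fn blendv_epi8(a: [u8; 16], b: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    let mut r = [0u8; 16];
    for i in 0..16 {
        r[i] = if mask[i] & 0x80 != 0 { b[i] } else { a[i] };
    }
    r
}

fn main() {
    let mut mask = [0u8; 16];
    mask[0] = 0x80; // only lane 0 takes its value from `b`
    let r = blendv_epi8([1u8; 16], [2u8; 16], mask);
    assert_eq!(r[0], 2);
    assert_eq!(r[1], 1);
}
```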
@ -59,7 +59,7 @@ pub unsafe fn _mm_blendv_epi8(a: __m128i, b: __m128i, mask: __m128i) -> __m128i
/// The mask bits determine the selection. A clear bit selects the
/// corresponding element of `a`, and a set bit the corresponding
/// element of `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pblendw, imm8 = 0xF0))]
pub unsafe fn _mm_blend_epi16(a: __m128i, b: __m128i, imm8: i32) -> __m128i {
@ -73,7 +73,7 @@ pub unsafe fn _mm_blend_epi16(a: __m128i, b: __m128i, imm8: i32) -> __m128i {
/// Blend packed double-precision (64-bit) floating-point elements from `a`
/// and `b` using `mask`
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(blendvpd))]
pub unsafe fn _mm_blendv_pd(a: __m128d, b: __m128d, mask: __m128d) -> __m128d {
@ -82,7 +82,7 @@ pub unsafe fn _mm_blendv_pd(a: __m128d, b: __m128d, mask: __m128d) -> __m128d {
/// Blend packed single-precision (32-bit) floating-point elements from `a`
/// and `b` using `mask`
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(blendvps))]
pub unsafe fn _mm_blendv_ps(a: __m128, b: __m128, mask: __m128) -> __m128 {
@ -91,7 +91,7 @@ pub unsafe fn _mm_blendv_ps(a: __m128, b: __m128, mask: __m128) -> __m128 {
/// Blend packed double-precision (64-bit) floating-point elements from `a`
/// and `b` using control mask `imm2`
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(blendpd, imm2 = 0b10))]
pub unsafe fn _mm_blend_pd(a: __m128d, b: __m128d, imm2: i32) -> __m128d {
@ -103,7 +103,7 @@ pub unsafe fn _mm_blend_pd(a: __m128d, b: __m128d, imm2: i32) -> __m128d {
/// Blend packed single-precision (32-bit) floating-point elements from `a`
/// and `b` using mask `imm4`
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(blendps, imm4 = 0b0101))]
pub unsafe fn _mm_blend_ps(a: __m128, b: __m128, imm4: i32) -> __m128 {
@ -115,7 +115,7 @@ pub unsafe fn _mm_blend_ps(a: __m128, b: __m128, imm4: i32) -> __m128 {
/// Extract a single-precision (32-bit) floating-point element from `a`,
/// selected with `imm8`
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
// TODO: Add test for Windows
#[cfg_attr(all(test, not(windows)), assert_instr(extractps, imm8 = 0))]
@ -127,7 +127,7 @@ pub unsafe fn _mm_extract_ps(a: __m128, imm8: i32) -> i32 {
/// integer containing the zero-extended integer data.
///
/// See [LLVM commit D20468](https://reviews.llvm.org/D20468).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pextrb, imm8 = 0))]
pub unsafe fn _mm_extract_epi8(a: __m128i, imm8: i32) -> i32 {
@ -136,7 +136,7 @@ pub unsafe fn _mm_extract_epi8(a: __m128i, imm8: i32) -> i32 {
}
/// Extract a 32-bit integer from `a` selected with `imm8`
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
// TODO: Add test for Windows
#[cfg_attr(all(test, not(windows)), assert_instr(pextrd, imm8 = 1))]
@ -167,7 +167,7 @@ pub unsafe fn _mm_extract_epi32(a: __m128i, imm8: i32) -> i32 {
///
/// * Bits `[3:0]`: If any of these bits are set, the corresponding result
/// element is cleared.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(insertps, imm8 = 0b1010))]
pub unsafe fn _mm_insert_ps(a: __m128, b: __m128, imm8: i32) -> __m128 {
@ -179,7 +179,7 @@ pub unsafe fn _mm_insert_ps(a: __m128, b: __m128, imm8: i32) -> __m128 {
/// Return a copy of `a` with the 8-bit integer from `i` inserted at a
/// location specified by `imm8`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pinsrb, imm8 = 0))]
pub unsafe fn _mm_insert_epi8(a: __m128i, i: i8, imm8: i32) -> __m128i {
@ -188,7 +188,7 @@ pub unsafe fn _mm_insert_epi8(a: __m128i, i: i8, imm8: i32) -> __m128i {
/// Return a copy of `a` with the 32-bit integer from `i` inserted at a
/// location specified by `imm8`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pinsrd, imm8 = 0))]
pub unsafe fn _mm_insert_epi32(a: __m128i, i: i32, imm8: i32) -> __m128i {
@ -197,7 +197,7 @@ pub unsafe fn _mm_insert_epi32(a: __m128i, i: i32, imm8: i32) -> __m128i {
/// Compare packed 8-bit integers in `a` and `b` and return packed maximum
/// values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmaxsb))]
pub unsafe fn _mm_max_epi8(a: __m128i, b: __m128i) -> __m128i {
@ -206,7 +206,7 @@ pub unsafe fn _mm_max_epi8(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed unsigned 16-bit integers in `a` and `b`, and return packed
/// maximum.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmaxuw))]
pub unsafe fn _mm_max_epu16(a: __m128i, b: __m128i) -> __m128i {
@ -215,7 +215,7 @@ pub unsafe fn _mm_max_epu16(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed 32-bit integers in `a` and `b`, and return packed maximum
/// values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmaxsd))]
pub unsafe fn _mm_max_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -224,7 +224,7 @@ pub unsafe fn _mm_max_epi32(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed unsigned 32-bit integers in `a` and `b`, and return packed
/// maximum values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmaxud))]
pub unsafe fn _mm_max_epu32(a: __m128i, b: __m128i) -> __m128i {
@ -233,7 +233,7 @@ pub unsafe fn _mm_max_epu32(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed 8-bit integers in `a` and `b` and return packed minimum
/// values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pminsb))]
pub unsafe fn _mm_min_epi8(a: __m128i, b: __m128i) -> __m128i {
@ -242,7 +242,7 @@ pub unsafe fn _mm_min_epi8(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed unsigned 16-bit integers in `a` and `b`, and return packed
/// minimum.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pminuw))]
pub unsafe fn _mm_min_epu16(a: __m128i, b: __m128i) -> __m128i {
@ -251,7 +251,7 @@ pub unsafe fn _mm_min_epu16(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed 32-bit integers in `a` and `b`, and return packed minimum
/// values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pminsd))]
pub unsafe fn _mm_min_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -260,7 +260,7 @@ pub unsafe fn _mm_min_epi32(a: __m128i, b: __m128i) -> __m128i {
/// Compare packed unsigned 32-bit integers in `a` and `b`, and return packed
/// minimum values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pminud))]
pub unsafe fn _mm_min_epu32(a: __m128i, b: __m128i) -> __m128i {
@ -269,7 +269,7 @@ pub unsafe fn _mm_min_epu32(a: __m128i, b: __m128i) -> __m128i {
/// Convert packed 32-bit integers from `a` and `b` to packed 16-bit integers
/// using unsigned saturation
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(packusdw))]
pub unsafe fn _mm_packus_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -277,7 +277,7 @@ pub unsafe fn _mm_packus_epi32(a: __m128i, b: __m128i) -> __m128i {
}
/// Compare packed 64-bit integers in `a` and `b` for equality
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pcmpeqq))]
pub unsafe fn _mm_cmpeq_epi64(a: __m128i, b: __m128i) -> __m128i {
@ -285,7 +285,7 @@ pub unsafe fn _mm_cmpeq_epi64(a: __m128i, b: __m128i) -> __m128i {
}
/// Sign extend packed 8-bit integers in `a` to packed 16-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovsxbw))]
pub unsafe fn _mm_cvtepi8_epi16(a: __m128i) -> __m128i {
@ -295,7 +295,7 @@ pub unsafe fn _mm_cvtepi8_epi16(a: __m128i) -> __m128i {
}
/// Sign extend packed 8-bit integers in `a` to packed 32-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovsxbd))]
pub unsafe fn _mm_cvtepi8_epi32(a: __m128i) -> __m128i {
@ -306,7 +306,7 @@ pub unsafe fn _mm_cvtepi8_epi32(a: __m128i) -> __m128i {
/// Sign extend packed 8-bit integers in the low 8 bytes of `a` to packed
/// 64-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovsxbq))]
pub unsafe fn _mm_cvtepi8_epi64(a: __m128i) -> __m128i {
@ -316,7 +316,7 @@ pub unsafe fn _mm_cvtepi8_epi64(a: __m128i) -> __m128i {
}
/// Sign extend packed 16-bit integers in `a` to packed 32-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovsxwd))]
pub unsafe fn _mm_cvtepi16_epi32(a: __m128i) -> __m128i {
@ -326,7 +326,7 @@ pub unsafe fn _mm_cvtepi16_epi32(a: __m128i) -> __m128i {
}
/// Sign extend packed 16-bit integers in `a` to packed 64-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovsxwq))]
pub unsafe fn _mm_cvtepi16_epi64(a: __m128i) -> __m128i {
@ -336,7 +336,7 @@ pub unsafe fn _mm_cvtepi16_epi64(a: __m128i) -> __m128i {
}
/// Sign extend packed 32-bit integers in `a` to packed 64-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovsxdq))]
pub unsafe fn _mm_cvtepi32_epi64(a: __m128i) -> __m128i {
@ -346,7 +346,7 @@ pub unsafe fn _mm_cvtepi32_epi64(a: __m128i) -> __m128i {
}
/// Zero extend packed unsigned 8-bit integers in `a` to packed 16-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovzxbw))]
pub unsafe fn _mm_cvtepu8_epi16(a: __m128i) -> __m128i {
@ -356,7 +356,7 @@ pub unsafe fn _mm_cvtepu8_epi16(a: __m128i) -> __m128i {
}
/// Zero extend packed unsigned 8-bit integers in `a` to packed 32-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovzxbd))]
pub unsafe fn _mm_cvtepu8_epi32(a: __m128i) -> __m128i {
@ -366,7 +366,7 @@ pub unsafe fn _mm_cvtepu8_epi32(a: __m128i) -> __m128i {
}
/// Zero extend packed unsigned 8-bit integers in `a` to packed 64-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovzxbq))]
pub unsafe fn _mm_cvtepu8_epi64(a: __m128i) -> __m128i {
@ -377,7 +377,7 @@ pub unsafe fn _mm_cvtepu8_epi64(a: __m128i) -> __m128i {
/// Zero extend packed unsigned 16-bit integers in `a`
/// to packed 32-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovzxwd))]
pub unsafe fn _mm_cvtepu16_epi32(a: __m128i) -> __m128i {
@ -388,7 +388,7 @@ pub unsafe fn _mm_cvtepu16_epi32(a: __m128i) -> __m128i {
/// Zero extend packed unsigned 16-bit integers in `a`
/// to packed 64-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovzxwq))]
pub unsafe fn _mm_cvtepu16_epi64(a: __m128i) -> __m128i {
@ -399,7 +399,7 @@ pub unsafe fn _mm_cvtepu16_epi64(a: __m128i) -> __m128i {
/// Zero extend packed unsigned 32-bit integers in `a`
/// to packed 64-bit integers
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmovzxdq))]
pub unsafe fn _mm_cvtepu32_epi64(a: __m128i) -> __m128i {
@ -415,7 +415,7 @@ pub unsafe fn _mm_cvtepu32_epi64(a: __m128i) -> __m128i {
/// replaced by a value of `0.0`. If a broadcast mask bit is one, the result of
/// the dot product will be stored in the return value component. Otherwise if
/// the broadcast mask bit is zero then the return component will be zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(dppd, imm8 = 0))]
pub unsafe fn _mm_dp_pd(a: __m128d, b: __m128d, imm8: i32) -> __m128d {
@ -432,7 +432,7 @@ pub unsafe fn _mm_dp_pd(a: __m128d, b: __m128d, imm8: i32) -> __m128d {
/// replaced by a value of `0.0`. If a broadcast mask bit is one, the result of
/// the dot product will be stored in the return value component. Otherwise if
/// the broadcast mask bit is zero then the return component will be zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(dpps, imm8 = 0))]
pub unsafe fn _mm_dp_ps(a: __m128, b: __m128, imm8: i32) -> __m128 {
@ -445,7 +445,7 @@ pub unsafe fn _mm_dp_ps(a: __m128, b: __m128, imm8: i32) -> __m128 {
/// Round the packed double-precision (64-bit) floating-point elements in `a`
/// down to an integer value, and store the results as packed double-precision
/// floating-point elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundpd))]
pub unsafe fn _mm_floor_pd(a: __m128d) -> __m128d {
@ -455,7 +455,7 @@ pub unsafe fn _mm_floor_pd(a: __m128d) -> __m128d {
/// Round the packed single-precision (32-bit) floating-point elements in `a`
/// down to an integer value, and store the results as packed single-precision
/// floating-point elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundps))]
pub unsafe fn _mm_floor_ps(a: __m128) -> __m128 {
@ -467,7 +467,7 @@ pub unsafe fn _mm_floor_ps(a: __m128) -> __m128 {
/// floating-point element in the lower element of the intrinsic result,
/// and copy the upper element from `a` to the upper element of the intrinsic
/// result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundsd))]
pub unsafe fn _mm_floor_sd(a: __m128d, b: __m128d) -> __m128d {
@ -479,7 +479,7 @@ pub unsafe fn _mm_floor_sd(a: __m128d, b: __m128d) -> __m128d {
/// floating-point element in the lower element of the intrinsic result,
/// and copy the upper 3 packed elements from `a` to the upper elements
/// of the intrinsic result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundss))]
pub unsafe fn _mm_floor_ss(a: __m128, b: __m128) -> __m128 {
@ -489,7 +489,7 @@ pub unsafe fn _mm_floor_ss(a: __m128, b: __m128) -> __m128 {
/// Round the packed double-precision (64-bit) floating-point elements in `a`
/// up to an integer value, and store the results as packed double-precision
/// floating-point elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundpd))]
pub unsafe fn _mm_ceil_pd(a: __m128d) -> __m128d {
@ -499,7 +499,7 @@ pub unsafe fn _mm_ceil_pd(a: __m128d) -> __m128d {
/// Round the packed single-precision (32-bit) floating-point elements in `a`
/// up to an integer value, and store the results as packed single-precision
/// floating-point elements.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundps))]
pub unsafe fn _mm_ceil_ps(a: __m128) -> __m128 {
@ -511,7 +511,7 @@ pub unsafe fn _mm_ceil_ps(a: __m128) -> __m128 {
/// floating-point element in the lower element of the intrinsic result,
/// and copy the upper element from `a` to the upper element
/// of the intrinsic result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundsd))]
pub unsafe fn _mm_ceil_sd(a: __m128d, b: __m128d) -> __m128d {
@ -523,7 +523,7 @@ pub unsafe fn _mm_ceil_sd(a: __m128d, b: __m128d) -> __m128d {
/// floating-point element in the lower element of the intrinsic result,
/// and copy the upper 3 packed elements from `a` to the upper elements
/// of the intrinsic result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundss))]
pub unsafe fn _mm_ceil_ss(a: __m128, b: __m128) -> __m128 {
@ -549,7 +549,7 @@ pub unsafe fn _mm_ceil_ss(a: __m128, b: __m128) -> __m128 {
/// // use MXCSR.RC; see `vendor::_MM_SET_ROUNDING_MODE`:
/// vendor::_MM_FROUND_CUR_DIRECTION;
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundpd, rounding = 0))]
pub unsafe fn _mm_round_pd(a: __m128d, rounding: i32) -> __m128d {
@ -578,7 +578,7 @@ pub unsafe fn _mm_round_pd(a: __m128d, rounding: i32) -> __m128d {
/// // use MXCSR.RC; see `vendor::_MM_SET_ROUNDING_MODE`:
/// vendor::_MM_FROUND_CUR_DIRECTION;
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundps, rounding = 0))]
pub unsafe fn _mm_round_ps(a: __m128, rounding: i32) -> __m128 {
@ -609,7 +609,7 @@ pub unsafe fn _mm_round_ps(a: __m128, rounding: i32) -> __m128 {
/// // use MXCSR.RC; see `vendor::_MM_SET_ROUNDING_MODE`:
/// vendor::_MM_FROUND_CUR_DIRECTION;
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundsd, rounding = 0))]
pub unsafe fn _mm_round_sd(a: __m128d, b: __m128d, rounding: i32) -> __m128d {
@ -640,7 +640,7 @@ pub unsafe fn _mm_round_sd(a: __m128d, b: __m128d, rounding: i32) -> __m128d {
/// // use MXCSR.RC; see `vendor::_MM_SET_ROUNDING_MODE`:
/// vendor::_MM_FROUND_CUR_DIRECTION;
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(roundss, rounding = 0))]
pub unsafe fn _mm_round_ss(a: __m128, b: __m128, rounding: i32) -> __m128 {
@ -669,7 +669,7 @@ pub unsafe fn _mm_round_ss(a: __m128, b: __m128, rounding: i32) -> __m128 {
/// * bits `[15:0]` - contain the minimum value found in parameter `a`,
/// * bits `[18:16]` - contain the index of the minimum value
/// * remaining bits are set to `0`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(phminposuw))]
pub unsafe fn _mm_minpos_epu16(a: __m128i) -> __m128i {
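The bit layout described above can be sketched as a scalar model (illustrative only; `phminposuw` resolves ties in favor of the lowest index, which the first-match loop below reproduces):

```rust
// Scalar model of phminposuw: lane 0 of the result holds the minimum
// unsigned 16-bit value, lane 1 holds its (first) index, the rest are zero.
fn minpos_epu16(a: [u16; 8]) -> [u16; 8] {
    let (mut min, mut idx) = (a[0], 0u16);
    for i in 1..8 {
        if a[i] < min {
            min = a[i];
            idx = i as u16;
        }
    }
    let mut r = [0u16; 8];
    r[0] = min;
    r[1] = idx;
    r
}

fn main() {
    let r = minpos_epu16([5, 3, 9, 3, 7, 8, 2, 6]);
    assert_eq!((r[0], r[1]), (2, 6)); // minimum 2 at index 6
}
```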
@ -678,7 +678,7 @@ pub unsafe fn _mm_minpos_epu16(a: __m128i) -> __m128i {
/// Multiply the low 32-bit integers from each packed 64-bit
/// element in `a` and `b`, and return the signed 64-bit result.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmuldq))]
pub unsafe fn _mm_mul_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -691,7 +691,7 @@ pub unsafe fn _mm_mul_epi32(a: __m128i, b: __m128i) -> __m128i {
/// __m128i::splat(2)` returns the obvious `__m128i::splat(4)`, due to wrapping
/// arithmetic `pmulld __m128i::splat(i32::MAX), __m128i::splat(2)` would return a
/// negative number.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pmulld))]
pub unsafe fn _mm_mullo_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -729,7 +729,7 @@ pub unsafe fn _mm_mullo_epi32(a: __m128i, b: __m128i) -> __m128i {
///
/// * A `__m128i` vector containing the sums of the sets of
/// absolute differences between both operands.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(mpsadbw, imm8 = 0))]
pub unsafe fn _mm_mpsadbw_epu8(a: __m128i, b: __m128i, imm8: i32) -> __m128i {

@ -48,7 +48,7 @@ pub const _SIDD_UNIT_MASK: i32 = 0b0100_0000;
/// Compare packed strings with implicit lengths in `a` and `b` using the
/// control in `imm8`, and return the generated mask.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistrm, imm8 = 0))]
pub unsafe fn _mm_cmpistrm(a: __m128i, b: __m128i, imm8: i32) -> __m128i {
@ -258,7 +258,7 @@ pub unsafe fn _mm_cmpistrm(a: __m128i, b: __m128i, imm8: i32) -> __m128i {
/// [`_SIDD_LEAST_SIGNIFICANT`]: constant._SIDD_LEAST_SIGNIFICANT.html
/// [`_SIDD_MOST_SIGNIFICANT`]: constant._SIDD_MOST_SIGNIFICANT.html
/// [`_mm_cmpestri`]: fn._mm_cmpestri.html
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistri, imm8 = 0))]
pub unsafe fn _mm_cmpistri(a: __m128i, b: __m128i, imm8: i32) -> i32 {
@ -273,7 +273,7 @@ pub unsafe fn _mm_cmpistri(a: __m128i, b: __m128i, imm8: i32) -> i32 {
/// Compare packed strings with implicit lengths in `a` and `b` using the
/// control in `imm8`, and return `1` if any character in `b` was null,
/// and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistri, imm8 = 0))]
pub unsafe fn _mm_cmpistrz(a: __m128i, b: __m128i, imm8: i32) -> i32 {
@ -288,7 +288,7 @@ pub unsafe fn _mm_cmpistrz(a: __m128i, b: __m128i, imm8: i32) -> i32 {
/// Compare packed strings with implicit lengths in `a` and `b` using the
/// control in `imm8`, and return `1` if the resulting mask was non-zero,
/// and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistri, imm8 = 0))]
pub unsafe fn _mm_cmpistrc(a: __m128i, b: __m128i, imm8: i32) -> i32 {
@ -303,7 +303,7 @@ pub unsafe fn _mm_cmpistrc(a: __m128i, b: __m128i, imm8: i32) -> i32 {
/// Compare packed strings with implicit lengths in `a` and `b` using the
/// control in `imm8`, and returns `1` if any character in `a` was null,
/// and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistri, imm8 = 0))]
pub unsafe fn _mm_cmpistrs(a: __m128i, b: __m128i, imm8: i32) -> i32 {
@ -317,7 +317,7 @@ pub unsafe fn _mm_cmpistrs(a: __m128i, b: __m128i, imm8: i32) -> i32 {
/// Compare packed strings with implicit lengths in `a` and `b` using the
/// control in `imm8`, and return bit `0` of the resulting bit mask.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistri, imm8 = 0))]
pub unsafe fn _mm_cmpistro(a: __m128i, b: __m128i, imm8: i32) -> i32 {
@ -332,7 +332,7 @@ pub unsafe fn _mm_cmpistro(a: __m128i, b: __m128i, imm8: i32) -> i32 {
/// Compare packed strings with implicit lengths in `a` and `b` using the
/// control in `imm8`, and return `1` if `b` did not contain a null
/// character and the resulting mask was zero, and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpistri, imm8 = 0))]
pub unsafe fn _mm_cmpistra(a: __m128i, b: __m128i, imm8: i32) -> i32 {
@ -346,7 +346,7 @@ pub unsafe fn _mm_cmpistra(a: __m128i, b: __m128i, imm8: i32) -> i32 {
/// Compare packed strings in `a` and `b` with lengths `la` and `lb`
/// using the control in `imm8`, and return the generated mask.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestrm, imm8 = 0))]
pub unsafe fn _mm_cmpestrm(
@ -439,7 +439,7 @@ pub unsafe fn _mm_cmpestrm(
/// [`_SIDD_LEAST_SIGNIFICANT`]: constant._SIDD_LEAST_SIGNIFICANT.html
/// [`_SIDD_MOST_SIGNIFICANT`]: constant._SIDD_MOST_SIGNIFICANT.html
/// [`_mm_cmpistri`]: fn._mm_cmpistri.html
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestri, imm8 = 0))]
pub unsafe fn _mm_cmpestri(
@ -456,7 +456,7 @@ pub unsafe fn _mm_cmpestri(
/// Compare packed strings in `a` and `b` with lengths `la` and `lb`
/// using the control in `imm8`, and return `1` if any character in
/// `b` was null, and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestri, imm8 = 0))]
pub unsafe fn _mm_cmpestrz(
@ -473,7 +473,7 @@ pub unsafe fn _mm_cmpestrz(
/// Compare packed strings in `a` and `b` with lengths `la` and `lb`
/// using the control in `imm8`, and return `1` if the resulting mask
/// was non-zero, and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestri, imm8 = 0))]
pub unsafe fn _mm_cmpestrc(
@ -490,7 +490,7 @@ pub unsafe fn _mm_cmpestrc(
/// Compare packed strings in `a` and `b` with lengths `la` and `lb`
/// using the control in `imm8`, and return `1` if any character in
/// `a` was null, and `0` otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestri, imm8 = 0))]
pub unsafe fn _mm_cmpestrs(
@ -507,7 +507,7 @@ pub unsafe fn _mm_cmpestrs(
/// Compare packed strings in `a` and `b` with lengths `la` and `lb`
/// using the control in `imm8`, and return bit `0` of the resulting
/// bit mask.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestri, imm8 = 0))]
pub unsafe fn _mm_cmpestro(
@ -525,7 +525,7 @@ pub unsafe fn _mm_cmpestro(
/// using the control in `imm8`, and return `1` if `b` did not
/// contain a null character and the resulting mask was zero, and `0`
/// otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpestri, imm8 = 0))]
pub unsafe fn _mm_cmpestra(
@ -541,7 +541,7 @@ pub unsafe fn _mm_cmpestra(
/// Starting with the initial value in `crc`, return the accumulated
/// CRC32 value for unsigned 8-bit integer `v`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(crc32))]
pub unsafe fn _mm_crc32_u8(crc: u32, v: u8) -> u32 {
@ -550,7 +550,7 @@ pub unsafe fn _mm_crc32_u8(crc: u32, v: u8) -> u32 {
/// Starting with the initial value in `crc`, return the accumulated
/// CRC32 value for unsigned 16-bit integer `v`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(crc32))]
pub unsafe fn _mm_crc32_u16(crc: u32, v: u16) -> u32 {
@ -559,7 +559,7 @@ pub unsafe fn _mm_crc32_u16(crc: u32, v: u16) -> u32 {
/// Starting with the initial value in `crc`, return the accumulated
/// CRC32 value for unsigned 32-bit integer `v`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(crc32))]
pub unsafe fn _mm_crc32_u32(crc: u32, v: u32) -> u32 {
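The `crc32` instruction behind these three intrinsics computes CRC-32C (Castagnoli, reflected polynomial `0x82F63B78`). A bitwise software model — equivalent to folding `_mm_crc32_u8` over the input with the conventional initial and final complements, which the intrinsic itself leaves to the caller:

```rust
// One byte of CRC-32C, bit by bit (software model of the crc32 instruction).
fn crc32c_step(crc: u32, v: u8) -> u32 {
    let mut crc = crc ^ v as u32;
    for _ in 0..8 {
        crc = if crc & 1 != 0 {
            (crc >> 1) ^ 0x82F6_3B78 // reflected CRC-32C polynomial
        } else {
            crc >> 1
        };
    }
    crc
}

// Full checksum with the conventional init (!0) and final complement.
fn crc32c(data: &[u8]) -> u32 {
    !data.iter().fold(!0u32, |crc, &b| crc32c_step(crc, b))
}

fn main() {
    // Standard CRC-32C check value for the nine ASCII digits "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
}
```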

@ -11,7 +11,7 @@ use x86::*;
/// Compute the absolute value of packed 8-bit signed integers in `a` and
/// return the unsigned results.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(pabsb))]
pub unsafe fn _mm_abs_epi8(a: __m128i) -> __m128i {
@ -21,7 +21,7 @@ pub unsafe fn _mm_abs_epi8(a: __m128i) -> __m128i {
/// Compute the absolute value of each of the packed 16-bit signed integers in
/// `a` and return the 16-bit unsigned results.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(pabsw))]
pub unsafe fn _mm_abs_epi16(a: __m128i) -> __m128i {
@ -31,7 +31,7 @@ pub unsafe fn _mm_abs_epi16(a: __m128i) -> __m128i {
/// Compute the absolute value of each of the packed 32-bit signed integers in
/// `a` and return the 32-bit unsigned results.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(pabsd))]
pub unsafe fn _mm_abs_epi32(a: __m128i) -> __m128i {
@ -62,7 +62,7 @@ pub unsafe fn _mm_abs_epi32(a: __m128i) -> __m128i {
/// r
/// }
/// ```
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(pshufb))]
pub unsafe fn _mm_shuffle_epi8(a: __m128i, b: __m128i) -> __m128i {
@ -71,7 +71,7 @@ pub unsafe fn _mm_shuffle_epi8(a: __m128i, b: __m128i) -> __m128i {
/// Concatenate 16-byte blocks in `a` and `b` into a 32-byte temporary result,
/// shift the result right by `n` bytes, and return the low 16 bytes.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(palignr, n = 15))]
pub unsafe fn _mm_alignr_epi8(a: __m128i, b: __m128i, n: i32) -> __m128i {
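The concatenate-and-shift behavior described above can be modeled in scalar Rust (an illustrative sketch of the `palignr` semantics, with `a` forming the high half of the 32-byte temporary and `b` the low half):

```rust
// Scalar model of palignr: concatenate `a` (high) and `b` (low) into a
// 32-byte buffer, shift right by `n` bytes, keep the low 16 bytes.
fn alignr_epi8(a: [u8; 16], b: [u8; 16], n: usize) -> [u8; 16] {
    let mut t = [0u8; 32];
    t[..16].copy_from_slice(&b);
    t[16..].copy_from_slice(&a);
    let mut r = [0u8; 16];
    for i in 0..16 {
        if n + i < 32 {
            r[i] = t[n + i];
        } // bytes shifted in from beyond the pair are zero
    }
    r
}

fn main() {
    let a = [1u8; 16];
    let b = [2u8; 16];
    assert_eq!(alignr_epi8(a, b, 0), b);  // shift 0: the low half, i.e. `b`
    assert_eq!(alignr_epi8(a, b, 16), a); // shift 16: the high half, i.e. `a`
}
```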
@ -129,7 +129,7 @@ pub unsafe fn _mm_alignr_epi8(a: __m128i, b: __m128i, n: i32) -> __m128i {
/// Horizontally add the adjacent pairs of values contained in 2 packed
/// 128-bit vectors of [8 x i16].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(phaddw))]
pub unsafe fn _mm_hadd_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -139,7 +139,7 @@ pub unsafe fn _mm_hadd_epi16(a: __m128i, b: __m128i) -> __m128i {
/// Horizontally add the adjacent pairs of values contained in 2 packed
/// 128-bit vectors of [8 x i16]. Positive sums greater than 7FFFh are
/// saturated to 7FFFh. Negative sums less than 8000h are saturated to 8000h.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(phaddsw))]
pub unsafe fn _mm_hadds_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -148,7 +148,7 @@ pub unsafe fn _mm_hadds_epi16(a: __m128i, b: __m128i) -> __m128i {
/// Horizontally add the adjacent pairs of values contained in 2 packed
/// 128-bit vectors of [4 x i32].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(phaddd))]
pub unsafe fn _mm_hadd_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -157,7 +157,7 @@ pub unsafe fn _mm_hadd_epi32(a: __m128i, b: __m128i) -> __m128i {
/// Horizontally subtract the adjacent pairs of values contained in 2
/// packed 128-bit vectors of [8 x i16].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(phsubw))]
pub unsafe fn _mm_hsub_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -168,7 +168,7 @@ pub unsafe fn _mm_hsub_epi16(a: __m128i, b: __m128i) -> __m128i {
/// packed 128-bit vectors of [8 x i16]. Positive differences greater than
/// 7FFFh are saturated to 7FFFh. Negative differences less than 8000h are
/// saturated to 8000h.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(phsubsw))]
pub unsafe fn _mm_hsubs_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -177,7 +177,7 @@ pub unsafe fn _mm_hsubs_epi16(a: __m128i, b: __m128i) -> __m128i {
/// Horizontally subtract the adjacent pairs of values contained in 2
/// packed 128-bit vectors of [4 x i32].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(phsubd))]
pub unsafe fn _mm_hsub_epi32(a: __m128i, b: __m128i) -> __m128i {
@ -189,7 +189,7 @@ pub unsafe fn _mm_hsub_epi32(a: __m128i, b: __m128i) -> __m128i {
/// integer values contained in the second source operand, add pairs of
/// contiguous products with signed saturation, and write the 16-bit sums to
/// the corresponding bits in the destination.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(pmaddubsw))]
pub unsafe fn _mm_maddubs_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -199,7 +199,7 @@ pub unsafe fn _mm_maddubs_epi16(a: __m128i, b: __m128i) -> __m128i {
/// Multiply packed 16-bit signed integer values, truncate the 32-bit
/// product to the 18 most significant bits by right-shifting, round the
/// truncated value by adding 1, and write bits [16:1] to the destination.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(pmulhrsw))]
pub unsafe fn _mm_mulhrs_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -210,7 +210,7 @@ pub unsafe fn _mm_mulhrs_epi16(a: __m128i, b: __m128i) -> __m128i {
/// integer in `b` is negative, and return the result.
/// Elements in result are zeroed out when the corresponding element in `b`
/// is zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(psignb))]
pub unsafe fn _mm_sign_epi8(a: __m128i, b: __m128i) -> __m128i {
@ -221,7 +221,7 @@ pub unsafe fn _mm_sign_epi8(a: __m128i, b: __m128i) -> __m128i {
/// integer in `b` is negative, and return the results.
/// Elements in result are zeroed out when the corresponding element in `b`
/// is zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(psignw))]
pub unsafe fn _mm_sign_epi16(a: __m128i, b: __m128i) -> __m128i {
@ -232,7 +232,7 @@ pub unsafe fn _mm_sign_epi16(a: __m128i, b: __m128i) -> __m128i {
/// integer in `b` is negative, and return the results.
/// Elements in result are zeroed out when the corresponding element in `b`
/// is zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3")]
#[cfg_attr(test, assert_instr(psignd))]
pub unsafe fn _mm_sign_epi32(a: __m128i, b: __m128i) -> __m128i {
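The `psign*` family above applies the same per-lane rule at three widths; a scalar sketch of one lane (hypothetical helper, not the crate's API):

```rust
/// Scalar model of a single `psign*` lane: negate `a` when `b` is
/// negative, zero it when `b` is zero, pass it through otherwise.
fn sign_lane(a: i32, b: i32) -> i32 {
    if b < 0 {
        a.wrapping_neg()
    } else if b == 0 {
        0
    } else {
        a
    }
}
```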


@ -27,7 +27,7 @@ extern "C" {
/// Extracts bits in range [`start`, `start` + `len`) from `a` into
/// the least significant bits of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
pub fn _bextr_u32(a: u32, start: u32, len: u32) -> u32 {
_bextr2_u32(a, (start & 0xffu32) | ((len & 0xffu32) << 8u32))
@ -35,7 +35,7 @@ pub fn _bextr_u32(a: u32, start: u32, len: u32) -> u32 {
/// Extracts bits in range [`start`, `start` + `len`) from `a` into
/// the least significant bits of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
pub fn _bextr_u64(a: u64, start: u64, len: u64) -> u64 {
_bextr2_u64(a, (start & 0xffu64) | ((len & 0xffu64) << 8u64))
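The bit extraction here is just a shift and a mask; a scalar sketch (assumes `start + len <= 32` and `len < 32`; the real instruction masks each input to 8 bits, and the helper name is invented):

```rust
/// Scalar model of `bextr`: move the field down to bit 0, then mask
/// off everything above `len` bits.
fn bextr(a: u32, start: u32, len: u32) -> u32 {
    (a >> start) & ((1u32 << len) - 1)
}
```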
@ -46,7 +46,7 @@ pub fn _bextr_u64(a: u64, start: u64, len: u64) -> u64 {
///
/// Bits [7,0] of `control` specify the index to the first bit in the range to
/// be extracted, and bits [15,8] specify the length of the range.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
pub fn _bextr2_u32(a: u32, control: u32) -> u32 {
unsafe { x86_tbm_bextri_u32(a, control) }
@ -57,7 +57,7 @@ pub fn _bextr2_u32(a: u32, control: u32) -> u32 {
///
/// Bits [7,0] of `control` specify the index to the first bit in the range to
/// be extracted, and bits [15,8] specify the length of the range.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
pub fn _bextr2_u64(a: u64, control: u64) -> u64 {
unsafe { x86_tbm_bextri_u64(a, control) }
@ -67,7 +67,7 @@ pub fn _bextr2_u64(a: u64, control: u64) -> u64 {
/// Clears all bits below the least significant zero bit of `x`.
///
/// If there is no zero bit in `x`, it returns zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcfill))]
pub unsafe fn _blcfill_u32(x: u32) -> u32 {
@ -77,7 +77,7 @@ pub unsafe fn _blcfill_u32(x: u32) -> u32 {
/// Clears all bits below the least significant zero bit of `x`.
///
/// If there is no zero bit in `x`, it returns zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcfill))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -88,7 +88,7 @@ pub unsafe fn _blcfill_u64(x: u64) -> u64 {
/// Sets all bits of `x` to 1 except for the least significant zero bit.
///
/// If there is no zero bit in `x`, it sets all bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blci))]
pub unsafe fn _blci_u32(x: u32) -> u32 {
@ -98,7 +98,7 @@ pub unsafe fn _blci_u32(x: u32) -> u32 {
/// Sets all bits of `x` to 1 except for the least significant zero bit.
///
/// If there is no zero bit in `x`, it sets all bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blci))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -109,7 +109,7 @@ pub unsafe fn _blci_u64(x: u64) -> u64 {
/// Sets the least significant zero bit of `x` and clears all other bits.
///
/// If there is no zero bit in `x`, it returns zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcic))]
pub unsafe fn _blcic_u32(x: u32) -> u32 {
@ -119,7 +119,7 @@ pub unsafe fn _blcic_u32(x: u32) -> u32 {
/// Sets the least significant zero bit of `x` and clears all other bits.
///
/// If there is no zero bit in `x`, it returns zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcic))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -131,7 +131,7 @@ pub unsafe fn _blcic_u64(x: u64) -> u64 {
/// that bit.
///
/// If there is no zero bit in `x`, it sets all the bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcmsk))]
pub unsafe fn _blcmsk_u32(x: u32) -> u32 {
@ -142,7 +142,7 @@ pub unsafe fn _blcmsk_u32(x: u32) -> u32 {
/// that bit.
///
/// If there is no zero bit in `x`, it sets all the bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcmsk))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -153,7 +153,7 @@ pub unsafe fn _blcmsk_u64(x: u64) -> u64 {
/// Sets the least significant zero bit of `x`.
///
/// If there is no zero bit in `x`, it returns `x`.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcs))]
pub unsafe fn _blcs_u32(x: u32) -> u32 {
@ -163,7 +163,7 @@ pub unsafe fn _blcs_u32(x: u32) -> u32 {
/// Sets the least significant zero bit of `x`.
///
/// If there is no zero bit in `x`, it returns `x`.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blcs))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -174,7 +174,7 @@ pub unsafe fn _blcs_u64(x: u64) -> u64 {
/// Sets all bits of `x` below the least significant one.
///
/// If there is no set bit in `x`, it sets all the bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blsfill))]
pub unsafe fn _blsfill_u32(x: u32) -> u32 {
@ -184,7 +184,7 @@ pub unsafe fn _blsfill_u32(x: u32) -> u32 {
/// Sets all bits of `x` below the least significant one.
///
/// If there is no set bit in `x`, it sets all the bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blsfill))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -195,7 +195,7 @@ pub unsafe fn _blsfill_u64(x: u64) -> u64 {
/// Clears least significant bit and sets all other bits.
///
/// If there is no set bit in `x`, it sets all the bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blsic))]
pub unsafe fn _blsic_u32(x: u32) -> u32 {
@ -205,7 +205,7 @@ pub unsafe fn _blsic_u32(x: u32) -> u32 {
/// Clears least significant bit and sets all other bits.
///
/// If there is no set bit in `x`, it sets all the bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(blsic))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -217,7 +217,7 @@ pub unsafe fn _blsic_u64(x: u64) -> u64 {
/// bits.
///
/// If the least significant bit of `x` is 0, it sets all bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(t1mskc))]
pub unsafe fn _t1mskc_u32(x: u32) -> u32 {
@ -228,7 +228,7 @@ pub unsafe fn _t1mskc_u32(x: u32) -> u32 {
/// bits.
///
/// If the least significant bit of `x` is 0, it sets all bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(t1mskc))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -240,7 +240,7 @@ pub unsafe fn _t1mskc_u64(x: u64) -> u64 {
/// bits.
///
/// If the least significant bit of `x` is 1, it returns zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(tzmsk))]
pub unsafe fn _tzmsk_u32(x: u32) -> u32 {
@ -251,7 +251,7 @@ pub unsafe fn _tzmsk_u32(x: u32) -> u32 {
/// bits.
///
/// If the least significant bit of `x` is 1, it returns zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "tbm")]
#[cfg_attr(test, assert_instr(tzmsk))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
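Most of the TBM operations in this file reduce to short bit identities; a scalar sketch of three of them (helper names are made up, wrapping arithmetic mirrors the hardware's modular add/subtract):

```rust
/// Scalar identities behind a few TBM instructions.
fn blcfill(x: u32) -> u32 { x & x.wrapping_add(1) } // clear trailing ones
fn blsfill(x: u32) -> u32 { x | x.wrapping_sub(1) } // set bits below lowest one
fn tzmsk(x: u32) -> u32 { !x & x.wrapping_sub(1) } // mask of trailing zeros
```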


@ -33,7 +33,7 @@ extern "C" {
///
/// The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of
/// Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave")]
#[cfg_attr(test, assert_instr(xsave))]
pub unsafe fn _xsave(mem_addr: *mut u8, save_mask: u64) {
@ -46,7 +46,7 @@ pub unsafe fn _xsave(mem_addr: *mut u8, save_mask: u64) {
/// State is restored based on bits [62:0] in `rs_mask`, `XCR0`, and
/// `mem_addr.HEADER.XSTATE_BV`. `mem_addr` must be aligned on a 64-byte
/// boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave")]
#[cfg_attr(test, assert_instr(xrstor))]
pub unsafe fn _xrstor(mem_addr: *const u8, rs_mask: u64) {
@ -62,7 +62,7 @@ const _XCR_XFEATURE_ENABLED_MASK: u32 = 0;
/// by `a`.
///
/// Currently only `XFEATURE_ENABLED_MASK` `XCR` is supported.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave")]
#[cfg_attr(test, assert_instr(xsetbv))]
pub unsafe fn _xsetbv(a: u32, val: u64) {
@ -71,7 +71,7 @@ pub unsafe fn _xsetbv(a: u32, val: u64) {
/// Reads the contents of the extended control register `XCR`
/// specified in `xcr_no`.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave")]
#[cfg_attr(test, assert_instr(xgetbv))]
pub unsafe fn _xgetbv(xcr_no: u32) -> u64 {
@ -85,7 +85,7 @@ pub unsafe fn _xgetbv(xcr_no: u32) -> u64 {
/// `mem_addr` must be aligned on a 64-byte boundary. The hardware may optimize
/// the manner in which data is saved. The performance of this instruction will
/// be equal to or better than using the `XSAVE` instruction.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsaveopt")]
#[cfg_attr(test, assert_instr(xsaveopt))]
pub unsafe fn _xsaveopt(mem_addr: *mut u8, save_mask: u64) {
@ -98,7 +98,7 @@ pub unsafe fn _xsaveopt(mem_addr: *mut u8, save_mask: u64) {
/// `xsavec` differs from `xsave` in that it uses compaction and that it may
/// use init optimization. State is saved based on bits [62:0] in `save_mask`
/// and `XCR0`. `mem_addr` must be aligned on a 64-byte boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsavec")]
#[cfg_attr(test, assert_instr(xsavec))]
pub unsafe fn _xsavec(mem_addr: *mut u8, save_mask: u64) {
@ -112,7 +112,7 @@ pub unsafe fn _xsavec(mem_addr: *mut u8, save_mask: u64) {
/// corresponding to bits set in `IA32_XSS` `MSR` and that it may use the
/// modified optimization. State is saved based on bits [62:0] in `save_mask`
/// and `XCR0`. `mem_addr` must be aligned on a 64-byte boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsaves")]
#[cfg_attr(test, assert_instr(xsaves))]
pub unsafe fn _xsaves(mem_addr: *mut u8, save_mask: u64) {
@ -128,7 +128,7 @@ pub unsafe fn _xsaves(mem_addr: *mut u8, save_mask: u64) {
/// State is restored based on bits [62:0] in `rs_mask`, `XCR0`, and
/// `mem_addr.HEADER.XSTATE_BV`. `mem_addr` must be aligned on a 64-byte
/// boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsaves")]
#[cfg_attr(test, assert_instr(xrstors))]
pub unsafe fn _xrstors(mem_addr: *const u8, rs_mask: u64) {


@ -16,7 +16,7 @@ use core::mem;
use stdsimd_test::assert_instr;
/// Constructs a 64-bit integer vector initialized to zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
// FIXME: this produces a movl instead of xorps on x86
// FIXME: this produces a xor intrinsic instead of xorps on x86_64
@ -26,7 +26,7 @@ pub unsafe fn _mm_setzero_si64() -> __m64 {
}
/// Add packed 8-bit integers in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddb))]
pub unsafe fn _mm_add_pi8(a: __m64, b: __m64) -> __m64 {
@ -34,7 +34,7 @@ pub unsafe fn _mm_add_pi8(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 8-bit integers in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddb))]
pub unsafe fn _m_paddb(a: __m64, b: __m64) -> __m64 {
@ -42,7 +42,7 @@ pub unsafe fn _m_paddb(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 16-bit integers in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddw))]
pub unsafe fn _mm_add_pi16(a: __m64, b: __m64) -> __m64 {
@ -50,7 +50,7 @@ pub unsafe fn _mm_add_pi16(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 16-bit integers in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddw))]
pub unsafe fn _m_paddw(a: __m64, b: __m64) -> __m64 {
@ -58,7 +58,7 @@ pub unsafe fn _m_paddw(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 32-bit integers in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddd))]
pub unsafe fn _mm_add_pi32(a: __m64, b: __m64) -> __m64 {
@ -66,7 +66,7 @@ pub unsafe fn _mm_add_pi32(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 32-bit integers in `a` and `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddd))]
pub unsafe fn _m_paddd(a: __m64, b: __m64) -> __m64 {
@ -74,7 +74,7 @@ pub unsafe fn _m_paddd(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 8-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddsb))]
pub unsafe fn _mm_adds_pi8(a: __m64, b: __m64) -> __m64 {
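The saturating adds in this section clamp per lane instead of wrapping; a scalar sketch of the signed and unsigned 8-bit cases (hypothetical helpers):

```rust
/// Scalar models of `paddsb`/`paddusb`: per-lane addition saturating
/// to the signed or unsigned 8-bit range.
fn adds_i8(a: i8, b: i8) -> i8 { a.saturating_add(b) }
fn adds_u8(a: u8, b: u8) -> u8 { a.saturating_add(b) }
```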
@ -82,7 +82,7 @@ pub unsafe fn _mm_adds_pi8(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 8-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddsb))]
pub unsafe fn _m_paddsb(a: __m64, b: __m64) -> __m64 {
@ -90,7 +90,7 @@ pub unsafe fn _m_paddsb(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_pi16(a: __m64, b: __m64) -> __m64 {
@ -98,7 +98,7 @@ pub unsafe fn _mm_adds_pi16(a: __m64, b: __m64) -> __m64 {
}
/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _m_paddsw(a: __m64, b: __m64) -> __m64 {
@ -106,7 +106,7 @@ pub unsafe fn _m_paddsw(a: __m64, b: __m64) -> __m64 {
}
/// Add packed unsigned 8-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddusb))]
pub unsafe fn _mm_adds_pu8(a: __m64, b: __m64) -> __m64 {
@ -114,7 +114,7 @@ pub unsafe fn _mm_adds_pu8(a: __m64, b: __m64) -> __m64 {
}
/// Add packed unsigned 8-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddusb))]
pub unsafe fn _m_paddusb(a: __m64, b: __m64) -> __m64 {
@ -122,7 +122,7 @@ pub unsafe fn _m_paddusb(a: __m64, b: __m64) -> __m64 {
}
/// Add packed unsigned 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddusw))]
pub unsafe fn _mm_adds_pu16(a: __m64, b: __m64) -> __m64 {
@ -130,7 +130,7 @@ pub unsafe fn _mm_adds_pu16(a: __m64, b: __m64) -> __m64 {
}
/// Add packed unsigned 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(paddusw))]
pub unsafe fn _m_paddusw(a: __m64, b: __m64) -> __m64 {
@ -138,7 +138,7 @@ pub unsafe fn _m_paddusw(a: __m64, b: __m64) -> __m64 {
}
/// Subtract packed 8-bit integers in `b` from packed 8-bit integers in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubb))]
pub unsafe fn _mm_sub_pi8(a: __m64, b: __m64) -> __m64 {
@ -146,7 +146,7 @@ pub unsafe fn _mm_sub_pi8(a: __m64, b: __m64) -> __m64 {
}
/// Subtract packed 8-bit integers in `b` from packed 8-bit integers in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubb))]
pub unsafe fn _m_psubb(a: __m64, b: __m64) -> __m64 {
@ -154,7 +154,7 @@ pub unsafe fn _m_psubb(a: __m64, b: __m64) -> __m64 {
}
/// Subtract packed 16-bit integers in `b` from packed 16-bit integers in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubw))]
pub unsafe fn _mm_sub_pi16(a: __m64, b: __m64) -> __m64 {
@ -162,7 +162,7 @@ pub unsafe fn _mm_sub_pi16(a: __m64, b: __m64) -> __m64 {
}
/// Subtract packed 16-bit integers in `b` from packed 16-bit integers in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubw))]
pub unsafe fn _m_psubw(a: __m64, b: __m64) -> __m64 {
@ -170,7 +170,7 @@ pub unsafe fn _m_psubw(a: __m64, b: __m64) -> __m64 {
}
/// Subtract packed 32-bit integers in `b` from packed 32-bit integers in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubd))]
pub unsafe fn _mm_sub_pi32(a: __m64, b: __m64) -> __m64 {
@ -178,7 +178,7 @@ pub unsafe fn _mm_sub_pi32(a: __m64, b: __m64) -> __m64 {
}
/// Subtract packed 32-bit integers in `b` from packed 32-bit integers in `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubd))]
pub unsafe fn _m_psubd(a: __m64, b: __m64) -> __m64 {
@ -187,7 +187,7 @@ pub unsafe fn _m_psubd(a: __m64, b: __m64) -> __m64 {
/// Subtract packed 8-bit integers in `b` from packed 8-bit integers in `a`
/// using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubsb))]
pub unsafe fn _mm_subs_pi8(a: __m64, b: __m64) -> __m64 {
@ -196,7 +196,7 @@ pub unsafe fn _mm_subs_pi8(a: __m64, b: __m64) -> __m64 {
/// Subtract packed 8-bit integers in `b` from packed 8-bit integers in `a`
/// using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubsb))]
pub unsafe fn _m_psubsb(a: __m64, b: __m64) -> __m64 {
@ -205,7 +205,7 @@ pub unsafe fn _m_psubsb(a: __m64, b: __m64) -> __m64 {
/// Subtract packed 16-bit integers in `b` from packed 16-bit integers in `a`
/// using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubsw))]
pub unsafe fn _mm_subs_pi16(a: __m64, b: __m64) -> __m64 {
@ -214,7 +214,7 @@ pub unsafe fn _mm_subs_pi16(a: __m64, b: __m64) -> __m64 {
/// Subtract packed 16-bit integers in `b` from packed 16-bit integers in `a`
/// using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubsw))]
pub unsafe fn _m_psubsw(a: __m64, b: __m64) -> __m64 {
@ -223,7 +223,7 @@ pub unsafe fn _m_psubsw(a: __m64, b: __m64) -> __m64 {
/// Subtract packed unsigned 8-bit integers in `b` from packed unsigned 8-bit
/// integers in `a` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubusb))]
pub unsafe fn _mm_subs_pu8(a: __m64, b: __m64) -> __m64 {
@ -232,7 +232,7 @@ pub unsafe fn _mm_subs_pu8(a: __m64, b: __m64) -> __m64 {
/// Subtract packed unsigned 8-bit integers in `b` from packed unsigned 8-bit
/// integers in `a` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubusb))]
pub unsafe fn _m_psubusb(a: __m64, b: __m64) -> __m64 {
@ -241,7 +241,7 @@ pub unsafe fn _m_psubusb(a: __m64, b: __m64) -> __m64 {
/// Subtract packed unsigned 16-bit integers in `b` from packed unsigned
/// 16-bit integers in `a` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubusw))]
pub unsafe fn _mm_subs_pu16(a: __m64, b: __m64) -> __m64 {
@ -250,7 +250,7 @@ pub unsafe fn _mm_subs_pu16(a: __m64, b: __m64) -> __m64 {
/// Subtract packed unsigned 16-bit integers in `b` from packed unsigned
/// 16-bit integers in `a` using saturation.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(psubusw))]
pub unsafe fn _m_psubusw(a: __m64, b: __m64) -> __m64 {
@ -262,7 +262,7 @@ pub unsafe fn _m_psubusw(a: __m64, b: __m64) -> __m64 {
///
/// Positive values greater than 0x7F are saturated to 0x7F. Negative values
/// less than 0x80 are saturated to 0x80.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(packsswb))]
pub unsafe fn _mm_packs_pi16(a: __m64, b: __m64) -> __m64 {
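The pack-with-saturation behavior described above clamps each wide lane before narrowing; a scalar sketch of one lane (hypothetical helper):

```rust
/// Scalar model of one `packsswb` lane: clamp a 16-bit value into the
/// signed 8-bit range before narrowing.
fn pack_i16_to_i8(x: i16) -> i8 {
    x.clamp(i8::MIN as i16, i8::MAX as i16) as i8
}
```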
@ -274,7 +274,7 @@ pub unsafe fn _mm_packs_pi16(a: __m64, b: __m64) -> __m64 {
///
/// Positive values greater than 0x7F are saturated to 0x7F. Negative values
/// less than 0x80 are saturated to 0x80.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(packssdw))]
pub unsafe fn _mm_packs_pi32(a: __m64, b: __m64) -> __m64 {
@ -283,7 +283,7 @@ pub unsafe fn _mm_packs_pi32(a: __m64, b: __m64) -> __m64 {
/// Compares whether each element of `a` is greater than the corresponding
/// element of `b` returning `0` for `false` and `-1` for `true`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(pcmpgtb))]
pub unsafe fn _mm_cmpgt_pi8(a: __m64, b: __m64) -> __m64 {
@ -292,7 +292,7 @@ pub unsafe fn _mm_cmpgt_pi8(a: __m64, b: __m64) -> __m64 {
/// Compares whether each element of `a` is greater than the corresponding
/// element of `b` returning `0` for `false` and `-1` for `true`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(pcmpgtw))]
pub unsafe fn _mm_cmpgt_pi16(a: __m64, b: __m64) -> __m64 {
@ -301,7 +301,7 @@ pub unsafe fn _mm_cmpgt_pi16(a: __m64, b: __m64) -> __m64 {
/// Compares whether each element of `a` is greater than the corresponding
/// element of `b` returning `0` for `false` and `-1` for `true`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(pcmpgtd))]
pub unsafe fn _mm_cmpgt_pi32(a: __m64, b: __m64) -> __m64 {
@ -310,7 +310,7 @@ pub unsafe fn _mm_cmpgt_pi32(a: __m64, b: __m64) -> __m64 {
/// Unpacks the upper two elements from two `i16x4` vectors and interleaves
/// them into the result: `[a.2, b.2, a.3, b.3]`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(punpckhwd))] // FIXME punpcklbw expected
pub unsafe fn _mm_unpackhi_pi16(a: __m64, b: __m64) -> __m64 {
@ -319,7 +319,7 @@ pub unsafe fn _mm_unpackhi_pi16(a: __m64, b: __m64) -> __m64 {
/// Unpacks the upper four elements from two `i8x8` vectors and interleaves
/// them into the result: `[a.4, b.4, a.5, b.5, a.6, b.6, a.7, b.7]`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(punpckhbw))]
pub unsafe fn _mm_unpackhi_pi8(a: __m64, b: __m64) -> __m64 {
@ -328,7 +328,7 @@ pub unsafe fn _mm_unpackhi_pi8(a: __m64, b: __m64) -> __m64 {
/// Unpacks the lower four elements from two `i8x8` vectors and interleaves
/// them into the result: `[a.0, b.0, a.1, b.1, a.2, b.2, a.3, b.3]`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(punpcklbw))]
pub unsafe fn _mm_unpacklo_pi8(a: __m64, b: __m64) -> __m64 {
@ -337,7 +337,7 @@ pub unsafe fn _mm_unpacklo_pi8(a: __m64, b: __m64) -> __m64 {
/// Unpacks the lower two elements from two `i16x4` vectors and interleaves
/// them into the result: `[a.0, b.0, a.1, b.1]`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(punpcklwd))]
pub unsafe fn _mm_unpacklo_pi16(a: __m64, b: __m64) -> __m64 {
@ -346,7 +346,7 @@ pub unsafe fn _mm_unpacklo_pi16(a: __m64, b: __m64) -> __m64 {
/// Unpacks the upper element from two `i32x2` vectors and interleaves them
/// into the result: `[a.1, b.1]`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(punpckhdq))]
pub unsafe fn _mm_unpackhi_pi32(a: __m64, b: __m64) -> __m64 {
@ -355,7 +355,7 @@ pub unsafe fn _mm_unpackhi_pi32(a: __m64, b: __m64) -> __m64 {
/// Unpacks the lower element from two `i32x2` vectors and interleaves them
/// into the result: `[a.0, b.0]`.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
#[cfg_attr(test, assert_instr(punpckldq))]
pub unsafe fn _mm_unpacklo_pi32(a: __m64, b: __m64) -> __m64 {
@ -363,21 +363,21 @@ pub unsafe fn _mm_unpacklo_pi32(a: __m64, b: __m64) -> __m64 {
}
/// Set packed 16-bit integers in dst with the supplied values.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_set_pi16(e3: i16, e2: i16, e1: i16, e0: i16) -> __m64 {
_mm_setr_pi16(e0, e1, e2, e3)
}
/// Set packed 32-bit integers in dst with the supplied values.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_set_pi32(e1: i32, e0: i32) -> __m64 {
_mm_setr_pi32(e0, e1)
}
/// Set packed 8-bit integers in dst with the supplied values.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_set_pi8(
e7: i8, e6: i8, e5: i8, e4: i8, e3: i8, e2: i8, e1: i8, e0: i8
@ -386,21 +386,21 @@ pub unsafe fn _mm_set_pi8(
}
/// Broadcast 16-bit integer `a` to all elements of dst.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_set1_pi16(a: i16) -> __m64 {
_mm_setr_pi16(a, a, a, a)
}
/// Broadcast 32-bit integer `a` to all elements of dst.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_set1_pi32(a: i32) -> __m64 {
_mm_setr_pi32(a, a)
}
/// Broadcast 8-bit integer `a` to all elements of dst.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_set1_pi8(a: i8) -> __m64 {
_mm_setr_pi8(a, a, a, a, a, a, a, a)
@ -408,7 +408,7 @@ pub unsafe fn _mm_set1_pi8(a: i8) -> __m64 {
/// Set packed 16-bit integers in dst with the supplied values in reverse
/// order.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_setr_pi16(e0: i16, e1: i16, e2: i16, e3: i16) -> __m64 {
mem::transmute(i16x4::new(e0, e1, e2, e3))
@ -416,14 +416,14 @@ pub unsafe fn _mm_setr_pi16(e0: i16, e1: i16, e2: i16, e3: i16) -> __m64 {
/// Set packed 32-bit integers in dst with the supplied values in reverse
/// order.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_setr_pi32(e0: i32, e1: i32) -> __m64 {
mem::transmute(i32x2::new(e0, e1))
}
/// Set packed 8-bit integers in dst with the supplied values in reverse order.
#[inline(always)]
#[inline]
#[target_feature(enable = "mmx")]
pub unsafe fn _mm_setr_pi8(
e0: i8, e1: i8, e2: i8, e3: i8, e4: i8, e5: i8, e6: i8, e7: i8


@ -10,7 +10,7 @@ use stdsimd_test::assert_instr;
/// Adds two signed or unsigned 64-bit integer values, returning the
/// lower 64 bits of the sum.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
#[cfg_attr(test, assert_instr(paddq))]
pub unsafe fn _mm_add_si64(a: __m64, b: __m64) -> __m64 {
@ -20,7 +20,7 @@ pub unsafe fn _mm_add_si64(a: __m64, b: __m64) -> __m64 {
/// Multiplies 32-bit unsigned integer values contained in the lower bits
/// of the two 64-bit integer vectors and returns the 64-bit unsigned
/// product.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
#[cfg_attr(test, assert_instr(pmuludq))]
pub unsafe fn _mm_mul_su32(a: __m64, b: __m64) -> __m64 {
@ -29,7 +29,7 @@ pub unsafe fn _mm_mul_su32(a: __m64, b: __m64) -> __m64 {
/// Subtracts signed or unsigned 64-bit integer values and writes the
/// difference to the corresponding bits in the destination.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
#[cfg_attr(test, assert_instr(psubq))]
pub unsafe fn _mm_sub_si64(a: __m64, b: __m64) -> __m64 {
@ -39,7 +39,7 @@ pub unsafe fn _mm_sub_si64(a: __m64, b: __m64) -> __m64 {
/// Converts the two signed 32-bit integer elements of a 64-bit vector of
/// [2 x i32] into two double-precision floating-point values, returned in a
/// 128-bit vector of [2 x double].
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
#[cfg_attr(test, assert_instr(cvtpi2pd))]
pub unsafe fn _mm_cvtpi32_pd(a: __m64) -> __m128d {
@ -48,7 +48,7 @@ pub unsafe fn _mm_cvtpi32_pd(a: __m64) -> __m128d {
/// Initializes both 64-bit values in a 128-bit vector of [2 x i64] with
/// the specified 64-bit integer values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
// no particular instruction to test
pub unsafe fn _mm_set_epi64(e1: __m64, e0: __m64) -> __m128i {
@ -57,7 +57,7 @@ pub unsafe fn _mm_set_epi64(e1: __m64, e0: __m64) -> __m128i {
/// Initializes both values in a 128-bit vector of [2 x i64] with the
/// specified 64-bit value.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
// no particular instruction to test
pub unsafe fn _mm_set1_epi64(a: __m64) -> __m128i {
@ -66,7 +66,7 @@ pub unsafe fn _mm_set1_epi64(a: __m64) -> __m128i {
/// Constructs a 128-bit integer vector, initialized in reverse order
/// with the specified 64-bit integral values.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
// no particular instruction to test
pub unsafe fn _mm_setr_epi64(e1: __m64, e0: __m64) -> __m128i {
@ -75,7 +75,7 @@ pub unsafe fn _mm_setr_epi64(e1: __m64, e0: __m64) -> __m128i {
/// Returns the lower 64 bits of a 128-bit integer vector as a 64-bit
/// integer.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
// #[cfg_attr(test, assert_instr(movdq2q))] // FIXME: llvm codegens wrong
// instr?
@ -85,7 +85,7 @@ pub unsafe fn _mm_movepi64_pi64(a: __m128i) -> __m64 {
/// Moves the 64-bit operand to a 128-bit integer vector, zeroing the
/// upper bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
// #[cfg_attr(test, assert_instr(movq2dq))] // FIXME: llvm codegens wrong
// instr?
@ -96,7 +96,7 @@ pub unsafe fn _mm_movpi64_epi64(a: __m64) -> __m128i {
/// Converts the two double-precision floating-point elements of a
/// 128-bit vector of [2 x double] into two signed 32-bit integer values,
/// returned in a 64-bit vector of [2 x i32].
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
#[cfg_attr(test, assert_instr(cvtpd2pi))]
pub unsafe fn _mm_cvtpd_pi32(a: __m128d) -> __m64 {
@ -108,7 +108,7 @@ pub unsafe fn _mm_cvtpd_pi32(a: __m128d) -> __m64 {
/// returned in a 64-bit vector of [2 x i32].
/// If the result of either conversion is inexact, the result is truncated
/// (rounded towards zero) regardless of the current MXCSR setting.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2,mmx")]
#[cfg_attr(test, assert_instr(cvttpd2pi))]
pub unsafe fn _mm_cvttpd_pi32(a: __m128d) -> __m64 {


@ -29,7 +29,7 @@ extern "C" {
///
/// * `1` - if the specified bits are all zeros,
/// * `0` - otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(ptest))]
pub unsafe fn _mm_testz_si128(a: __m128i, mask: __m128i) -> i32 {
@ -49,7 +49,7 @@ pub unsafe fn _mm_testz_si128(a: __m128i, mask: __m128i) -> i32 {
///
/// * `1` - if the specified bits are all ones,
/// * `0` - otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(ptest))]
pub unsafe fn _mm_testc_si128(a: __m128i, mask: __m128i) -> i32 {
@ -69,7 +69,7 @@ pub unsafe fn _mm_testc_si128(a: __m128i, mask: __m128i) -> i32 {
///
/// * `1` - if the specified bits are neither all zeros nor all ones,
/// * `0` - otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(ptest))]
pub unsafe fn _mm_testnzc_si128(a: __m128i, mask: __m128i) -> i32 {
@ -89,7 +89,7 @@ pub unsafe fn _mm_testnzc_si128(a: __m128i, mask: __m128i) -> i32 {
///
/// * `1` - if the specified bits are all zeros,
/// * `0` - otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(ptest))]
pub unsafe fn _mm_test_all_zeros(a: __m128i, mask: __m128i) -> i32 {
@ -107,7 +107,7 @@ pub unsafe fn _mm_test_all_zeros(a: __m128i, mask: __m128i) -> i32 {
///
/// * `1` - if the bits specified in the operand are all set to 1,
/// * `0` - otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pcmpeqd))]
#[cfg_attr(test, assert_instr(ptest))]
@ -128,7 +128,7 @@ pub unsafe fn _mm_test_all_ones(a: __m128i) -> i32 {
///
/// * `1` - if the specified bits are neither all zeros nor all ones,
/// * `0` - otherwise.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(ptest))]
pub unsafe fn _mm_test_mix_ones_zeros(a: __m128i, mask: __m128i) -> i32 {
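The `ptest`-based intrinsics above all reduce to the same two flag computations. As a rough portable sketch of the semantics the docs describe (`u128` stands in for `__m128i`; the function names here are illustrative, not real APIs):

```rust
// Software model of the `ptest` flag semantics:
//   ZF = ((a & mask) == 0), CF = ((!a & mask) == 0).
fn testz(a: u128, mask: u128) -> i32 {
    (a & mask == 0) as i32 // returns ZF
}

fn testc(a: u128, mask: u128) -> i32 {
    (!a & mask == 0) as i32 // returns CF
}

fn testnzc(a: u128, mask: u128) -> i32 {
    // 1 iff the masked bits are neither all zeros nor all ones.
    ((a & mask != 0) && (!a & mask != 0)) as i32
}
```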

View file

@ -9,7 +9,7 @@ use stdsimd_test::assert_instr;
/// Compare packed 64-bit integers in `a` and `b` for greater-than,
/// and return the results.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(pcmpgtq))]
pub unsafe fn _mm_cmpgt_epi64(a: __m128i, b: __m128i) -> __m128i {

View file

@ -33,7 +33,7 @@ extern "C" {
///
/// If `length == 0 && index > 0` or `length + index > 64` the result is
/// undefined.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4a")]
#[cfg_attr(test, assert_instr(extrq))]
pub unsafe fn _mm_extract_si64(x: __m128i, y: __m128i) -> __m128i {
@ -49,7 +49,7 @@ pub unsafe fn _mm_extract_si64(x: __m128i, y: __m128i) -> __m128i {
///
/// If `length` is zero, it is interpreted as `64`. If `index + length > 64`
/// or `index > 0 && length == 0` the result is undefined.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4a")]
#[cfg_attr(test, assert_instr(insertq))]
pub unsafe fn _mm_insert_si64(x: __m128i, y: __m128i) -> __m128i {
@ -57,7 +57,7 @@ pub unsafe fn _mm_insert_si64(x: __m128i, y: __m128i) -> __m128i {
}
/// Non-temporal store of `a.0` into `p`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4a")]
#[cfg_attr(test, assert_instr(movntsd))]
pub unsafe fn _mm_stream_sd(p: *mut f64, a: __m128d) {
@ -65,7 +65,7 @@ pub unsafe fn _mm_stream_sd(p: *mut f64, a: __m128d) {
}
/// Non-temporal store of `a.0` into `p`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4a")]
#[cfg_attr(test, assert_instr(movntss))]
pub unsafe fn _mm_stream_ss(p: *mut f32, a: __m128) {
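The `extrq` extraction above is, in the defined cases, an ordinary bit-field read. A sketch of that operation in portable Rust (`extract_bits` is a hypothetical helper, handling only `index + length <= 64`):

```rust
// Illustrative model of the bit-field extraction `extrq` performs:
// take `length` bits of `x` starting at `index`, returned in the low
// bits of the result.
fn extract_bits(x: u64, index: u32, length: u32) -> u64 {
    let shifted = x >> index;
    if length >= 64 {
        shifted // full-width extraction: no mask needed
    } else {
        shifted & ((1u64 << length) - 1)
    }
}
```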

View file

@ -7,7 +7,7 @@ use x86::*;
/// Compute the absolute value of packed 8-bit integers in `a` and
/// return the unsigned results.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(pabsb))]
pub unsafe fn _mm_abs_pi8(a: __m64) -> __m64 {
@ -16,7 +16,7 @@ pub unsafe fn _mm_abs_pi8(a: __m64) -> __m64 {
/// Compute the absolute value of packed 8-bit integers in `a`, and return the
/// unsigned results.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(pabsw))]
pub unsafe fn _mm_abs_pi16(a: __m64) -> __m64 {
@ -25,7 +25,7 @@ pub unsafe fn _mm_abs_pi16(a: __m64) -> __m64 {
/// Compute the absolute value of packed 32-bit integers in `a`, and return the
/// unsigned results.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(pabsd))]
pub unsafe fn _mm_abs_pi32(a: __m64) -> __m64 {
@ -34,7 +34,7 @@ pub unsafe fn _mm_abs_pi32(a: __m64) -> __m64 {
/// Shuffle packed 8-bit integers in `a` according to shuffle control mask in
/// the corresponding 8-bit element of `b`, and return the results
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(pshufb))]
pub unsafe fn _mm_shuffle_pi8(a: __m64, b: __m64) -> __m64 {
@ -43,7 +43,7 @@ pub unsafe fn _mm_shuffle_pi8(a: __m64, b: __m64) -> __m64 {
/// Concatenates the two 64-bit integer vector operands, and right-shifts
/// the result by the number of bytes specified in the immediate operand.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(palignr, n = 15))]
pub unsafe fn _mm_alignr_pi8(a: __m64, b: __m64, n: i32) -> __m64 {
@ -57,7 +57,7 @@ pub unsafe fn _mm_alignr_pi8(a: __m64, b: __m64, n: i32) -> __m64 {
/// Horizontally add the adjacent pairs of values contained in 2 packed
/// 64-bit vectors of [4 x i16].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(phaddw))]
pub unsafe fn _mm_hadd_pi16(a: __m64, b: __m64) -> __m64 {
@ -66,7 +66,7 @@ pub unsafe fn _mm_hadd_pi16(a: __m64, b: __m64) -> __m64 {
/// Horizontally add the adjacent pairs of values contained in 2 packed
/// 64-bit vectors of [2 x i32].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(phaddd))]
pub unsafe fn _mm_hadd_pi32(a: __m64, b: __m64) -> __m64 {
@ -76,7 +76,7 @@ pub unsafe fn _mm_hadd_pi32(a: __m64, b: __m64) -> __m64 {
/// Horizontally add the adjacent pairs of values contained in 2 packed
/// 64-bit vectors of [4 x i16]. Positive sums greater than 7FFFh are
/// saturated to 7FFFh. Negative sums less than 8000h are saturated to 8000h.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(phaddsw))]
pub unsafe fn _mm_hadds_pi16(a: __m64, b: __m64) -> __m64 {
@ -85,7 +85,7 @@ pub unsafe fn _mm_hadds_pi16(a: __m64, b: __m64) -> __m64 {
/// Horizontally subtracts the adjacent pairs of values contained in 2
/// packed 64-bit vectors of [4 x i16].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(phsubw))]
pub unsafe fn _mm_hsub_pi16(a: __m64, b: __m64) -> __m64 {
@ -94,7 +94,7 @@ pub unsafe fn _mm_hsub_pi16(a: __m64, b: __m64) -> __m64 {
/// Horizontally subtracts the adjacent pairs of values contained in 2
/// packed 64-bit vectors of [2 x i32].
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(phsubd))]
pub unsafe fn _mm_hsub_pi32(a: __m64, b: __m64) -> __m64 {
@ -105,7 +105,7 @@ pub unsafe fn _mm_hsub_pi32(a: __m64, b: __m64) -> __m64 {
/// packed 64-bit vectors of [4 x i16]. Positive differences greater than
/// 7FFFh are saturated to 7FFFh. Negative differences less than 8000h are
/// saturated to 8000h.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(phsubsw))]
pub unsafe fn _mm_hsubs_pi16(a: __m64, b: __m64) -> __m64 {
@ -117,7 +117,7 @@ pub unsafe fn _mm_hsubs_pi16(a: __m64, b: __m64) -> __m64 {
/// integer values contained in the second source operand, adds pairs of
/// contiguous products with signed saturation, and writes the 16-bit sums to
/// the corresponding bits in the destination.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(pmaddubsw))]
pub unsafe fn _mm_maddubs_pi16(a: __m64, b: __m64) -> __m64 {
@ -127,7 +127,7 @@ pub unsafe fn _mm_maddubs_pi16(a: __m64, b: __m64) -> __m64 {
/// Multiplies packed 16-bit signed integer values, truncates the 32-bit
/// products to the 18 most significant bits by right-shifting, rounds the
/// truncated value by adding 1, and writes bits [16:1] to the destination.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(pmulhrsw))]
pub unsafe fn _mm_mulhrs_pi16(a: __m64, b: __m64) -> __m64 {
@ -138,7 +138,7 @@ pub unsafe fn _mm_mulhrs_pi16(a: __m64, b: __m64) -> __m64 {
/// integer in `b` is negative, and return the results.
/// Elements in the result are zeroed out when the corresponding element in `b` is
/// zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(psignb))]
pub unsafe fn _mm_sign_pi8(a: __m64, b: __m64) -> __m64 {
@ -149,7 +149,7 @@ pub unsafe fn _mm_sign_pi8(a: __m64, b: __m64) -> __m64 {
/// integer in `b` is negative, and return the results.
/// Elements in the result are zeroed out when the corresponding element in `b` is
/// zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(psignw))]
pub unsafe fn _mm_sign_pi16(a: __m64, b: __m64) -> __m64 {
@ -160,7 +160,7 @@ pub unsafe fn _mm_sign_pi16(a: __m64, b: __m64) -> __m64 {
/// integer in `b` is negative, and return the results.
/// Elements in the result are zeroed out when the corresponding element in `b` is
/// zero.
#[inline(always)]
#[inline]
#[target_feature(enable = "ssse3,mmx")]
#[cfg_attr(test, assert_instr(psignd))]
pub unsafe fn _mm_sign_pi32(a: __m64, b: __m64) -> __m64 {
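The `psign` family applies the same copy/negate/zero rule to every lane. A scalar sketch of the per-lane rule the docs above describe (the helper names are illustrative):

```rust
// Per-lane rule of `psign`: copy `a`, negate it, or zero it depending
// on the sign of the corresponding element of `b`.
fn sign_lane(a: i8, b: i8) -> i8 {
    if b < 0 {
        a.wrapping_neg()
    } else if b == 0 {
        0
    } else {
        a
    }
}

// Applied lane-wise, as `_mm_sign_pi8` does over a `__m64`.
fn sign_pi8(a: [i8; 8], b: [i8; 8]) -> [i8; 8] {
    let mut r = [0i8; 8];
    for i in 0..8 {
        r[i] = sign_lane(a[i], b[i]);
    }
    r
}
```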

View file

@ -17,7 +17,7 @@ macro_rules! types {
pub struct $name($($fields)*);
impl Clone for $name {
#[inline(always)] // currently needed for correctness
#[inline] // currently needed for correctness
fn clone(&self) -> $name {
*self
}
@ -307,49 +307,49 @@ pub use self::test::*;
trait m128iExt: Sized {
fn as_m128i(self) -> __m128i;
#[inline(always)]
#[inline]
fn as_u8x16(self) -> ::v128::u8x16 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_u16x8(self) -> ::v128::u16x8 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_u32x4(self) -> ::v128::u32x4 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_u64x2(self) -> ::v128::u64x2 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_i8x16(self) -> ::v128::i8x16 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_i16x8(self) -> ::v128::i16x8 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_i32x4(self) -> ::v128::i32x4 {
unsafe { mem::transmute(self.as_m128i()) }
}
#[inline(always)]
#[inline]
fn as_i64x2(self) -> ::v128::i64x2 {
unsafe { mem::transmute(self.as_m128i()) }
}
}
impl m128iExt for __m128i {
#[inline(always)]
#[inline]
fn as_m128i(self) -> __m128i { self }
}
@ -358,49 +358,49 @@ impl m128iExt for __m128i {
trait m256iExt: Sized {
fn as_m256i(self) -> __m256i;
#[inline(always)]
#[inline]
fn as_u8x32(self) -> ::v256::u8x32 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_u16x16(self) -> ::v256::u16x16 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_u32x8(self) -> ::v256::u32x8 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_u64x4(self) -> ::v256::u64x4 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_i8x32(self) -> ::v256::i8x32 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_i16x16(self) -> ::v256::i16x16 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_i32x8(self) -> ::v256::i32x8 {
unsafe { mem::transmute(self.as_m256i()) }
}
#[inline(always)]
#[inline]
fn as_i64x4(self) -> ::v256::i64x4 {
unsafe { mem::transmute(self.as_m256i()) }
}
}
impl m256iExt for __m256i {
#[inline(always)]
#[inline]
fn as_m256i(self) -> __m256i { self }
}

View file

@ -4,7 +4,7 @@ use stdsimd_test::assert_instr;
/// Counts the leading most significant zero bits.
///
/// When the operand is zero, it returns its size in bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "lzcnt")]
#[cfg_attr(test, assert_instr(lzcnt))]
pub unsafe fn _lzcnt_u64(x: u64) -> u64 {
@ -12,7 +12,7 @@ pub unsafe fn _lzcnt_u64(x: u64) -> u64 {
}
/// Counts the bits that are set.
#[inline(always)]
#[inline]
#[target_feature(enable = "popcnt")]
#[cfg_attr(test, assert_instr(popcnt))]
pub unsafe fn _popcnt64(x: i64) -> i32 {
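These two intrinsics have direct portable counterparts in Rust's integer methods, which lower to the same instructions when the target features are enabled. Note that `leading_zeros` also returns the operand width for a zero input, matching the `lzcnt` behavior documented above:

```rust
// Portable equivalents of `_lzcnt_u64` and `_popcnt64`.
fn lzcnt_u64(x: u64) -> u64 {
    x.leading_zeros() as u64 // 64 when x == 0
}

fn popcnt64(x: i64) -> i32 {
    x.count_ones() as i32
}
```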

View file

@ -5,7 +5,7 @@ use x86::*;
/// Copy `a` to result, and insert the 64-bit integer `i` into result
/// at the location specified by `index`.
#[inline(always)]
#[inline]
#[target_feature(enable = "avx")]
// This intrinsic has no corresponding instruction.
pub unsafe fn _mm256_insert_epi64(a: __m256i, i: i64, index: i32) -> __m256i {

View file

@ -2,7 +2,7 @@ use simd_llvm::*;
use x86::*;
/// Extract a 64-bit integer from `a`, selected with `imm8`.
#[inline(always)]
#[inline]
#[target_feature(enable = "avx2")]
// This intrinsic has no corresponding instruction.
pub unsafe fn _mm256_extract_epi64(a: __m256i, imm8: i32) -> i64 {

View file

@ -3,7 +3,7 @@ use stdsimd_test::assert_instr;
/// Extracts bits in range [`start`, `start` + `length`) from `a` into
/// the least significant bits of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(bextr))]
#[cfg(not(target_arch = "x86"))]
@ -16,7 +16,7 @@ pub unsafe fn _bextr_u64(a: u64, start: u32, len: u32) -> u64 {
///
/// Bits [7,0] of `control` specify the index to the first bit in the range to
/// be extracted, and bits [15,8] specify the length of the range.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(bextr))]
#[cfg(not(target_arch = "x86"))]
@ -25,7 +25,7 @@ pub unsafe fn _bextr2_u64(a: u64, control: u64) -> u64 {
}
/// Bitwise logical `AND` of inverted `a` with `b`.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(andn))]
pub unsafe fn _andn_u64(a: u64, b: u64) -> u64 {
@ -33,7 +33,7 @@ pub unsafe fn _andn_u64(a: u64, b: u64) -> u64 {
}
/// Extract lowest set isolated bit.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(blsi))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -42,7 +42,7 @@ pub unsafe fn _blsi_u64(x: u64) -> u64 {
}
/// Get mask up to lowest set bit.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(blsmsk))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -53,7 +53,7 @@ pub unsafe fn _blsmsk_u64(x: u64) -> u64 {
/// Resets the lowest set bit of `x`.
///
/// If `x` is zero, sets CF.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(blsr))]
#[cfg(not(target_arch = "x86"))] // generates lots of instructions
@ -64,7 +64,7 @@ pub unsafe fn _blsr_u64(x: u64) -> u64 {
/// Counts the number of trailing least significant zero bits.
///
/// When the source operand is 0, it returns its size in bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(tzcnt))]
pub unsafe fn _tzcnt_u64(x: u64) -> u64 {
@ -74,7 +74,7 @@ pub unsafe fn _tzcnt_u64(x: u64) -> u64 {
/// Counts the number of trailing least significant zero bits.
///
/// When the source operand is 0, it returns its size in bits.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi")]
#[cfg_attr(test, assert_instr(tzcnt))]
pub unsafe fn _mm_tzcnt_64(x: u64) -> i64 {
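The BMI1 operations above correspond to well-known bit-twiddling identities, sketched here in portable Rust for reference:

```rust
// Portable identities behind the BMI1 intrinsics.
fn andn(a: u64, b: u64) -> u64 {
    !a & b
}

fn blsi(x: u64) -> u64 {
    x & x.wrapping_neg() // isolate lowest set bit
}

fn blsmsk(x: u64) -> u64 {
    x ^ x.wrapping_sub(1) // mask up to and including lowest set bit
}

fn blsr(x: u64) -> u64 {
    x & x.wrapping_sub(1) // clear lowest set bit
}

fn tzcnt(x: u64) -> u64 {
    x.trailing_zeros() as u64 // 64 when x == 0
}
```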

View file

@ -5,7 +5,7 @@ use stdsimd_test::assert_instr;
///
/// Unsigned multiplication of `a` with `b` returning a pair `(lo, hi)` with
/// the low half and the high half of the result.
#[inline(always)]
#[inline]
#[cfg_attr(test, assert_instr(mulx))]
#[target_feature(enable = "bmi2")]
#[cfg(not(target_arch = "x86"))] // calls an intrinsic
@ -16,7 +16,7 @@ pub unsafe fn _mulx_u64(a: u64, b: u64, hi: &mut u64) -> u64 {
}
/// Zero the high bits of `a` from bit `index` onward.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi2")]
#[cfg_attr(test, assert_instr(bzhi))]
#[cfg(not(target_arch = "x86"))]
@ -26,7 +26,7 @@ pub unsafe fn _bzhi_u64(a: u64, index: u32) -> u64 {
/// Scatter contiguous low order bits of `a` to the result at the positions
/// specified by the `mask`.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi2")]
#[cfg_attr(test, assert_instr(pdep))]
#[cfg(not(target_arch = "x86"))]
@ -36,7 +36,7 @@ pub unsafe fn _pdep_u64(a: u64, mask: u64) -> u64 {
/// Gathers the bits of `x` specified by the `mask` into the contiguous low
/// order bit positions of the result.
#[inline(always)]
#[inline]
#[target_feature(enable = "bmi2")]
#[cfg_attr(test, assert_instr(pext))]
#[cfg(not(target_arch = "x86"))]
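The scatter/gather the `pdep`/`pext` docs describe can be written as a bit-by-bit reference model (illustrative only; the hardware instructions compute the same result directly):

```rust
// Reference model of `pdep`: scatter the low bits of `a` to the
// positions of the set bits in `mask`.
fn pdep(a: u64, mask: u64) -> u64 {
    let (mut result, mut src_bit) = (0u64, 0u32);
    for i in 0..64 {
        if mask >> i & 1 == 1 {
            result |= (a >> src_bit & 1) << i;
            src_bit += 1;
        }
    }
    result
}

// Reference model of `pext`: gather the bits of `a` selected by `mask`
// into the contiguous low bits of the result.
fn pext(a: u64, mask: u64) -> u64 {
    let (mut result, mut dst_bit) = (0u64, 0u32);
    for i in 0..64 {
        if mask >> i & 1 == 1 {
            result |= (a >> i & 1) << dst_bit;
            dst_bit += 1;
        }
    }
    result
}
```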

View file

@ -21,7 +21,7 @@ extern "C" {
///
/// [fxsave]: http://www.felixcloutier.com/x86/FXSAVE.html
/// [fxrstor]: http://www.felixcloutier.com/x86/FXRSTOR.html
#[inline(always)]
#[inline]
#[target_feature(enable = "fxsr")]
#[cfg_attr(test, assert_instr(fxsave64))]
pub unsafe fn _fxsave64(mem_addr: *mut u8) {
@ -42,7 +42,7 @@ pub unsafe fn _fxsave64(mem_addr: *mut u8) {
///
/// [fxsave]: http://www.felixcloutier.com/x86/FXSAVE.html
/// [fxrstor]: http://www.felixcloutier.com/x86/FXRSTOR.html
#[inline(always)]
#[inline]
#[target_feature(enable = "fxsr")]
#[cfg_attr(test, assert_instr(fxrstor64))]
pub unsafe fn _fxrstor64(mem_addr: *const u8) {

View file

@ -24,7 +24,7 @@ extern "C" {
/// [`_mm_setcsr`](fn._mm_setcsr.html)).
///
/// This corresponds to the `CVTSS2SI` instruction (with 64 bit output).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvtss2si))]
pub unsafe fn _mm_cvtss_si64(a: __m128) -> i64 {
@ -40,7 +40,7 @@ pub unsafe fn _mm_cvtss_si64(a: __m128) -> i64 {
/// point exception if unmasked (see [`_mm_setcsr`](fn._mm_setcsr.html)).
///
/// This corresponds to the `CVTTSS2SI` instruction (with 64 bit output).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvttss2si))]
pub unsafe fn _mm_cvttss_si64(a: __m128) -> i64 {
@ -52,7 +52,7 @@ pub unsafe fn _mm_cvttss_si64(a: __m128) -> i64 {
///
/// This intrinsic corresponds to the `CVTSI2SS` instruction (with 64 bit
/// input).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse")]
#[cfg_attr(test, assert_instr(cvtsi2ss))]
pub unsafe fn _mm_cvtsi64_ss(a: __m128, b: i64) -> __m128 {

View file

@ -16,7 +16,7 @@ extern "C" {
/// Convert the lower double-precision (64-bit) floating-point element in a to
/// a 64-bit integer.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(cvtsd2si))]
pub unsafe fn _mm_cvtsd_si64(a: __m128d) -> i64 {
@ -24,7 +24,7 @@ pub unsafe fn _mm_cvtsd_si64(a: __m128d) -> i64 {
}
/// Alias for [`_mm_cvtsd_si64`](fn._mm_cvtsd_si64.html).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(cvtsd2si))]
pub unsafe fn _mm_cvtsd_si64x(a: __m128d) -> i64 {
@ -33,7 +33,7 @@ pub unsafe fn _mm_cvtsd_si64x(a: __m128d) -> i64 {
/// Convert the lower double-precision (64-bit) floating-point element in `a`
/// to a 64-bit integer with truncation.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(cvttsd2si))]
pub unsafe fn _mm_cvttsd_si64(a: __m128d) -> i64 {
@ -41,7 +41,7 @@ pub unsafe fn _mm_cvttsd_si64(a: __m128d) -> i64 {
}
/// Alias for [`_mm_cvttsd_si64`](fn._mm_cvttsd_si64.html).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(cvttsd2si))]
pub unsafe fn _mm_cvttsd_si64x(a: __m128d) -> i64 {
@ -51,7 +51,7 @@ pub unsafe fn _mm_cvttsd_si64x(a: __m128d) -> i64 {
/// Stores a 64-bit integer value in the specified memory location.
/// To minimize caching, the data is flagged as non-temporal (unlikely to be
/// used again soon).
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(movnti))]
pub unsafe fn _mm_stream_si64(mem_addr: *mut i64, a: i64) {
@ -60,7 +60,7 @@ pub unsafe fn _mm_stream_si64(mem_addr: *mut i64, a: i64) {
/// Return a vector whose lowest element is `a` and all higher elements are
/// `0`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(all(test, not(windows)), assert_instr(movq))]
pub unsafe fn _mm_cvtsi64_si128(a: i64) -> __m128i {
@ -69,7 +69,7 @@ pub unsafe fn _mm_cvtsi64_si128(a: i64) -> __m128i {
/// Return a vector whose lowest element is `a` and all higher elements are
/// `0`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(all(test, not(windows)), assert_instr(movq))]
pub unsafe fn _mm_cvtsi64x_si128(a: i64) -> __m128i {
@ -77,7 +77,7 @@ pub unsafe fn _mm_cvtsi64x_si128(a: i64) -> __m128i {
}
/// Return the lowest element of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(all(test, not(windows)), assert_instr(movq))]
pub unsafe fn _mm_cvtsi128_si64(a: __m128i) -> i64 {
@ -85,7 +85,7 @@ pub unsafe fn _mm_cvtsi128_si64(a: __m128i) -> i64 {
}
/// Return the lowest element of `a`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(all(test, not(windows)), assert_instr(movq))]
pub unsafe fn _mm_cvtsi128_si64x(a: __m128i) -> i64 {
@ -94,7 +94,7 @@ pub unsafe fn _mm_cvtsi128_si64x(a: __m128i) -> i64 {
/// Return `a` with its lower element replaced by `b` after converting it to
/// an `f64`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(cvtsi2sd))]
pub unsafe fn _mm_cvtsi64_sd(a: __m128d, b: i64) -> __m128d {
@ -103,7 +103,7 @@ pub unsafe fn _mm_cvtsi64_sd(a: __m128d, b: i64) -> __m128d {
/// Return `a` with its lower element replaced by `b` after converting it to
/// an `f64`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(cvtsi2sd))]
pub unsafe fn _mm_cvtsi64x_sd(a: __m128d, b: i64) -> __m128d {

View file

@ -9,7 +9,7 @@ use simd_llvm::*;
use stdsimd_test::assert_instr;
/// Extract a 64-bit integer from `a`, selected with `imm8`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
// TODO: Add test for Windows
#[cfg_attr(all(test, not(windows)), assert_instr(pextrq, imm8 = 1))]
@ -20,7 +20,7 @@ pub unsafe fn _mm_extract_epi64(a: __m128i, imm8: i32) -> i64 {
/// Return a copy of `a` with the 64-bit integer from `i` inserted at a
/// location specified by `imm8`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.1")]
#[cfg_attr(test, assert_instr(pinsrq, imm8 = 0))]
pub unsafe fn _mm_insert_epi64(a: __m128i, i: i64, imm8: i32) -> __m128i {

View file

@ -11,7 +11,7 @@ extern "C" {
/// Starting with the initial value in `crc`, return the accumulated
/// CRC32 value for unsigned 64-bit integer `v`.
#[inline(always)]
#[inline]
#[target_feature(enable = "sse4.2")]
#[cfg_attr(test, assert_instr(crc32))]
pub unsafe fn _mm_crc32_u64(crc: u64, v: u64) -> u64 {

View file

@ -29,7 +29,7 @@ extern "C" {
///
/// The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of
/// the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave")]
#[cfg_attr(test, assert_instr(xsave64))]
pub unsafe fn _xsave64(mem_addr: *mut u8, save_mask: u64) {
@ -42,7 +42,7 @@ pub unsafe fn _xsave64(mem_addr: *mut u8, save_mask: u64) {
/// State is restored based on bits [62:0] in `rs_mask`, `XCR0`, and
/// `mem_addr.HEADER.XSTATE_BV`. `mem_addr` must be aligned on a 64-byte
/// boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave")]
#[cfg_attr(test, assert_instr(xrstor64))]
pub unsafe fn _xrstor64(mem_addr: *const u8, rs_mask: u64) {
@ -56,7 +56,7 @@ pub unsafe fn _xrstor64(mem_addr: *const u8, rs_mask: u64) {
/// `mem_addr` must be aligned on a 64-byte boundary. The hardware may optimize
/// the manner in which data is saved. The performance of this instruction will
/// be equal to or better than using the `XSAVE64` instruction.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsaveopt")]
#[cfg_attr(test, assert_instr(xsaveopt64))]
pub unsafe fn _xsaveopt64(mem_addr: *mut u8, save_mask: u64) {
@ -69,7 +69,7 @@ pub unsafe fn _xsaveopt64(mem_addr: *mut u8, save_mask: u64) {
/// `xsavec` differs from `xsave` in that it uses compaction and that it may
/// use init optimization. State is saved based on bits [62:0] in `save_mask`
/// and `XCR0`. `mem_addr` must be aligned on a 64-byte boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsavec")]
#[cfg_attr(test, assert_instr(xsavec64))]
pub unsafe fn _xsavec64(mem_addr: *mut u8, save_mask: u64) {
@ -83,7 +83,7 @@ pub unsafe fn _xsavec64(mem_addr: *mut u8, save_mask: u64) {
/// corresponding to bits set in `IA32_XSS` `MSR` and that it may use the
/// modified optimization. State is saved based on bits [62:0] in `save_mask`
/// and `XCR0`. `mem_addr` must be aligned on a 64-byte boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsaves")]
#[cfg_attr(test, assert_instr(xsaves64))]
pub unsafe fn _xsaves64(mem_addr: *mut u8, save_mask: u64) {
@ -99,7 +99,7 @@ pub unsafe fn _xsaves64(mem_addr: *mut u8, save_mask: u64) {
/// State is restored based on bits [62:0] in `rs_mask`, `XCR0`, and
/// `mem_addr.HEADER.XSTATE_BV`. `mem_addr` must be aligned on a 64-byte
/// boundary.
#[inline(always)]
#[inline]
#[target_feature(enable = "xsave,xsaves")]
#[cfg_attr(test, assert_instr(xrstors64))]
pub unsafe fn _xrstors64(mem_addr: *const u8, rs_mask: u64) {