The innermost loop of TwoWaySearcher checks the boundary of the haystack
vs position + needle.len(), and it checks the last byte of the needle
against the byteset.
If these two steps are combined by using the indexing of the last
needle byte's position as bounds check, the algorithm improves its
throughput. We improve the innermost loop by reducing the number of
instructions used, and elminating the panic case for the checked
indexing that was previously used.
Selected benchmarks from the external/workspace testsuite. Benchmarks
improve across the board.
```
before:
test bb_in_aa::twoway_find ... bench: 4,229 ns/iter (+/- 1,305) = 23646 MB/s
test bb_in_aa::twoway_rfind ... bench: 3,873 ns/iter (+/- 101) = 25819 MB/s
test short_1let_long::twoway_find ... bench: 7,075 ns/iter (+/- 29) = 360 MB/s
test short_1let_long::twoway_rfind ... bench: 6,640 ns/iter (+/- 79) = 384 MB/s
test short_2let_long::twoway_find ... bench: 3,823 ns/iter (+/- 16) = 667 MB/s
test short_2let_long::twoway_rfind ... bench: 3,774 ns/iter (+/- 44) = 675 MB/s
test short_3let_long::twoway_find ... bench: 3,582 ns/iter (+/- 47) = 712 MB/s
test short_3let_long::twoway_rfind ... bench: 3,616 ns/iter (+/- 34) = 705 MB/s
with this commit:
test bb_in_aa::twoway_find ... bench: 2,952 ns/iter (+/- 20) = 33875 MB/s
test bb_in_aa::twoway_rfind ... bench: 2,939 ns/iter (+/- 99) = 34025 MB/s
test short_1let_long::twoway_find ... bench: 4,593 ns/iter (+/- 4) = 555 MB/s
test short_1let_long::twoway_rfind ... bench: 4,592 ns/iter (+/- 76) = 555 MB/s
test short_2let_long::twoway_find ... bench: 2,804 ns/iter (+/- 3) = 909 MB/s
test short_2let_long::twoway_rfind ... bench: 2,807 ns/iter (+/- 40) = 908 MB/s
test short_3let_long::twoway_find ... bench: 3,105 ns/iter (+/- 120) = 821 MB/s
test short_3let_long::twoway_rfind ... bench: 3,019 ns/iter (+/- 50) = 844 MB/s
```
- `bb_in_aa`: fast skip due to byteset filter loop improves.
- 1/2/3let: Searches for 1, 2, or 3 ascii bytes improves.
Fix quadratic behavior in StrSearcher in reverse search with periodic
needles.
This commit adds the missing pieces for the "short period" case in
reverse search. The short case will show up when the needle is literally
periodic, for example "abababab".
Two way uses a "critical factorization" of the needle: x = u v.
Searching matches v first, if mismatch at character k, skip k forward.
Matching u, if mismatch, skip period(x) forward.
To avoid O(mn) behavior after mismatch in u, memorize the already
matched prefix.
The short period case requires that |u| < period(x).
For the reverse search we need to compute a different critical
factorization x = u' v' where |v'| < period(x), because we are searching
for the reversed needle. A short v' also benefits the algorithm in
general.
The reverse critical factorization is computed quickly by using the same
maximal suffix algorithm, but terminating as soon as we have a location
with local period equal to period(x).
This adds extra fields crit_pos_back and memory_back for the reverse
case. The new overhead for TwoWaySearcher::new is low, and additionally
I think the "short period" case is uncommon in many applications of
string search.
The maximal_suffix methods were updated in documentation and the
algorithms updated to not use !0 and wrapping add, variable left is now
1 larger, offset 1 smaller.
Use periodicity when computing byteset: in the periodic case, just
iterate over one period instead of the whole needle.
Example before (rfind) after (twoway_rfind) benchmark shows the removal
of quadratic behavior.
needle: "ab" * 100, haystack: ("bb" + "ab" * 100) * 100
```
test periodic::rfind ... bench: 1,926,595 ns/iter (+/- 11,390) = 10 MB/s
test periodic::twoway_rfind ... bench: 51,740 ns/iter (+/- 66) = 386 MB/s
```
There are still problems in both the design and implementation of this, so we don't want it landing in 1.2.
cc @arielb1 @nikomatsakis
cc #27364
r? @alexcrichton
The following APIs were all marked with a `#[stable]` tag:
* process::Child::id
* error::Error::is
* error::Error::downcast
* error::Error::downcast_ref
* error::Error::downcast_mut
* io::Error::get_ref
* io::Error::get_mut
* io::Error::into_inner
* hash::Hash::hash_slice
* hash::Hasher::write_{i,u}{8,16,32,64,size}
This isn't actually necessary any more with the advent of `$crate` and changes
in the compiler to expand macros to `::core::$foo` in the context of a
`#![no_std]` crate.
The libcore inner module was also trimmed down a bit to the bare bones.
I think this was just missed when `Send` and `Sync` were redone, since it seems odd to not be able to use things like `Arc<AtomicPtr>`. If it was intentional feel free to just close this.
I used another test as a template for writing mine, so I hope I got all the headers and stuff right.
This isn't actually necessary any more with the advent of `$crate` and changes
in the compiler to expand macros to `::core::$foo` in the context of a
`#![no_std]` crate.
The libcore inner module was also trimmed down a bit to the bare bones.
As described in the module documentation, the memory orderings in Rust
are the same with that of LLVM. However, the documentation for the
memory orderings enum says the memory orderings are the same of that of
C++. Note that they differ in that C++'s support the consume reads,
while LLVM's does not. Hence this commit fixes the bug in the
documentation for the enum.
As described in the module documentation, the memory orderings in Rust
are the same with that of LLVM. However, the documentation for the
memory orderings enum says the memory orderings are the same of that of
C++. Note that they differ in that C++'s support the consume reads,
while LLVM's does not. Hence this commit fixes the bug in the
documentation for the enum.
The following APIs were all marked with a `#[stable]` tag:
* process::Child::id
* error::Error::is
* error::Error::downcast
* error::Error::downcast_ref
* error::Error::downcast_mut
* io::Error::get_ref
* io::Error::get_mut
* io::Error::into_inner
* hash::Hash::hash_slice
* hash::Hasher::write_{i,u}{8,16,32,64,size}
I had to modify some tests : since `wtf8buf_show` and `wtf8_show` were doing the exact same thing, I repurposed `wtf8_show` to `wtf8buf_show_str` which ensures `Wtf8Buf` `Debug`-formats the same as `str`.
`write_str_escaped` might also be shared amongst other `fmt` but I just left it there within `Wtf8::fmt` for review.
Improve siphash performance for longer data
Use `ptr::copy_nonoverlapping` (aka memcpy) to load an u64 from the
byte stream. This is correct for any alignment, and the compiler will
use the appropriate instruction to load the data.
Also contains small tweaks that should benefit hashing short data too,
both the commit that removes a variable and the autovectorization of
the hash state initialization (in SipHash::reset).
Benchmarks show that hashing longer data benefits for the improved word loading.
Before (using benchmarks from the first commit in the PR):
The before benchmark is a bit noisy.
```
test hash::sip::bench_bytes_4 ... bench: 41 ns/iter (+/- 0) = 97 MB/s
test hash::sip::bench_bytes_7 ... bench: 49 ns/iter (+/- 2) = 142 MB/s
test hash::sip::bench_bytes_8 ... bench: 42 ns/iter (+/- 4) = 190 MB/s
test hash::sip::bench_bytes_a_16 ... bench: 57 ns/iter (+/- 14) = 280 MB/s
test hash::sip::bench_bytes_b_32 ... bench: 85 ns/iter (+/- 74) = 376 MB/s
test hash::sip::bench_bytes_c_128 ... bench: 278 ns/iter (+/- 33) = 460 MB/s
test hash::sip::bench_long_str ... bench: 825 ns/iter (+/- 103)
test hash::sip::bench_str_of_8_bytes ... bench: 151 ns/iter (+/- 66)
test hash::sip::bench_str_over_8_bytes ... bench: 59 ns/iter (+/- 3)
test hash::sip::bench_str_under_8_bytes ... bench: 47 ns/iter (+/- 56)
test hash::sip::bench_u32 ... bench: 39 ns/iter (+/- 93) = 205 MB/s
test hash::sip::bench_u32_keyed ... bench: 40 ns/iter (+/- 88) = 200 MB/s
test hash::sip::bench_u64 ... bench: 54 ns/iter (+/- 96) = 148 MB/s
```
After:
```
test hash::sip::bench_bytes_4 ... bench: 41 ns/iter (+/- 3) = 97 MB/s
test hash::sip::bench_bytes_7 ... bench: 48 ns/iter (+/- 0) = 145 MB/s
test hash::sip::bench_bytes_8 ... bench: 35 ns/iter (+/- 1) = 228 MB/s
test hash::sip::bench_bytes_a_16 ... bench: 45 ns/iter (+/- 1) = 355 MB/s
test hash::sip::bench_bytes_b_32 ... bench: 60 ns/iter (+/- 0) = 533 MB/s
test hash::sip::bench_bytes_c_128 ... bench: 161 ns/iter (+/- 5) = 795 MB/s
test hash::sip::bench_long_str ... bench: 514 ns/iter (+/- 5)
test hash::sip::bench_str_of_8_bytes ... bench: 44 ns/iter (+/- 0)
test hash::sip::bench_str_over_8_bytes ... bench: 51 ns/iter (+/- 0)
test hash::sip::bench_str_under_8_bytes ... bench: 52 ns/iter (+/- 6)
test hash::sip::bench_u32 ... bench: 40 ns/iter (+/- 2) = 200 MB/s
test hash::sip::bench_u32_keyed ... bench: 39 ns/iter (+/- 1) = 205 MB/s
test hash::sip::bench_u64 ... bench: 36 ns/iter (+/- 1) = 222 MB/s
```
Many of these have long since reached their stage of being obsolete, so this
commit starts the removal process for all of them. The unstable features that
were deprecated are:
* box_heap
* cmp_partial
* fs_time
* hash_default
* int_slice
* iter_min_max
* iter_reset_fuse
* iter_to_vec
* map_in_place
* move_from
* owned_ascii_ext
* page_size
* read_and_zero
* scan_state
* slice_chars
* slice_position_elem
* subslice_offset
Many of these have long since reached their stage of being obsolete, so this
commit starts the removal process for all of them. The unstable features that
were deprecated are:
* cmp_partial
* fs_time
* hash_default
* int_slice
* iter_min_max
* iter_reset_fuse
* iter_to_vec
* map_in_place
* move_from
* owned_ascii_ext
* page_size
* read_and_zero
* scan_state
* slice_chars
* slice_position_elem
* subslice_offset
If they are ordered v0, v2, v1, v3, the compiler can find just a few
simd optimizations itself.
The new optimization I could observe on x86-64 was using 128 bit
registers for the v = key ^ constant operations in new / reset.
Use `ptr::copy_nonoverlapping` (aka memcpy) to load an u64 from the
byte stream. This is correct for any alignment, and the compiler will
use the appropriate instruction to load the data.
Use unchecked indexing.
This results in a large improvement of throughput (hashed bytes
/ second) for long data. Maximum improvement benches at a 70% increase
in throughput for large values (> 256 bytes) but already values of 16
bytes or larger improve.
Introducing unchecked indexing is motivated to reach as good throughput
as possible. Using ptr::copy_nonoverlapping without unchecked indexing
would land the improvement some 20-30 pct units lower.
We use a debug assertion so that the test suite checks our use of
unchecked indexing.
Macro desugaring of `in PLACE { BLOCK }` into "simpler" expressions following the in-development "Placer" protocol.
Includes Placer API that one can override to integrate support for `in` into one's own type. (See [RFC 809].)
[RFC 809]: https://github.com/rust-lang/rfcs/blob/master/text/0809-box-and-in-for-stdlib.md
Part of #22181
Replaced PR #26180.
Turns on the `in PLACE { BLOCK }` syntax, while leaving in support for the old `box (PLACE) EXPR` syntax (since we need to support that at least until we have a snapshot with support for `in PLACE { BLOCK }`.
(Note that we are not 100% committed to the `in PLACE { BLOCK }` syntax. In particular I still want to play around with some other alternatives. Still, I want to get the fundamental framework for the protocol landed so we can play with implementing it for non `Box` types.)
----
Also, this PR leaves out support for desugaring-based `box EXPR`. We will hopefully land that in the future, but for the short term there are type-inference issues injected by that change that we want to resolve separately.
eq_slice_() used to be a common implementation for two function that
both called it, but of those only eq_slice() is left, so we can as well
directly inline the code.
This commit moves the IR files in the distribution, rust_try.ll,
rust_try_msvc_64.ll, and rust_try_msvc_32.ll into the compiler from the main
distribution. There's a few reasons for this change:
* LLVM changes its IR syntax from time to time, so it's very difficult to
have these files build across many LLVM versions simultaneously. We'll likely
want to retain this ability for quite some time into the future.
* The implementation of these files is closely tied to the compiler and runtime
itself, so it makes sense to fold it into a location which can do more
platform-specific checks for various implementation details (such as MSVC 32
vs 64-bit).
* This removes LLVM as a build-time dependency of the standard library. This may
end up becoming very useful if we move towards building the standard library
with Cargo.
In the immediate future, however, this commit should restore compatibility with
LLVM 3.5 and 3.6.
This commit moves the IR files in the distribution, rust_try.ll,
rust_try_msvc_64.ll, and rust_try_msvc_32.ll into the compiler from the main
distribution. There's a few reasons for this change:
* LLVM changes its IR syntax from time to time, so it's very difficult to
have these files build across many LLVM versions simultaneously. We'll likely
want to retain this ability for quite some time into the future.
* The implementation of these files is closely tied to the compiler and runtime
itself, so it makes sense to fold it into a location which can do more
platform-specific checks for various implementation details (such as MSVC 32
vs 64-bit).
* This removes LLVM as a build-time dependency of the standard library. This may
end up becoming very useful if we move towards building the standard library
with Cargo.
In the immediate future, however, this commit should restore compatibility with
LLVM 3.5 and 3.6.
eq_slice_() used to be a common implementation for two function that
both called it, but of those only eq_slice() is left, so we can as well
directly inline the code.
This makes the primitive descriptions on the front page read properly
as descriptions of types and not of the associated modules.
Having the primitive and module docs derived from the same source
causes problems, primarily that they can't contain hyperlinks
cross-referencing each other.
This crates dedicated private modules in `std` to document the
primitive types, then for all primitives that have a corresponding
module, puts hyperlinks in moth the primitive docs and the module docs
cross-linking each other.
This should help clear up confusion when readers find themselves on
the wrong page.
This also removes all the duplicate `#[doc(primitive)]` tags in various places (especially core), so the core docs will no longer attempt to document the primitives for now. Seems like an acceptable tradeoff to get some cleanup for std.
I'm being constantly bitten by the lack of this implementation.
I'm unsure if there's a reason to avoid these implementations though.
Since we have a "lossy" implementation for both Mutex and RWLock (RWLock {{ locked }}) I don't think there's a big reason for not having a Debug implementation for the atomic types, even if the user can't specify the ordering.
Fixes#23812 by stripping the decoration when desugaring macro doc comments into #[doc] attributes, and detects whether the attribute should be inner or outer style and outputs the appropriate token tree.