
62 commits

Author SHA1 Message Date
Markus Reiter
3628a8f326
Remove unneeded parentheses. 2025-03-08 12:56:00 +01:00
Markus Reiter
224dad154b
Fix formatting. 2025-03-08 12:47:40 +01:00
Markus Reiter
90ebc24607
Use intrinsics::assume instead of hint::assert_unchecked. 2025-03-07 20:19:12 +01:00
Markus Reiter
22725588d3
Never inline lookup_slow. 2025-03-07 20:17:52 +01:00
Markus Reiter
34ac75be28
Add second precondition for skip_search. 2025-03-06 21:38:39 +01:00
Markus Reiter
222adac953
Allow optimizing out panic_bounds_check in Unicode checks. 2025-03-06 21:38:39 +01:00
bjorn3
1fcae03369 Rustfmt 2025-02-08 22:12:13 +00:00
Boxy
22998f0785 update cfgs 2024-11-27 15:14:54 +00:00
Ralf Jung
eddab479fd stabilize const_unicode_case_lookup 2024-11-12 15:13:31 +01:00
bors
cf2b370ad0 Auto merge of #132500 - RalfJung:char-is-whitespace-const, r=jhpratt
make char::is_whitespace unstably const

I am adding this to the existing https://github.com/rust-lang/rust/issues/132241 feature gate, since `is_digit` and `is_whitespace` seem similar enough that one can group them together.
2024-11-06 04:07:32 +00:00
Matthias Krüger
b438a5cd2a
Rollup merge of #132499 - RalfJung:unicode_data.rs, r=tgross35
unicode_data.rs: show command for generating file

https://github.com/rust-lang/rust/pull/131647 made this an easily runnable tool, now we just have to mention that in the comment. :)

Fixes https://github.com/rust-lang/rust/issues/131640.
2024-11-03 12:08:51 +01:00
Ralf Jung
0804815e69 make char::is_whitespace unstably const 2024-11-02 10:17:16 +01:00
Ralf Jung
720d618b5f unicode_data.rs: show command for generating file 2024-11-02 10:06:52 +01:00
Ralf Jung
66351a6184 get rid of a whole bunch of unnecessary rustc_const_unstable attributes 2024-11-02 09:59:55 +01:00
Matthias Krüger
fb42a4581b
Rollup merge of #131647 - jieyouxu:unicode-table-generator, r=Mark-Simulacrum
Register `src/tools/unicode-table-generator` as a runnable tool

It seems like `src/tools/unicode-table-generator` is not currently managed by bootstrap. This PR wires it up with bootstrap as a runnable tool.

This tool seems to take two possible args:

1. (Mandatory) path to `library/core/src/unicode/unicode_data.rs`, and
2. (Optional) path to generate a test file.

I only passed the mandatory path to `unicode_data.rs` in bootstrap and didn't do anything about (2). I'm not sure about how this tool is supposed to be run.

`Cargo.lock` is modified because I renamed `unicode-table-generator`'s bin name to match the tool name, as bootstrap's tool running logic expects the bin name to be derived from the tool name.

I also added a triagebot message reminding contributors not to edit the library source file manually, but to edit the tool and regenerate instead. This should probably be a tidy check (if that's desirable, it can go in a follow-up PR, though it may be overkill).

Helps with #131640 but does not close it because there are still no docs.

r? `@Mark-Simulacrum` (since I think you authored this tool?)
2024-10-20 16:54:09 +02:00
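
The two arguments described in this PR (a mandatory output path and an optional test-file path) could be read roughly as below. This is only a sketch: `parse_args` and its error message are hypothetical names chosen for illustration, not the tool's actual code.

```rust
/// Hypothetical argument parser: the first argument after the program name
/// is mandatory (path to `unicode_data.rs`), the second is optional
/// (path where a test file would be generated).
pub fn parse_args<I>(mut args: I) -> Result<(String, Option<String>), String>
where
    I: Iterator<Item = String>,
{
    // Skip the program name.
    let _program = args.next();
    let data_path = args
        .next()
        .ok_or_else(|| "missing path to unicode_data.rs".to_string())?;
    // A missing second argument is not an error.
    let test_path = args.next();
    Ok((data_path, test_path))
}
```

In a real binary the iterator would be `std::env::args()`.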
许杰友 Jieyou Xu (Joe)
75a9c86a77 unicode-table-generator: sync comments
These comments were updated on master but not through this tool, so the
comments in the tool became outdated. Sync the comments to stay
consistent.
2024-10-13 19:33:10 +08:00
许杰友 Jieyou Xu (Joe)
d21aa86c65 unicode-table-generator: match bin name with tool name
Bootstrap assumes that the binary name is the same as the tool name; matching
them just makes everyone's lives easier.
2024-10-13 19:14:06 +08:00
Ralf Jung
90e4f10f6c switch unicode-data back to 'static' 2024-10-13 11:53:06 +02:00
Michael Goulet
c682aa162b Reformat using the new identifier sorting from rustfmt 2024-09-22 19:11:29 -04:00
Nicholas Nethercote
84ac80f192 Reformat use declarations.
The previous commit updated `rustfmt.toml` appropriately. This commit is
the outcome of running `x fmt --all` with the new formatting options.
2024-07-29 08:26:52 +10:00
Arpad Borsos
488598c183
Add a lower bound check to unicode-table-generator output
This adds a dedicated check for the lower bound
(if it is outside of ASCII range) to the output of the `unicode-table-generator` tool.

This generalizes the ASCII-only fast path, but only for the `Grapheme_Extend` property for now,
as that is the only one with a lower bound outside of ASCII.
2024-04-20 10:16:45 +02:00
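
A minimal sketch of the lower-bound check, assuming a toy range table rather than the real generated data. `RANGES` and `LOWER_BOUND` are hypothetical names; the sketch only shows how a cheap comparison can reject most inputs before the slower search runs.

```rust
use std::cmp::Ordering;

// Hypothetical toy data: sorted, inclusive code point ranges standing in
// for a generated property table. This is not the real table.
const RANGES: &[(u32, u32)] = &[(0x0300, 0x036F), (0x0483, 0x0489)];

// Smallest code point covered by RANGES; it lies outside the ASCII range.
const LOWER_BOUND: u32 = 0x0300;

// The full search, kept out of line so the fast path stays small.
#[inline(never)]
fn lookup_slow(c: char) -> bool {
    let cp = c as u32;
    RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

// Fast path: anything below the lower bound (all of ASCII included)
// is rejected with a single comparison.
#[inline]
pub fn lookup(c: char) -> bool {
    (c as u32) >= LOWER_BOUND && lookup_slow(c)
}
```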
KaDiWa
ad2b34d0e3
remove some unneeded imports 2023-04-12 19:27:18 +02:00
Martin Gammelsæter
54f55efb9a Use hex literal for INDEX_MASK 2023-03-21 09:59:47 +01:00
Martin Gammelsæter
355e1dda1d Improve case mapping encoding scheme
The indices are encoded as `u32`s in the range of invalid `char`s, so
that we know that if any mapping fails to parse as a `char` we should
use the value for lookup in the multi-table.

This avoids the second binary search in cases where a multi-`char`
mapping is needed.

Idea from @nikic
2023-03-16 21:42:15 +01:00
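
The encoding described in this commit can be sketched as follows, under the assumption of two tiny hand-written tables. `TO_UPPER`, `MULTI`, and the table contents are hypothetical; `INDEX_MASK` is borrowed from the follow-up commit title, and its value here is only an assumption chosen to lie above `char::MAX`.

```rust
// Assumed flag value: any u32 with this bit set is above char::MAX
// (0x10FFFF), so it can never parse as a valid char.
const INDEX_MASK: u32 = 0x40_0000;

// Hypothetical toy table, sorted by key. A value is either a valid char
// (single-char mapping) or INDEX_MASK | index into MULTI.
const TO_UPPER: &[(char, u32)] = &[
    ('\u{00DF}', INDEX_MASK), // ß maps to "SS", stored at MULTI[0]
    ('\u{00E9}', 0x00C9),     // é maps to É
];

// Hypothetical multi-char table, padded with '\0'.
const MULTI: &[[char; 3]] = &[['S', 'S', '\0']];

pub fn to_upper(c: char) -> [char; 3] {
    match TO_UPPER.binary_search_by_key(&c, |&(key, _)| key) {
        // Not in the toy table: the char maps to itself.
        Err(_) => [c, '\0', '\0'],
        Ok(i) => {
            let raw = TO_UPPER[i].1;
            match char::from_u32(raw) {
                // Valid char: a single-char mapping, no second search.
                Some(mapped) => [mapped, '\0', '\0'],
                // Invalid char: strip the flag and index the multi table.
                None => MULTI[(raw & (INDEX_MASK - 1)) as usize],
            }
        }
    }
}
```

The design point is that one binary search suffices: the value itself says which table it refers to.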
Martin Gammelsæter
f9bd884385 Split unicode case LUTs in single and multi variants
The majority of char case replacements are single char replacements,
so storing them as [char; 3] wastes a lot of space.

This commit splits the replacement tables for both `to_lower` and
`to_upper` into two separate tables, one with single-character mappings
and one with multi-character mappings.

This reduces the binary size for programs using all of these tables
by roughly 24K bytes.
2023-03-16 12:34:04 +01:00
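
The space argument can be made concrete with a little arithmetic. The sketch below is an illustration only: the constant and function names are hypothetical, and it counts just the mapping payload, not the keys or the extra multi-character table.

```rust
use std::mem::size_of;

// Payload per entry when every mapping reserves room for three chars.
pub const WIDE_PAYLOAD: usize = size_of::<[char; 3]>();

// Payload per entry when the mapping is known to be a single char.
pub const SINGLE_PAYLOAD: usize = size_of::<char>();

// Bytes saved by storing `single_entries` mappings in the narrow form.
pub fn payload_bytes_saved(single_entries: usize) -> usize {
    single_entries * (WIDE_PAYLOAD - SINGLE_PAYLOAD)
}
```

Since a `char` is four bytes, each single-character mapping moved out of a `[char; 3]` slot saves eight bytes of payload.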
Martin Gammelsæter
8a4eb9e3a8 Skip serializing ascii chars in case LUTs
Since ascii chars are already handled by a special case in the
`to_lower` and `to_upper` functions, there's no need to waste space on
them in the LUTs.
2023-03-15 17:27:23 +01:00
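
A sketch of why ASCII entries are redundant in the table, assuming a toy lookup table with hypothetical contents. The ASCII branch returns before the table is ever consulted, so ASCII rows would never be read.

```rust
// Hypothetical toy table: non-ASCII entries only, sorted by key.
const TO_LOWER: &[(char, char)] = &[
    ('\u{00C9}', '\u{00E9}'), // É to é
    ('\u{0416}', '\u{0436}'), // Ж to ж
];

pub fn to_lower(c: char) -> char {
    // ASCII is handled by a special case, so the table needs no ASCII rows.
    if c.is_ascii() {
        return c.to_ascii_lowercase();
    }
    match TO_LOWER.binary_search_by_key(&c, |&(key, _)| key) {
        Ok(i) => TO_LOWER[i].1,
        // Not in the toy table: leave the char unchanged.
        Err(_) => c,
    }
}
```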
Sage Mitchell
2b328ea5ee
Address feedback from PR #101401 2022-09-04 08:07:53 -07:00
Sage Mitchell
4a3e169da7
Make char::is_lowercase and char::is_uppercase const
Implements #101400.
2022-09-04 08:07:53 -07:00
bors
ce36e88256 Auto merge of #100497 - kadiwa4:remove_clone_into_iter, r=cjgillot
Avoid cloning a collection only to iterate over it

`@rustbot` label: +C-cleanup
2022-08-28 18:31:08 +00:00
Yuki Okushi
e31bedc9cf
Rollup merge of #100924 - est31:closure_to_fn_ptr, r=Mark-Simulacrum
Smaller improvements of tidy and the unicode generator
2022-08-27 13:14:19 +09:00
est31
754b3e7567 Change hint to correct path 2022-08-23 19:06:27 +02:00
est31
0a6af989f6 Simplify unicode_downloads.rs
Reduce duplication by moving fetching logic into a dedicated function.
2022-08-23 19:04:07 +02:00
KaDiWa
4eebcb9910
avoid cloning and then iterating 2022-08-13 16:16:52 +02:00
Bruce A. MacNaughton
5d048eb69d add #[inline] 2022-07-20 16:13:54 -07:00
Bruce A. MacNaughton
89ace470dc formatted 2022-07-19 18:03:18 -07:00
Bruce A. MacNaughton
d4819632e2 working updates 2022-07-19 17:35:19 -07:00
T-O-R-U-S
72a25d05bf Use implicit capture syntax in format_args
This updates the standard library's documentation to use the new syntax. The
documentation is worthwhile to update as it should be more idiomatic
(particularly for features like this, which are nice for users to get acquainted
with). The general codebase is likely more hassle than benefit to update: it'll
hurt git blame, and generally updates can be done by folks updating the code if
(and when) that makes things more readable with the new format.

A few places in the compiler and library code are updated (mostly just due to
already having been done when this commit was first authored).
2022-03-10 10:23:40 -05:00
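
For readers unfamiliar with the syntax this commit refers to, here is a small comparison. The `describe` function is a hypothetical example, not code from the commit.

```rust
// Builds the same string twice: once with positional arguments and once
// with implicit capture of the local variables named inside the braces.
pub fn describe(name: &str, size: usize) -> (String, String) {
    let positional = format!("{}: {} bytes", name, size);
    let captured = format!("{name}: {size} bytes");
    (positional, captured)
}
```

Both forms produce identical output; only identifiers can be captured this way, not arbitrary expressions.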
Josh Stone
6b0b417299 Let unicode-table-generator fail gracefully for bitsets
The "Alphabetic" property in Unicode 14 grew too big for the bitset
representation, panicking "cannot pack 264 into 8 bits". However, we
were already choosing the skiplist for that anyway, so this doesn't need
to be a hard failure. That panic is now a returned `Err`, and then in
`emit_codepoints` we automatically defer to skiplist.
2021-10-06 17:35:49 -07:00
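
The change from a panic to a returned `Err` with a fallback can be sketched like this. All names (`Encoding`, `pack_index`, `choose_encoding`) are hypothetical, and the selection logic is deliberately simplified to the failure case alone.

```rust
#[derive(Debug, PartialEq)]
pub enum Encoding {
    Bitset,
    Skiplist,
}

// Hypothetical packing step: returns an error instead of panicking when
// the value does not fit in 8 bits.
pub fn pack_index(value: usize) -> Result<u8, String> {
    if value <= u8::MAX as usize {
        Ok(value as u8)
    } else {
        Err(format!("cannot pack {value} into 8 bits"))
    }
}

// Simplified selection: if the bitset cannot be built, fall back to the
// skiplist rather than aborting the whole generator run.
pub fn choose_encoding(max_index: usize) -> Encoding {
    match pack_index(max_index) {
        Ok(_) => Encoding::Bitset,
        Err(_) => Encoding::Skiplist,
    }
}
```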
Josh Stone
e159d42a9a Redo #81358 in unicode-table-generator 2021-10-06 15:45:17 -07:00
Mark Rousskov
c746be2219 Migrate to 2021 2021-09-20 22:21:42 -04:00
Jade
3cf820e17d rfc3052: Remove authors field from Cargo manifests
Since RFC 3052 soft deprecated the authors field anyway, hiding it from
crates.io, docs.rs, and making Cargo not add it by default, and it is
not generally up to date/useful information, we should remove it from
crates in this repo.
2021-07-29 14:56:05 -07:00
Matthias Krüger
ba6b4274b5 unicode_table_generator: fix clippy::writeln_empty_string, clippy::useless_format, clippy::for_kv_map 2020-08-24 00:43:50 +02:00
Izzy Swart
b809f453ca
Fix typo "biset" -> "bitset" 2020-08-06 16:13:29 -07:00
mark
2c31b45ae8 mv std libs to library/ 2020-07-27 19:51:13 -05:00
Lzu Tao
fff822fead Migrate to numeric associated consts 2020-06-10 01:35:47 +00:00
Pyfisch
7f4048c710 Store UNICODE_VERSION as a tuple
Remove the UnicodeVersion struct containing
major, minor and update fields and replace it with
a 3-tuple containing the version number.
As the value of each field is limited to 255
use u8 to store them.
2020-04-11 12:56:25 +02:00
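
A sketch of the tuple representation. The version value below is a placeholder chosen for illustration, and `version_string` is a hypothetical helper.

```rust
// Placeholder value for illustration; each field fits in a u8.
pub const UNICODE_VERSION: (u8, u8, u8) = (13, 0, 0);

// Hypothetical helper: formats a (major, minor, update) tuple.
pub fn version_string(version: (u8, u8, u8)) -> String {
    let (major, minor, update) = version;
    format!("{major}.{minor}.{update}")
}
```

One convenience of a plain tuple is that versions compare lexicographically with the ordinary comparison operators.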
Mark Rousskov
ad679a7f43 Update the documentation comment 2020-03-27 19:02:23 -04:00
Mark Rousskov
b6bc906004 Remove separate encoding for a single nonzero-mapping byte
In practice, for the two data sets that still use the bitset encoding (uppercase
and lowercase) this is not a significant win, so just drop it entirely. It costs
us about 5 bytes, and the complexity is nontrivial.
2020-03-27 19:02:23 -04:00
Mark Rousskov
9c1ceece20 Add skip list based implementation for smaller encoding
This arranges for the sparser sets (everything except lower and uppercase) to be
encoded in a significantly smaller context. However, it is also a performance
trade-off (roughly 3x slower than the bitset encoding). The 40% size reduction
is deemed to be sufficiently important to merit this performance loss,
particularly as it is unlikely that this code is hot anywhere (and if it is,
paying the memory cost for a bitset that directly represents the data seems
worthwhile).

Alphabetic     : 1599 bytes     (- 937 bytes)
Case_Ignorable : 949 bytes      (- 822 bytes)
Cased          : 359 bytes      (- 429 bytes)
Cc             : 9 bytes        (-  15 bytes)
Grapheme_Extend: 813 bytes      (- 675 bytes)
Lowercase      : 863 bytes
N              : 419 bytes      (- 619 bytes)
Uppercase      : 776 bytes
White_Space    : 37 bytes       (-  46 bytes)
Total table sizes: 5824 bytes   (-3543 bytes)
2020-03-27 19:02:23 -04:00
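
The idea behind a run-based encoding for sparse sets can be sketched as below. This is a simplification under stated assumptions, not the generated code: `RUNS` is a hypothetical toy set, and the lookup is a plain linear scan with no index structure on top.

```rust
// Hypothetical toy set {0x09..=0x0D, 0x20}. Runs alternate between
// "outside the set" and "inside the set", starting with "outside":
// 9 outside, 5 inside, 18 outside, 1 inside.
const RUNS: &[u32] = &[9, 5, 18, 1];

pub fn contains(cp: u32) -> bool {
    let mut end = 0u32; // exclusive end of the current run
    let mut inside = false; // whether the current run belongs to the set
    for &len in RUNS {
        end += len;
        if cp < end {
            return inside;
        }
        inside = !inside;
    }
    // Past the last run: not in the set.
    false
}
```

A sparse set needs only a handful of run lengths, which is where the size win comes from; walking the runs instead of indexing a bitset directly is the kind of cost behind the trade-off the commit message describes.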
Mark Rousskov
33b9e6f5cf Add richer printing 2020-03-24 16:24:47 -04:00