Rollup merge of #147932 - thaliaarchi:utf8-osstring, r=tgross35

Create UTF-8 version of `OsStr`/`OsString`

Implement a UTF-8 version of `OsStr`/`OsString`, in addition to the existing bytes and WTF-8 platform-dependent encodings.

This is applicable for several platforms, but I've currently only implemented it for Motor OS:

- WASI uses Unicode paths, but currently reexports the Unix bytes-assuming `OsStrExt`/`OsStringExt` traits.
  - [wasi:filesystem](https://wa.dev/wasi:filesystem) APIs:
    > Paths are passed as interface-type `strings`, meaning they must consist of a sequence of Unicode Scalar Values (USVs). Some filesystems may contain paths which are not accessible by this API.
  - In [wasi-filesystem#17](https://github.com/WebAssembly/wasi-filesystem/issues/17#issuecomment-1430639353), it was decided that applications can use any Unicode transformation format, so we're free to use UTF-8 (and probably already do). This was chosen over specifically UTF-8 or an ad hoc encoding which preserves paths not representable in UTF-8.
    > The current API uses strings for filesystem paths, which contains sequences of Unicode scalar values (USVs), which applications can work with using strings encoded in UTF-8, UTF-16, or other Unicode encodings.
    >
    > This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it.
- As of Redox OS [0.7.0](https://www.redox-os.org/news/release-0.7.0/), "All paths are now required to be UTF-8, and the kernel enforces this". This appears to have been implemented in commit [d331f72f](d331f72f2a) (Use UTF-8 for all paths, 2021-02-14). Redox does not have `OsStrExt`/`OsStringExt`.
- Motor OS guarantees that its OS strings are UTF-8 in its [current `OsStrExt`/`OsStringExt` traits](a828ffcf5f/library/std/src/os/motor/ffi.rs), but they're still internally bytes like Unix.

This is an alternate approach to https://github.com/rust-lang/rust/pull/147797, which reuses the existing bytes `OsString` and relies on the safety properties of `from_encoded_bytes_unchecked`. Compared to that, this also gains efficiency from propagating the UTF-8 invariant to the whole type, as it never needs to test for UTF-8 validity.
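
As a rough illustration of that efficiency argument (a sketch under stated assumptions, not code from either PR): `BytesBuf` and `Utf8Buf` below are hypothetical stand-ins for a bytes-backed and a `String`-backed OS string. With bytes as the storage type, UTF-8 is only a convention, so every conversion to `&str` has to re-scan the buffer. With `String` as the storage type, the invariant is carried by the type and the conversion is a plain borrow.

```rust
use std::str::Utf8Error;

/// Hypothetical bytes-backed buffer. The type cannot promise UTF-8,
/// so `to_str` must validate on every call.
pub struct BytesBuf {
    inner: Vec<u8>,
}

impl BytesBuf {
    pub fn from_string(s: String) -> Self {
        BytesBuf { inner: s.into_bytes() }
    }

    pub fn from_bytes(bytes: Vec<u8>) -> Self {
        BytesBuf { inner: bytes }
    }

    pub fn to_str(&self) -> Result<&str, Utf8Error> {
        // Linear scan over the whole buffer to check validity.
        std::str::from_utf8(&self.inner)
    }
}

/// Hypothetical `String`-backed buffer. UTF-8 validity is an invariant
/// of the field type, so no check is needed.
pub struct Utf8Buf {
    inner: String,
}

impl Utf8Buf {
    pub fn from_string(s: String) -> Self {
        Utf8Buf { inner: s }
    }

    pub fn to_str(&self) -> Result<&str, Utf8Error> {
        // Always succeeds: just borrow the inner string.
        Ok(&self.inner)
    }
}
```

Both `to_str` methods keep the same fallible signature, mirroring how the platform-independent `OsStr::to_str` wrapper stays unchanged while the per-platform backend differs.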

Note that Motor OS currently does not build until https://github.com/rust-lang/rust/pull/147930 merges.

cc `@tgross35` (for earlier review)
cc `@alexcrichton`, `@rylev`, `@loganek` (for WASI)
cc `@lasiotus` (for Motor OS)
cc `@jackpot51` (for Redox OS)
Matthias Krüger 2025-10-22 07:12:11 +02:00 committed by GitHub
commit aa65c31c18
GPG key ID: B5690EEEBB952194
3 changed files with 347 additions and 8 deletions

View file

@@ -3,35 +3,40 @@
 use crate::ffi::{OsStr, OsString};
 use crate::sealed::Sealed;
+use crate::sys_common::{AsInner, IntoInner};
-/// Motor OS-specific extensions to [`OsString`].
+/// Motor OS–specific extensions to [`OsString`].
 ///
 /// This trait is sealed: it cannot be implemented outside the standard library.
 /// This is so that future additional methods are not breaking changes.
 pub trait OsStringExt: Sealed {
-    /// Motor OS strings are utf-8, and thus just strings.
-    fn as_str(&self) -> &str;
+    /// Yields the underlying UTF-8 string of this [`OsString`].
+    ///
+    /// OS strings on Motor OS are guaranteed to be UTF-8, so are just strings.
+    fn into_string(self) -> String;
 }
 impl OsStringExt for OsString {
     #[inline]
-    fn as_str(&self) -> &str {
-        self.to_str().unwrap()
+    fn into_string(self) -> String {
+        self.into_inner().inner
     }
 }
-/// Motor OS-specific extensions to [`OsString`].
+/// Motor OS–specific extensions to [`OsString`].
 ///
 /// This trait is sealed: it cannot be implemented outside the standard library.
 /// This is so that future additional methods are not breaking changes.
 pub trait OsStrExt: Sealed {
-    /// Motor OS strings are utf-8, and thus just strings.
+    /// Gets the underlying UTF-8 string view of the [`OsStr`] slice.
+    ///
+    /// OS strings on Motor OS are guaranteed to be UTF-8, so are just strings.
     fn as_str(&self) -> &str;
 }
 impl OsStrExt for OsStr {
     #[inline]
     fn as_str(&self) -> &str {
-        self.to_str().unwrap()
+        &self.as_inner().inner
     }
 }

View file

@@ -5,6 +5,10 @@ cfg_select! {
         mod wtf8;
         pub use wtf8::{Buf, Slice};
     }
+    any(target_os = "motor") => {
+        mod utf8;
+        pub use utf8::{Buf, Slice};
+    }
     _ => {
         mod bytes;
         pub use bytes::{Buf, Slice};

View file

@@ -0,0 +1,330 @@
//! An OsString/OsStr implementation that is guaranteed to be UTF-8.
use core::clone::CloneToUninit;
use crate::borrow::Cow;
use crate::collections::TryReserveError;
use crate::rc::Rc;
use crate::sync::Arc;
use crate::sys_common::{AsInner, FromInner, IntoInner};
use crate::{fmt, mem};
#[derive(Hash)]
#[repr(transparent)]
pub struct Buf {
pub inner: String,
}
#[repr(transparent)]
pub struct Slice {
pub inner: str,
}
impl IntoInner<String> for Buf {
fn into_inner(self) -> String {
self.inner
}
}
impl FromInner<String> for Buf {
fn from_inner(inner: String) -> Self {
Buf { inner }
}
}
impl AsInner<str> for Buf {
#[inline]
fn as_inner(&self) -> &str {
&self.inner
}
}
impl fmt::Debug for Buf {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
fmt::Debug::fmt(&self.inner, f)
}
}
impl fmt::Display for Buf {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
fmt::Display::fmt(&self.inner, f)
}
}
impl fmt::Debug for Slice {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
fmt::Debug::fmt(&self.inner, f)
}
}
impl fmt::Display for Slice {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
fmt::Display::fmt(&self.inner, f)
}
}
impl Clone for Buf {
#[inline]
fn clone(&self) -> Self {
Buf { inner: self.inner.clone() }
}
#[inline]
fn clone_from(&mut self, source: &Self) {
self.inner.clone_from(&source.inner)
}
}
impl Buf {
#[inline]
pub fn into_encoded_bytes(self) -> Vec<u8> {
self.inner.into_bytes()
}
#[inline]
pub unsafe fn from_encoded_bytes_unchecked(s: Vec<u8>) -> Self {
unsafe { Self { inner: String::from_utf8_unchecked(s) } }
}
#[inline]
pub fn into_string(self) -> Result<String, Buf> {
Ok(self.inner)
}
#[inline]
pub const fn from_string(s: String) -> Buf {
Buf { inner: s }
}
#[inline]
pub fn with_capacity(capacity: usize) -> Buf {
Buf { inner: String::with_capacity(capacity) }
}
#[inline]
pub fn clear(&mut self) {
self.inner.clear()
}
#[inline]
pub fn capacity(&self) -> usize {
self.inner.capacity()
}
#[inline]
pub fn push_slice(&mut self, s: &Slice) {
self.inner.push_str(&s.inner)
}
#[inline]
pub fn push_str(&mut self, s: &str) {
self.inner.push_str(s);
}
#[inline]
pub fn reserve(&mut self, additional: usize) {
self.inner.reserve(additional)
}
#[inline]
pub fn try_reserve(&mut self, additional: usize) -> Result<(), TryReserveError> {
self.inner.try_reserve(additional)
}
#[inline]
pub fn reserve_exact(&mut self, additional: usize) {
self.inner.reserve_exact(additional)
}
#[inline]
pub fn try_reserve_exact(&mut self, additional: usize) -> Result<(), TryReserveError> {
self.inner.try_reserve_exact(additional)
}
#[inline]
pub fn shrink_to_fit(&mut self) {
self.inner.shrink_to_fit()
}
#[inline]
pub fn shrink_to(&mut self, min_capacity: usize) {
self.inner.shrink_to(min_capacity)
}
#[inline]
pub fn as_slice(&self) -> &Slice {
Slice::from_str(&self.inner)
}
#[inline]
pub fn as_mut_slice(&mut self) -> &mut Slice {
Slice::from_mut_str(&mut self.inner)
}
#[inline]
pub fn leak<'a>(self) -> &'a mut Slice {
Slice::from_mut_str(self.inner.leak())
}
#[inline]
pub fn into_box(self) -> Box<Slice> {
unsafe { mem::transmute(self.inner.into_boxed_str()) }
}
#[inline]
pub fn from_box(boxed: Box<Slice>) -> Buf {
let inner: Box<str> = unsafe { mem::transmute(boxed) };
Buf { inner: inner.into_string() }
}
#[inline]
pub fn into_arc(&self) -> Arc<Slice> {
self.as_slice().into_arc()
}
#[inline]
pub fn into_rc(&self) -> Rc<Slice> {
self.as_slice().into_rc()
}
/// Provides plumbing to `String::truncate` without giving full mutable access
/// to the `String`.
///
/// # Safety
///
/// The length must be at an `OsStr` boundary, according to
/// `Slice::check_public_boundary`.
#[inline]
pub unsafe fn truncate_unchecked(&mut self, len: usize) {
self.inner.truncate(len);
}
/// Provides plumbing to `String::push_str` for raw bytes without giving full
/// mutable access to the `String`.
///
/// # Safety
///
/// The slice must be valid for the platform encoding (as described in
/// `OsStr::from_encoded_bytes_unchecked`). For this encoding, that means
/// `other` must be valid UTF-8.
#[inline]
pub unsafe fn extend_from_slice_unchecked(&mut self, other: &[u8]) {
self.inner.push_str(unsafe { str::from_utf8_unchecked(other) });
}
}
impl Slice {
#[inline]
pub fn as_encoded_bytes(&self) -> &[u8] {
self.inner.as_bytes()
}
#[inline]
pub unsafe fn from_encoded_bytes_unchecked(s: &[u8]) -> &Slice {
Slice::from_str(unsafe { str::from_utf8_unchecked(s) })
}
#[track_caller]
#[inline]
pub fn check_public_boundary(&self, index: usize) {
if !self.inner.is_char_boundary(index) {
panic!("byte index {index} is not an OsStr boundary");
}
}
#[inline]
pub fn from_str(s: &str) -> &Slice {
// SAFETY: Slice is just a wrapper over str.
unsafe { mem::transmute(s) }
}
#[inline]
fn from_mut_str(s: &mut str) -> &mut Slice {
// SAFETY: Slice is just a wrapper over str.
unsafe { mem::transmute(s) }
}
#[inline]
pub fn to_str(&self) -> Result<&str, crate::str::Utf8Error> {
Ok(&self.inner)
}
#[inline]
pub fn to_string_lossy(&self) -> Cow<'_, str> {
Cow::Borrowed(&self.inner)
}
#[inline]
pub fn to_owned(&self) -> Buf {
Buf { inner: self.inner.to_owned() }
}
#[inline]
pub fn clone_into(&self, buf: &mut Buf) {
self.inner.clone_into(&mut buf.inner)
}
#[inline]
pub fn into_box(&self) -> Box<Slice> {
let boxed: Box<str> = self.inner.into();
unsafe { mem::transmute(boxed) }
}
#[inline]
pub fn empty_box() -> Box<Slice> {
let boxed: Box<str> = Default::default();
unsafe { mem::transmute(boxed) }
}
#[inline]
pub fn into_arc(&self) -> Arc<Slice> {
let arc: Arc<str> = Arc::from(&self.inner);
unsafe { Arc::from_raw(Arc::into_raw(arc) as *const Slice) }
}
#[inline]
pub fn into_rc(&self) -> Rc<Slice> {
let rc: Rc<str> = Rc::from(&self.inner);
unsafe { Rc::from_raw(Rc::into_raw(rc) as *const Slice) }
}
#[inline]
pub fn make_ascii_lowercase(&mut self) {
self.inner.make_ascii_lowercase()
}
#[inline]
pub fn make_ascii_uppercase(&mut self) {
self.inner.make_ascii_uppercase()
}
#[inline]
pub fn to_ascii_lowercase(&self) -> Buf {
Buf { inner: self.inner.to_ascii_lowercase() }
}
#[inline]
pub fn to_ascii_uppercase(&self) -> Buf {
Buf { inner: self.inner.to_ascii_uppercase() }
}
#[inline]
pub fn is_ascii(&self) -> bool {
self.inner.is_ascii()
}
#[inline]
pub fn eq_ignore_ascii_case(&self, other: &Self) -> bool {
self.inner.eq_ignore_ascii_case(&other.inner)
}
}
#[unstable(feature = "clone_to_uninit", issue = "126799")]
unsafe impl CloneToUninit for Slice {
#[inline]
#[cfg_attr(debug_assertions, track_caller)]
unsafe fn clone_to_uninit(&self, dst: *mut u8) {
// SAFETY: we're just a transparent wrapper around str
unsafe { self.inner.clone_to_uninit(dst) }
}
}
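
The reference conversions in the new `utf8.rs` (`Slice::from_str`, `into_arc`, `check_public_boundary`) all lean on `Slice` being a `#[repr(transparent)]` wrapper over `str`. The following is a minimal standalone sketch of that pattern, assuming a hypothetical `Utf8Slice` type outside of std. It uses pointer casts instead of `mem::transmute` and returns a `bool` from the boundary check instead of panicking, so it is not the implementation above.

```rust
use std::sync::Arc;

/// Hypothetical transparent wrapper over `str` (not the std type).
#[repr(transparent)]
pub struct Utf8Slice {
    inner: str,
}

impl Utf8Slice {
    /// Reinterprets a `&str` as a `&Utf8Slice` without copying.
    pub fn from_str(s: &str) -> &Utf8Slice {
        // SAFETY: `Utf8Slice` is `#[repr(transparent)]` over `str`, so both
        // references have the same layout and the same length metadata.
        unsafe { &*(s as *const str as *const Utf8Slice) }
    }

    pub fn as_str(&self) -> &str {
        &self.inner
    }

    /// Reports whether `index` falls on a character boundary, which is
    /// the only place a UTF-8 OS string may be split.
    pub fn is_boundary(&self, index: usize) -> bool {
        self.inner.is_char_boundary(index)
    }

    /// Copies the contents into a reference-counted slice.
    pub fn into_arc(&self) -> Arc<Utf8Slice> {
        let arc: Arc<str> = Arc::from(&self.inner);
        // SAFETY: same layout argument as in `from_str`; the raw pointer
        // comes straight from `Arc::into_raw`.
        unsafe { Arc::from_raw(Arc::into_raw(arc) as *const Utf8Slice) }
    }
}
```

For a string such as `"a\u{e9}"` (one ASCII byte followed by a two-byte character), byte offsets 0, 1, and 3 are boundaries while offset 2 falls inside the second character, which is the case `check_public_boundary` rejects.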