Rollup merge of #147932 - thaliaarchi:utf8-osstring, r=tgross35
Create UTF-8 version of `OsStr`/`OsString`

Implement a UTF-8 version of `OsStr`/`OsString`, in addition to the existing bytes and WTF-8 platform-dependent encodings.

This is applicable for several platforms, but I've currently only implemented it for Motor OS:

- WASI uses Unicode paths, but currently reexports the Unix bytes-assuming `OsStrExt`/`OsStringExt` traits.
  - [wasi:filesystem](https://wa.dev/wasi:filesystem) APIs:
    > Paths are passed as interface-type `strings`, meaning they must consist of a sequence of Unicode Scalar Values (USVs). Some filesystems may contain paths which are not accessible by this API.
  - In [wasi-filesystem#17](https://github.com/WebAssembly/wasi-filesystem/issues/17#issuecomment-1430639353), it was decided that applications can use any Unicode transformation format, so we're free to use UTF-8 (and probably already do). This was chosen over specifically UTF-8 or an ad hoc encoding which preserves paths not representable in UTF-8.
    > The current API uses strings for filesystem paths, which contains sequences of Unicode scalar values (USVs), which applications can work with using strings encoded in UTF-8, UTF-16, or other Unicode encodings.
    >
    > This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it.
- As of Redox OS [0.7.0](https://www.redox-os.org/news/release-0.7.0/), "All paths are now required to be UTF-8, and the kernel enforces this". This appears to have been implemented in commit [d331f72f](d331f72f2a) (Use UTF-8 for all paths, 2021-02-14). Redox does not have `OsStrExt`/`OsStringExt`.
- Motor OS guarantees that its OS strings are UTF-8 in its [current `OsStrExt`/`OsStringExt` traits](a828ffcf5f/library/std/src/os/motor/ffi.rs), but they're still internally bytes like Unix.

This is an alternate approach to https://github.com/rust-lang/rust/pull/147797, which reuses the existing bytes `OsString` and relies on the safety properties of `from_encoded_bytes_unchecked`. Compared to that, this also gains efficiency from propagating the UTF-8 invariant to the whole type, as it never needs to test for UTF-8 validity.

Note that Motor OS currently does not build until https://github.com/rust-lang/rust/pull/147930 merges.

cc `@tgross35` (for earlier review)
cc `@alexcrichton`, `@rylev`, `@loganek` (for WASI)
cc `@lasiotus` (for Motor OS)
cc `@jackpot51` (for Redox OS)
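To illustrate the efficiency point made in the description, here is a minimal sketch contrasting a bytes-backed buffer with a UTF-8-backed one. `BytesBuf` and `Utf8Buf` are hypothetical stand-ins written for this example, not the types from the standard library; only the shape of `to_str` mirrors the diff.

```rust
use std::str::Utf8Error;

/// Hypothetical bytes-backed OS string: the contents may be arbitrary bytes.
pub struct BytesBuf {
    inner: Vec<u8>,
}

impl BytesBuf {
    pub fn from_vec(inner: Vec<u8>) -> Self {
        BytesBuf { inner }
    }

    /// Has to scan the buffer on every call, because nothing in the type
    /// guarantees that the bytes are valid UTF-8.
    pub fn to_str(&self) -> Result<&str, Utf8Error> {
        std::str::from_utf8(&self.inner)
    }
}

/// Hypothetical UTF-8-backed OS string: validity is an invariant of the type.
pub struct Utf8Buf {
    inner: String,
}

impl Utf8Buf {
    pub fn from_string(inner: String) -> Self {
        Utf8Buf { inner }
    }

    /// No scan is needed: the `String` field already guarantees UTF-8,
    /// so the conversion always succeeds.
    pub fn to_str(&self) -> Result<&str, Utf8Error> {
        Ok(&self.inner)
    }
}
```

The signatures are kept identical on purpose: callers written against the fallible API keep compiling, while the UTF-8 variant turns the check into a plain borrow.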
commit aa65c31c18
3 changed files with 347 additions and 8 deletions
@@ -3,35 +3,40 @@
 use crate::ffi::{OsStr, OsString};
 use crate::sealed::Sealed;
+use crate::sys_common::{AsInner, IntoInner};

-/// Motor OS-specific extensions to [`OsString`].
+/// Motor OS–specific extensions to [`OsString`].
 ///
 /// This trait is sealed: it cannot be implemented outside the standard library.
 /// This is so that future additional methods are not breaking changes.
 pub trait OsStringExt: Sealed {
-    /// Motor OS strings are utf-8, and thus just strings.
-    fn as_str(&self) -> &str;
+    /// Yields the underlying UTF-8 string of this [`OsString`].
+    ///
+    /// OS strings on Motor OS are guaranteed to be UTF-8, so are just strings.
+    fn into_string(self) -> String;
 }

 impl OsStringExt for OsString {
     #[inline]
-    fn as_str(&self) -> &str {
-        self.to_str().unwrap()
+    fn into_string(self) -> String {
+        self.into_inner().inner
     }
 }

-/// Motor OS-specific extensions to [`OsString`].
+/// Motor OS–specific extensions to [`OsString`].
 ///
 /// This trait is sealed: it cannot be implemented outside the standard library.
 /// This is so that future additional methods are not breaking changes.
 pub trait OsStrExt: Sealed {
-    /// Motor OS strings are utf-8, and thus just strings.
+    /// Gets the underlying UTF-8 string view of the [`OsStr`] slice.
+    ///
+    /// OS strings on Motor OS are guaranteed to be UTF-8, so are just strings.
     fn as_str(&self) -> &str;
 }

 impl OsStrExt for OsStr {
     #[inline]
     fn as_str(&self) -> &str {
-        self.to_str().unwrap()
+        &self.as_inner().inner
     }
 }
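
The sealed extension-trait pattern used in this hunk can be sketched outside the standard library roughly as follows. `Utf8OsString`, `Utf8OsStringExt`, and the private `sealed` module are hypothetical names chosen for illustration; the real traits are implemented for `OsString` and `OsStr` and exist only on Motor OS. For brevity the sketch puts both accessors on one trait.

```rust
mod sealed {
    /// Private marker trait. Code outside this crate cannot name it, so it
    /// cannot implement any trait that lists it as a supertrait.
    pub trait Sealed {}
}

/// Hypothetical stand-in for an OS string whose contents are always UTF-8.
pub struct Utf8OsString {
    inner: String,
}

impl Utf8OsString {
    pub fn new(inner: String) -> Self {
        Utf8OsString { inner }
    }
}

/// Hypothetical extension trait, sealed so that adding methods later is not
/// a breaking change for downstream code.
pub trait Utf8OsStringExt: sealed::Sealed {
    /// Borrows the contents as `&str` without any validity check.
    fn as_str(&self) -> &str;

    /// Consumes the value and returns the underlying `String`.
    fn into_string(self) -> String;
}

impl sealed::Sealed for Utf8OsString {}

impl Utf8OsStringExt for Utf8OsString {
    #[inline]
    fn as_str(&self) -> &str {
        &self.inner
    }

    #[inline]
    fn into_string(self) -> String {
        self.inner
    }
}
```

Neither accessor can fail or panic here, which is the difference from the removed `self.to_str().unwrap()` bodies: the guarantee comes from the field type rather than from a runtime check.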
@@ -5,6 +5,10 @@ cfg_select! {
         mod wtf8;
         pub use wtf8::{Buf, Slice};
     }
+    any(target_os = "motor") => {
+        mod utf8;
+        pub use utf8::{Buf, Slice};
+    }
     _ => {
         mod bytes;
         pub use bytes::{Buf, Slice};
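
The hunk above picks one backing module per target and re-exports the same names from it. As a rough sketch of that selection idea using only stable `#[cfg]` attributes (not the `cfg_select!` macro the diff uses), the following assumes a two-way split between Windows and everything else; the `imp` module, the `ENCODING` constant, and `encoding_name` are hypothetical, and the exact conditions of the real arms are not shown in the diff.

```rust
// Exactly one of these two modules is compiled for any given target.
#[cfg(windows)]
mod imp {
    /// Assumed for this sketch: Windows uses the WTF-8 implementation.
    pub const ENCODING: &str = "wtf8";
}

#[cfg(not(windows))]
mod imp {
    /// Fallback arm: arbitrary bytes, as on Unix.
    pub const ENCODING: &str = "bytes";
}

// The rest of the crate refers to one name, whichever module was selected.
// A UTF-8 arm would be added as a third `#[cfg(...)]` module ahead of the
// fallback, with the fallback condition narrowed to exclude it.
pub use imp::ENCODING;

/// Reports which implementation this build selected.
pub fn encoding_name() -> &'static str {
    ENCODING
}
```

Because every arm exports the same names, code above this layer does not need its own conditional compilation.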
330
library/std/src/sys/os_str/utf8.rs
Normal file
@@ -0,0 +1,330 @@
+//! An OsString/OsStr implementation that is guaranteed to be UTF-8.
+
+use core::clone::CloneToUninit;
+
+use crate::borrow::Cow;
+use crate::collections::TryReserveError;
+use crate::rc::Rc;
+use crate::sync::Arc;
+use crate::sys_common::{AsInner, FromInner, IntoInner};
+use crate::{fmt, mem};
+
+#[derive(Hash)]
+#[repr(transparent)]
+pub struct Buf {
+    pub inner: String,
+}
+
+#[repr(transparent)]
+pub struct Slice {
+    pub inner: str,
+}
+
+impl IntoInner<String> for Buf {
+    fn into_inner(self) -> String {
+        self.inner
+    }
+}
+
+impl FromInner<String> for Buf {
+    fn from_inner(inner: String) -> Self {
+        Buf { inner }
+    }
+}
+
+impl AsInner<str> for Buf {
+    #[inline]
+    fn as_inner(&self) -> &str {
+        &self.inner
+    }
+}
+
+impl fmt::Debug for Buf {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        fmt::Debug::fmt(&self.inner, f)
+    }
+}
+
+impl fmt::Display for Buf {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        fmt::Display::fmt(&self.inner, f)
+    }
+}
+
+impl fmt::Debug for Slice {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        fmt::Debug::fmt(&self.inner, f)
+    }
+}
+
+impl fmt::Display for Slice {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        fmt::Display::fmt(&self.inner, f)
+    }
+}
+
+impl Clone for Buf {
+    #[inline]
+    fn clone(&self) -> Self {
+        Buf { inner: self.inner.clone() }
+    }
+
+    #[inline]
+    fn clone_from(&mut self, source: &Self) {
+        self.inner.clone_from(&source.inner)
+    }
+}
+
+impl Buf {
+    #[inline]
+    pub fn into_encoded_bytes(self) -> Vec<u8> {
+        self.inner.into_bytes()
+    }
+
+    #[inline]
+    pub unsafe fn from_encoded_bytes_unchecked(s: Vec<u8>) -> Self {
+        unsafe { Self { inner: String::from_utf8_unchecked(s) } }
+    }
+
+    #[inline]
+    pub fn into_string(self) -> Result<String, Buf> {
+        Ok(self.inner)
+    }
+
+    #[inline]
+    pub const fn from_string(s: String) -> Buf {
+        Buf { inner: s }
+    }
+
+    #[inline]
+    pub fn with_capacity(capacity: usize) -> Buf {
+        Buf { inner: String::with_capacity(capacity) }
+    }
+
+    #[inline]
+    pub fn clear(&mut self) {
+        self.inner.clear()
+    }
+
+    #[inline]
+    pub fn capacity(&self) -> usize {
+        self.inner.capacity()
+    }
+
+    #[inline]
+    pub fn push_slice(&mut self, s: &Slice) {
+        self.inner.push_str(&s.inner)
+    }
+
+    #[inline]
+    pub fn push_str(&mut self, s: &str) {
+        self.inner.push_str(s);
+    }
+
+    #[inline]
+    pub fn reserve(&mut self, additional: usize) {
+        self.inner.reserve(additional)
+    }
+
+    #[inline]
+    pub fn try_reserve(&mut self, additional: usize) -> Result<(), TryReserveError> {
+        self.inner.try_reserve(additional)
+    }
+
+    #[inline]
+    pub fn reserve_exact(&mut self, additional: usize) {
+        self.inner.reserve_exact(additional)
+    }
+
+    #[inline]
+    pub fn try_reserve_exact(&mut self, additional: usize) -> Result<(), TryReserveError> {
+        self.inner.try_reserve_exact(additional)
+    }
+
+    #[inline]
+    pub fn shrink_to_fit(&mut self) {
+        self.inner.shrink_to_fit()
+    }
+
+    #[inline]
+    pub fn shrink_to(&mut self, min_capacity: usize) {
+        self.inner.shrink_to(min_capacity)
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &Slice {
+        Slice::from_str(&self.inner)
+    }
+
+    #[inline]
+    pub fn as_mut_slice(&mut self) -> &mut Slice {
+        Slice::from_mut_str(&mut self.inner)
+    }
+
+    #[inline]
+    pub fn leak<'a>(self) -> &'a mut Slice {
+        Slice::from_mut_str(self.inner.leak())
+    }
+
+    #[inline]
+    pub fn into_box(self) -> Box<Slice> {
+        unsafe { mem::transmute(self.inner.into_boxed_str()) }
+    }
+
+    #[inline]
+    pub fn from_box(boxed: Box<Slice>) -> Buf {
+        let inner: Box<str> = unsafe { mem::transmute(boxed) };
+        Buf { inner: inner.into_string() }
+    }
+
+    #[inline]
+    pub fn into_arc(&self) -> Arc<Slice> {
+        self.as_slice().into_arc()
+    }
+
+    #[inline]
+    pub fn into_rc(&self) -> Rc<Slice> {
+        self.as_slice().into_rc()
+    }
+
+    /// Provides plumbing to `Vec::truncate` without giving full mutable access
+    /// to the `Vec`.
+    ///
+    /// # Safety
+    ///
+    /// The length must be at an `OsStr` boundary, according to
+    /// `Slice::check_public_boundary`.
+    #[inline]
+    pub unsafe fn truncate_unchecked(&mut self, len: usize) {
+        self.inner.truncate(len);
+    }
+
+    /// Provides plumbing to `Vec::extend_from_slice` without giving full
+    /// mutable access to the `Vec`.
+    ///
+    /// # Safety
+    ///
+    /// The slice must be valid for the platform encoding (as described in
+    /// `OsStr::from_encoded_bytes_unchecked`). For this encoding, that means
+    /// `other` must be valid UTF-8.
+    #[inline]
+    pub unsafe fn extend_from_slice_unchecked(&mut self, other: &[u8]) {
+        self.inner.push_str(unsafe { str::from_utf8_unchecked(other) });
+    }
+}
+
+impl Slice {
+    #[inline]
+    pub fn as_encoded_bytes(&self) -> &[u8] {
+        self.inner.as_bytes()
+    }
+
+    #[inline]
+    pub unsafe fn from_encoded_bytes_unchecked(s: &[u8]) -> &Slice {
+        Slice::from_str(unsafe { str::from_utf8_unchecked(s) })
+    }
+
+    #[track_caller]
+    #[inline]
+    pub fn check_public_boundary(&self, index: usize) {
+        if !self.inner.is_char_boundary(index) {
+            panic!("byte index {index} is not an OsStr boundary");
+        }
+    }
+
+    #[inline]
+    pub fn from_str(s: &str) -> &Slice {
+        // SAFETY: Slice is just a wrapper over str.
+        unsafe { mem::transmute(s) }
+    }
+
+    #[inline]
+    fn from_mut_str(s: &mut str) -> &mut Slice {
+        // SAFETY: Slice is just a wrapper over str.
+        unsafe { mem::transmute(s) }
+    }
+
+    #[inline]
+    pub fn to_str(&self) -> Result<&str, crate::str::Utf8Error> {
+        Ok(&self.inner)
+    }
+
+    #[inline]
+    pub fn to_string_lossy(&self) -> Cow<'_, str> {
+        Cow::Borrowed(&self.inner)
+    }
+
+    #[inline]
+    pub fn to_owned(&self) -> Buf {
+        Buf { inner: self.inner.to_owned() }
+    }
+
+    #[inline]
+    pub fn clone_into(&self, buf: &mut Buf) {
+        self.inner.clone_into(&mut buf.inner)
+    }
+
+    #[inline]
+    pub fn into_box(&self) -> Box<Slice> {
+        let boxed: Box<str> = self.inner.into();
+        unsafe { mem::transmute(boxed) }
+    }
+
+    #[inline]
+    pub fn empty_box() -> Box<Slice> {
+        let boxed: Box<str> = Default::default();
+        unsafe { mem::transmute(boxed) }
+    }
+
+    #[inline]
+    pub fn into_arc(&self) -> Arc<Slice> {
+        let arc: Arc<str> = Arc::from(&self.inner);
+        unsafe { Arc::from_raw(Arc::into_raw(arc) as *const Slice) }
+    }
+
+    #[inline]
+    pub fn into_rc(&self) -> Rc<Slice> {
+        let rc: Rc<str> = Rc::from(&self.inner);
+        unsafe { Rc::from_raw(Rc::into_raw(rc) as *const Slice) }
+    }
+
+    #[inline]
+    pub fn make_ascii_lowercase(&mut self) {
+        self.inner.make_ascii_lowercase()
+    }
+
+    #[inline]
+    pub fn make_ascii_uppercase(&mut self) {
+        self.inner.make_ascii_uppercase()
+    }
+
+    #[inline]
+    pub fn to_ascii_lowercase(&self) -> Buf {
+        Buf { inner: self.inner.to_ascii_lowercase() }
+    }
+
+    #[inline]
+    pub fn to_ascii_uppercase(&self) -> Buf {
+        Buf { inner: self.inner.to_ascii_uppercase() }
+    }
+
+    #[inline]
+    pub fn is_ascii(&self) -> bool {
+        self.inner.is_ascii()
+    }
+
+    #[inline]
+    pub fn eq_ignore_ascii_case(&self, other: &Self) -> bool {
+        self.inner.eq_ignore_ascii_case(&other.inner)
+    }
+}
+
+#[unstable(feature = "clone_to_uninit", issue = "126799")]
+unsafe impl CloneToUninit for Slice {
+    #[inline]
+    #[cfg_attr(debug_assertions, track_caller)]
+    unsafe fn clone_to_uninit(&self, dst: *mut u8) {
+        // SAFETY: we're just a transparent wrapper around [u8]
+        unsafe { self.inner.clone_to_uninit(dst) }
+    }
+}
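
Three techniques from `utf8.rs` can be tried outside the standard library with the standalone sketch below: reinterpreting `&str` as a `#[repr(transparent)]` wrapper, rebuilding an `Arc` around the wrapper type, and using `is_char_boundary` as the boundary check. The `Slice` here is a simplified stand-in for illustration, and `as_str` and `is_boundary` are hypothetical helper names that do not appear in the diff. It uses a pointer cast where the diff uses `mem::transmute`; that is a substitution made for this sketch, not the commit's code.

```rust
use std::sync::Arc;

/// Simplified stand-in for the `Slice` type in the diff.
#[repr(transparent)]
pub struct Slice {
    inner: str,
}

impl Slice {
    /// Views a `&str` as a `&Slice` without copying.
    pub fn from_str(s: &str) -> &Slice {
        // SAFETY: `Slice` is `#[repr(transparent)]` over `str`, so both
        // pointers have the same layout and the same length metadata.
        unsafe { &*(s as *const str as *const Slice) }
    }

    /// Hypothetical accessor: always succeeds because the field is a `str`.
    pub fn as_str(&self) -> &str {
        &self.inner
    }

    /// Hypothetical non-panicking form of the boundary check: an index is a
    /// valid split point only if it falls on a UTF-8 character boundary.
    pub fn is_boundary(&self, index: usize) -> bool {
        self.inner.is_char_boundary(index)
    }

    /// Copies the contents into a new reference-counted allocation.
    pub fn into_arc(&self) -> Arc<Slice> {
        let arc: Arc<str> = Arc::from(&self.inner);
        // SAFETY: the allocation was created as `Arc<str>`, and `Slice` is a
        // transparent wrapper over `str`, so the raw pointer can be
        // reinterpreted and handed back to `Arc::from_raw`.
        unsafe { Arc::from_raw(Arc::into_raw(arc) as *const Slice) }
    }
}
```

With a two-byte character at the front, such as `"\u{e9}!"`, index 1 falls inside the character and is rejected, while indices 0, 2, and 3 (the end of the string) are accepted.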