Crate widestring

A wide string library for converting to and from wide string variants.

This library provides multiple types of wide strings, each corresponding to a string type in the Rust standard library. Utf16String and Utf32String are analogous to the standard String type, providing a similar interface, and are always encoded as valid UTF-16 and UTF-32, respectively. They are the only types in this library that can losslessly and infallibly convert to and from String, and are the easiest types to work with. They are not designed for working with FFI, but do support efficient conversions from the FFI types.
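To illustrate why conversion between String and a guaranteed-valid UTF-16 or UTF-32 string can be lossless and infallible, here is a minimal sketch that uses only the standard library as a stand-in for Utf16String and Utf32String. The function names round_trip_utf16 and round_trip_utf32 are hypothetical and not part of this crate.

```rust
/// Hypothetical helper: encodes `s` as UTF-16 code units and decodes it back.
fn round_trip_utf16(s: &str) -> String {
    let units: Vec<u16> = s.encode_utf16().collect();
    // Decoding cannot fail here because `units` was produced from a valid `str`.
    String::from_utf16(&units).expect("units were encoded from valid UTF-8")
}

/// Hypothetical helper: encodes `s` as UTF-32 values and decodes it back.
fn round_trip_utf32(s: &str) -> String {
    let values: Vec<u32> = s.chars().map(|c| c as u32).collect();
    values
        .iter()
        .map(|&v| char::from_u32(v).expect("values came from valid chars"))
        .collect()
}

fn main() {
    // U+1D11E needs a surrogate pair in UTF-16 but a single value in UTF-32.
    let s = "wide \u{1D11E} string";
    assert_eq!(round_trip_utf16(s), s);
    assert_eq!(round_trip_utf32(s), s);
}
```

The same round trip is not guaranteed for the FFI-oriented types described next, because they may hold data that is not valid UTF.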

U16String and U32String, on the other hand, are similar to (but not the same as) OsString, and are designed around working with FFI. Unlike the UTF variants, these strings do not have a defined encoding, and can work with any wide character strings, regardless of the encoding. They can be converted to and from OsString (which may require an encoding conversion depending on the platform), but OsString uses an OS-specified encoding, so take special care.

U16String and U32String also allow access and mutation that relies on the user to enforce any constraints on the data. Some methods do assume a UTF encoding, but do so in a way that handles malformed encoding data. For FFI, use U16String or U32String when you simply need to pass string data through, or when you’re not dealing with nul-terminated data.

Finally, U16CString and U32CString are wide versions of the standard CString type. Like U16String and U32String, they do not have a defined encoding, but are designed to work with FFI, particularly C-style nul-terminated wide string data. These C-style strings are always terminated with a nul value, and are guaranteed to contain no interior nul values (unless unchecked methods are used). Again, these types may contain ill-formed encoding data, and methods handle it appropriately. Use U16CString or U32CString any time you must properly handle nul values when dealing with wide string C FFI.
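The invariant described above (a trailing nul and no interior nul values) can be sketched with the standard library alone. The helper to_wide_nul_terminated below is hypothetical and only illustrates the check; it is not this crate's API, and it reports the position of the first interior nul as a plain usize rather than a dedicated error type.

```rust
/// Hypothetical helper: encodes `s` as UTF-16 and appends a nul terminator.
/// Returns `Err(position)` if `s` contains an interior nul code unit.
fn to_wide_nul_terminated(s: &str) -> Result<Vec<u16>, usize> {
    let mut units: Vec<u16> = s.encode_utf16().collect();
    // An interior nul would make a C function see a truncated string.
    if let Some(pos) = units.iter().position(|&u| u == 0) {
        return Err(pos);
    }
    units.push(0); // the terminating nul expected by C-style APIs
    Ok(units)
}

fn main() {
    assert_eq!(to_wide_nul_terminated("hi"), Ok(vec![0x68, 0x69, 0]));
    assert_eq!(to_wide_nul_terminated("a\0b"), Err(1));
}
```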

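Like the standard Rust string types, each wide string type has a corresponding wide string slice type, as shown in the following table:

String type    Slice type
Utf16String    Utf16Str
Utf32String    Utf32Str
U16String      U16Str
U32String      U32Str
U16CString     U16CStr
U32CString     U32CStr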

All the string types in this library can be converted to other string types of the same bit width, as well as to appropriate standard Rust types, but these conversions may be lossy and/or require knowledge of the underlying encoding. The UTF strings can additionally be converted between the two sizes of string, re-encoding the data.

§Wide string literals

Macros are provided for each wide string slice type that convert standard Rust str literals into UTF-16 or UTF-32 encoded versions of the slice type at compile time.

use widestring::u16str;
let hello = u16str!("Hello, world!"); // `hello` will be a &U16Str value

These can be used anywhere a const function can be used, and provide a convenient method of specifying wide string literals instead of coding values by hand. The resulting string slices are always valid UTF encoding, and the u16cstr! and u32cstr! macros are automatically nul-terminated.

§Cargo features

This crate supports no_std when default cargo features are disabled. The std and alloc cargo features (enabled by default) enable the owned string types (U16String, U32String, U16CString, U32CString, Utf16String, and Utf32String) and their modules. Other types, such as the string slices, do not require allocation and can be used in a no_std environment, even without the alloc crate.

§Remarks on UTF-16 and UTF-32

UTF-16 encoding is a variable-length encoding. The 16-bit code units can specify Unicode code points either as single units or in surrogate pairs. Because every value might be part of a surrogate pair, many regular string operations on UTF-16 data, including indexing, writing, or even iterating, require considering either one or two values at a time. This library provides safe methods for these operations when the data is known to be UTF-16, such as with Utf16String. In those cases, keep in mind that the number of elements (len()) of the wide string is not equivalent to the number of Unicode code points in the string, but is instead the number of code unit values.
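The difference between code units and code points can be seen with the standard library alone. This is a small sketch, and the helper names utf16_len and code_point_count are hypothetical.

```rust
/// Hypothetical helper: number of UTF-16 code units needed to encode `s`.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Hypothetical helper: number of Unicode scalar values in `s`.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // 'a' fits in one code unit; U+1D11E is encoded as a surrogate pair.
    let s = "a\u{1D11E}";
    assert_eq!(utf16_len(s), 3);
    assert_eq!(code_point_count(s), 2);

    let units: Vec<u16> = s.encode_utf16().collect();
    assert_eq!(units, vec![0x0061, 0xD834, 0xDD1E]);
}
```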

For U16String and U16CString, which do not define an encoding, these same operations (indexing, mutating, iterating) do not take UTF-16 encoding into account and may result in sequences that are ill-formed UTF-16. Some methods are exceptions and treat the strings as potentially malformed UTF-16; their documentation specifies how they handle invalid data.
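As a rough illustration of the two usual ways to handle ill-formed UTF-16, the sketch below uses the standard library's char::decode_utf16 rather than this crate's own methods. The names decode_strict and decode_lossy are hypothetical.

```rust
/// Hypothetical helper: fails on the first unpaired surrogate, returning it.
fn decode_strict(units: &[u16]) -> Result<String, u16> {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.map_err(|e| e.unpaired_surrogate()))
        .collect()
}

/// Hypothetical helper: replaces each unpaired surrogate with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // 0xD800 is a leading surrogate with no trailing surrogate after it.
    let ill_formed = [0x0068, 0xD800, 0x0069];
    assert_eq!(decode_strict(&ill_formed), Err(0xD800));
    assert_eq!(decode_lossy(&ill_formed), "h\u{FFFD}i");

    // A well-formed surrogate pair decodes to a single char either way.
    let well_formed = [0xD834, 0xDD1E];
    assert_eq!(decode_strict(&well_formed), Ok("\u{1D11E}".to_string()));
}
```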

UTF-32 simply encodes Unicode code points as-is in 32-bit Unicode Scalar Values. However, only 21 bits are needed to represent Unicode code points, and UTF-16 surrogates are invalid in UTF-32. Since UTF-32 is a fixed-width encoding, it is much easier to deal with, but methods equivalent to those of the 16-bit strings are provided for compatibility.

All the 32-bit wide strings provide efficient methods to convert to and from sequences of char data, as the representation of UTF-32 strings is functionally equivalent to sequences of chars. Keep in mind that only Utf32String guarantees this equivalence, however, since the other strings may contain invalid values.
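Which 32-bit values are invalid can be checked with the standard char::from_u32, which rejects surrogates and values above U+10FFFF. The helper is_valid_utf32 below is a hypothetical sketch, not a function of this crate.

```rust
/// Hypothetical helper: true if every value is a Unicode scalar value,
/// i.e. the slice could be reinterpreted as a sequence of `char`s.
fn is_valid_utf32(values: &[u32]) -> bool {
    values.iter().all(|&v| char::from_u32(v).is_some())
}

fn main() {
    assert!(is_valid_utf32(&[0x68, 0x1D11E]));
    // UTF-16 surrogates are not valid in UTF-32.
    assert!(!is_valid_utf32(&[0xD800]));
    // Values beyond U+10FFFF are outside the Unicode code space.
    assert!(!is_valid_utf32(&[0x11_0000]));
}
```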

§FFI with C/C++ wchar_t

C/C++’s wchar_t (and C++’s corresponding widestring) varies in size depending on compiler and platform. Typically, wchar_t is 16 bits on Windows and 32 bits on most Unix-based platforms. For convenience when using wchar_t-based FFIs, type aliases for the corresponding string types are provided: WideString aliases U16String on Windows or U32String elsewhere, WideCString aliases U16CString or U32CString, and WideUtfString aliases Utf16String or Utf32String. WideStr, WideCStr, and WideUtfStr are provided for the string slice types. The WideChar alias is also provided, aliasing u16 or u32 depending on platform.
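Platform-dependent aliases of this kind are typically built with conditional compilation. The sketch below shows the general pattern only; the alias name WCharSketch is hypothetical, and this is not necessarily how the crate defines WideChar.

```rust
// Hypothetical alias mirroring the typical wchar_t width per platform.
#[cfg(windows)]
type WCharSketch = u16;
#[cfg(not(windows))]
type WCharSketch = u32;

/// Size in bytes of the sketch alias on the current target.
fn wchar_size() -> usize {
    std::mem::size_of::<WCharSketch>()
}

fn main() {
    let expected = if cfg!(windows) { 2 } else { 4 };
    assert_eq!(wchar_size(), expected);
}
```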

When not interacting with an FFI that uses wchar_t, it is recommended to use the string types directly rather than via the wide aliases.

§Nul values

This crate uses the legacy ASCII term “nul” to refer to Unicode code point U+0000 NULL and its associated code unit representation as zero-value bytes. This is to disambiguate this zero value from null pointer values. C-style strings end in a nul value, while regular Rust strings allow interior nul values and are not terminated with nul.

§Examples

The following example uses U16String to get Windows error messages. FormatMessageW returns the string length, and the error message is not passed into other FFI functions, so there is no need to worry about nul values.

use winapi::um::winbase::{FormatMessageW, LocalFree, FORMAT_MESSAGE_FROM_SYSTEM,
                          FORMAT_MESSAGE_ALLOCATE_BUFFER, FORMAT_MESSAGE_IGNORE_INSERTS};
use winapi::shared::ntdef::LPWSTR;
use winapi::shared::minwindef::HLOCAL;
use std::ptr;
use widestring::U16String;

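// Placeholder for illustration; in practice this comes from GetLastError().
let error_code: u32 = 0;
let s: U16String;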
unsafe {
    // First, get a string buffer from some windows api such as FormatMessageW...
    let mut buffer: LPWSTR = ptr::null_mut();
    let strlen = FormatMessageW(FORMAT_MESSAGE_FROM_SYSTEM |
                                FORMAT_MESSAGE_ALLOCATE_BUFFER |
                                FORMAT_MESSAGE_IGNORE_INSERTS,
                                ptr::null(),
                                error_code, // error code from GetLastError()
                                0,
                                (&mut buffer as *mut LPWSTR) as LPWSTR,
                                0,
                                ptr::null_mut());

    // Get the buffer as a wide string
    s = U16String::from_ptr(buffer, strlen as usize);
    // Since U16String creates an owned copy, it's safe to free original buffer now
    // If you didn't want an owned copy, you could use &U16Str.
    LocalFree(buffer as HLOCAL);
}
// Convert to a regular Rust String and use it to your heart's desire!
let message = s.to_string_lossy();

The following example is functionally the same, only using U16CString instead.

use winapi::um::winbase::{FormatMessageW, LocalFree, FORMAT_MESSAGE_FROM_SYSTEM,
                          FORMAT_MESSAGE_ALLOCATE_BUFFER, FORMAT_MESSAGE_IGNORE_INSERTS};
use winapi::shared::ntdef::LPWSTR;
use winapi::shared::minwindef::HLOCAL;
use std::ptr;
use widestring::U16CString;

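// Placeholder for illustration; in practice this comes from GetLastError().
let error_code: u32 = 0;
let s: U16CString;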
unsafe {
    // First, get a string buffer from some windows api such as FormatMessageW...
    let mut buffer: LPWSTR = ptr::null_mut();
    FormatMessageW(FORMAT_MESSAGE_FROM_SYSTEM |
                   FORMAT_MESSAGE_ALLOCATE_BUFFER |
                   FORMAT_MESSAGE_IGNORE_INSERTS,
                   ptr::null(),
                   error_code, // error code from GetLastError()
                   0,
                   (&mut buffer as *mut LPWSTR) as LPWSTR,
                   0,
                   ptr::null_mut());

    // Get the buffer as a wide string
    s = U16CString::from_ptr_str(buffer);
    // Since U16CString creates an owned copy, it's safe to free original buffer now
    // If you didn't want an owned copy, you could use &U16CStr.
    LocalFree(buffer as HLOCAL);
}
// Convert to a regular Rust String and use it to your heart's desire!
let message = s.to_string_lossy();

Modules§

  • Errors returned by functions in this crate.
  • Iterators for encoding and decoding slices of string data.
  • C-style wide string slices.
  • C-style owned, growable wide strings.
  • Wide string slices with undefined encoding.
  • Owned, growable wide strings with undefined encoding.
  • UTF string slices.
  • Owned, growable UTF strings.

Functions§

  • Creates an iterator over the UTF-16 encoded code points in iter, returning unpaired surrogates as Errs.
  • Creates a lossy decoder iterator over the possibly ill-formed UTF-16 encoded code points in iter.
  • Creates a decoder iterator over UTF-32 encoded code points in iter, returning invalid values as Errs.
  • Creates a lossy decoder iterator over the possibly ill-formed UTF-32 encoded code points in iter.
  • Creates an iterator that encodes an iterator over chars into UTF-8 bytes.
  • Creates an iterator that encodes an iterator over chars into UTF-16 u16 code units.
  • Creates an iterator that encodes an iterator over chars into UTF-32 u32 values.

Type Aliases§

  • WideChar: Alias for u16 or u32 depending on platform. Intended to match typical C wchar_t size on platform.