Crate widestring
A wide string library for converting to and from wide string variants.
This library provides multiple types of wide strings, each corresponding to a string type in
the Rust standard library. Utf16String and Utf32String are analogous to the standard String
type, providing a similar interface, and are always encoded as valid UTF-16 and UTF-32,
respectively. They are the only types in this library that can losslessly and infallibly
convert to and from String, and are the easiest types to work with. They are not designed for
working with FFI, but do support efficient conversions from the FFI types.
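To make "lossless and infallible" concrete, here is a minimal sketch that uses only the standard library rather than this crate's types. The helper names `to_utf16` and `from_utf16` are hypothetical. Encoding a valid `str` as UTF-16 cannot fail, and decoding those same code units always returns the original text; only ill-formed input (such as a lone surrogate) is rejected.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper (std only, not this crate's API):
/// encodes a Rust string slice as UTF-16 code units. This cannot fail.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical helper: decodes UTF-16 code units into a `String`,
/// returning an error if the input is not well-formed UTF-16.
fn from_utf16(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

Round-tripping any `&str` through `to_utf16` and then `from_utf16` yields the original string, which is the property the UTF string types rely on.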
U16String and U32String, on the other hand, are similar to (but not the same as) OsString,
and are designed around working with FFI. Unlike the UTF variants, these strings do not have a
defined encoding, and can work with any wide character strings, regardless of the encoding.
They can be converted to and from OsString (which may require an encoding conversion depending
on the platform), but OsString uses an OS-specified encoding, so take special care.
U16String and U32String also allow access and mutation that relies on the user to enforce any
constraints on the data. Some methods do assume a UTF encoding, but do so in a way that handles
malformed encoding data. For FFI, use U16String or U32String when you simply need to pass
through string data, or when you’re not dealing with nul-terminated data.
Finally, U16CString and U32CString are wide versions of the standard CString type.
Like U16String and U32String, they do not have a defined encoding, but are designed to
work with FFI, particularly C-style nul-terminated wide string data. These C-style strings are
always terminated in a nul value, and are guaranteed to contain no interior nul values (unless
unchecked methods are used). Again, these types may contain ill-formed encoding data, and
methods handle it appropriately. Use U16CString or U32CString anytime you must properly
handle nul values when dealing with wide string C FFI.
Like the standard Rust string types, each wide string type has its corresponding wide string slice type, as shown in the following table:
String Type | Slice Type |
---|---|
Utf16String | Utf16Str |
Utf32String | Utf32Str |
U16String | U16Str |
U32String | U32Str |
U16CString | U16CStr |
U32CString | U32CStr |
All the string types in this library can be converted to other string types of the same bit width, as well as to appropriate standard Rust types, though such conversions may be lossy and/or require knowledge of the underlying encoding. The UTF strings can additionally be converted between the two sizes of string, which re-encodes the data.
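As an illustration of what re-encoding between the two widths involves, the following sketch uses only the standard library (not this crate's conversion methods). The function name `utf16_to_utf32` is hypothetical. Each UTF-16 unit or surrogate pair is decoded to a `char`, which is then widened to a 32-bit value; an unpaired surrogate makes the conversion fail.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Hypothetical helper (std only): re-encodes UTF-16 code units as UTF-32.
/// Returns an error on the first unpaired surrogate.
fn utf16_to_utf32(units: &[u16]) -> Result<Vec<u32>, DecodeUtf16Error> {
    decode_utf16(units.iter().copied())
        // Each successfully decoded `char` is a Unicode scalar value,
        // so widening it to `u32` gives a valid UTF-32 code unit.
        .map(|result| result.map(u32::from))
        .collect()
}
```

Note that the output can be shorter than the input: a surrogate pair (two 16-bit units) becomes a single 32-bit unit.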
§Wide string literals
Macros are provided for each wide string slice type that convert standard Rust str literals
into UTF-16 or UTF-32 encoded versions of the slice type at compile time.
use widestring::u16str;
let hello = u16str!("Hello, world!"); // `hello` will be a &U16Str value
These can be used anywhere a const function can be used, and provide a convenient method of
specifying wide string literals instead of coding values by hand. The resulting string slices
are always valid UTF encoding, and the slices produced by the u16cstr! and u32cstr! macros are
automatically nul-terminated.
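To give a feel for what compile-time encoding means, here is a rough sketch in plain Rust. It is not how this crate's macros are implemented; the `const fn` name `ascii_to_utf16` and the constant `HELLO` are hypothetical, and the helper is deliberately limited to ASCII input, where each byte maps directly to one UTF-16 code unit.

```rust
/// Hypothetical helper: encodes an ASCII-only string as UTF-16 code units.
/// Because it is a `const fn`, it can be evaluated at compile time.
const fn ascii_to_utf16<const N: usize>(s: &str) -> [u16; N] {
    let bytes = s.as_bytes();
    assert!(bytes.len() == N, "output length must match input length");
    let mut out = [0u16; N];
    let mut i = 0;
    while i < N {
        // ASCII bytes (0x00..=0x7F) have the same value in UTF-16.
        assert!(bytes[i] < 0x80, "only ASCII input is supported");
        out[i] = bytes[i] as u16;
        i += 1;
    }
    out
}

/// Evaluated by the compiler; no encoding work happens at run time.
const HELLO: [u16; 5] = ascii_to_utf16("Hello");
```

A non-ASCII literal or a wrong length would fail during constant evaluation, which is the main benefit of doing the work at compile time instead of writing code unit values by hand.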
§Cargo features
This crate supports no_std when default cargo features are disabled. The std and alloc
cargo features (enabled by default) enable the owned string types (U16String, U32String,
U16CString, U32CString, Utf16String, and Utf32String) and their modules.
Other types such as the string slices do not require allocation and can be used in a no_std
environment, even without the alloc crate.
§Remarks on UTF-16 and UTF-32
UTF-16 encoding is a variable-length encoding. The 16-bit code units can specify Unicode code
points either as single units or in surrogate pairs. Because every value might be part of a
surrogate pair, many regular string operations on UTF-16 data, including indexing, writing, or
even iterating, require considering either one or two values at a time. This library provides
safe methods for these operations when the data is known to be UTF-16, such as with
Utf16String. In those cases, keep in mind that the number of elements (len()) of the
wide string is not equivalent to the number of Unicode code points in the string, but is
instead the number of code unit values.
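The difference between code units and code points can be seen with the standard library alone. The two helper names below are hypothetical and are not part of this crate.

```rust
/// Hypothetical helper (std only): number of UTF-16 code units needed for `s`.
/// This corresponds to what a UTF-16 string's `len()` reports.
fn utf16_code_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Hypothetical helper: number of Unicode code points (scalar values) in `s`.
fn code_points(s: &str) -> usize {
    s.chars().count()
}
```

For the input `"a😀"`, `code_points` returns 2 while `utf16_code_units` returns 3, because U+1F600 lies outside the Basic Multilingual Plane and is encoded as a surrogate pair.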
For U16String and U16CString, which do not define an encoding, these same operations
(indexing, mutating, iterating) do not take into account UTF-16 encoding and may result in
sequences that are ill-formed UTF-16. Some methods make an exception to this and treat the
strings as possibly malformed UTF-16; their documentation specifies how they handle the
invalid data.
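The sketch below shows how an encoding-unaware operation can produce ill-formed UTF-16, and the two usual ways of handling the result. It relies only on the standard library; `first_units` is a hypothetical helper, not a method of this crate.

```rust
/// Hypothetical helper (std only): keeps the first `n` UTF-16 code units of `s`
/// without checking whether the cut falls inside a surrogate pair.
fn first_units(s: &str, n: usize) -> Vec<u16> {
    s.encode_utf16().take(n).collect()
}

/// Strict decoding: fails if the units are not well-formed UTF-16.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: replaces each unpaired surrogate with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Cutting `"😀"` after one code unit leaves a lone high surrogate: strict decoding rejects it, while lossy decoding substitutes the replacement character.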
UTF-32 simply encodes Unicode code points as-is in 32-bit code units, but Unicode code points only occupy 21 bits, and the code points reserved for UTF-16 surrogates are invalid in UTF-32. Since UTF-32 is a fixed-width encoding, it is much easier to deal with, but equivalent methods to the 16-bit strings are provided for compatibility.
All the 32-bit wide strings provide efficient methods to convert to and from sequences of
char data, as the representation of UTF-32 strings is functionally equivalent to sequences
of chars. Keep in mind that only Utf32String guarantees this equivalence, however, since
the other strings may contain invalid values.
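The relationship between 32-bit units and `char` can be sketched with the standard library. The function names are hypothetical and do not reflect this crate's API. `char::from_u32` rejects exactly the values that are not valid UTF-32: surrogates and anything above U+10FFFF.

```rust
/// Hypothetical helper (std only): strict conversion that fails on any
/// value that is not a Unicode scalar value.
fn utf32_to_string(units: &[u32]) -> Option<String> {
    units.iter().map(|&u| char::from_u32(u)).collect()
}

/// Hypothetical helper: lossy conversion that substitutes U+FFFD
/// for each invalid value.
fn utf32_to_string_lossy(units: &[u32]) -> String {
    units
        .iter()
        .map(|&u| char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

This is why a string type without a defined encoding cannot promise a cheap view as `char`s: any element may need to be checked first.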
§FFI with C/C++ wchar_t
C/C++’s wchar_t (and C++’s corresponding std::wstring) varies in size depending on compiler
and platform. Typically, wchar_t is 16 bits on Windows and 32 bits on most Unix-based
platforms. For convenience when using wchar_t-based FFIs, type aliases for the corresponding
string types are provided: WideString aliases U16String on Windows or U32String elsewhere,
WideCString aliases U16CString or U32CString, and WideUtfString aliases Utf16String or
Utf32String. WideStr, WideCStr, and WideUtfStr are provided for the string slice types. The
WideChar alias is also provided, aliasing u16 or u32 depending on platform.
When not interacting with an FFI that uses wchar_t, it is recommended to use the string types
directly rather than via the wide alias.
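A platform-dependent alias of this kind can be expressed with conditional compilation. The sketch below is only an illustration of the idea; the names `WChar` and `wchar_size` are hypothetical, and the crate's actual alias definitions may use different conditions.

```rust
/// Hypothetical alias: 16-bit wide characters on Windows.
#[cfg(windows)]
type WChar = u16;

/// Hypothetical alias: 32-bit wide characters elsewhere.
#[cfg(not(windows))]
type WChar = u32;

/// Size in bytes of the platform's wide character as modeled above.
fn wchar_size() -> usize {
    std::mem::size_of::<WChar>()
}
```

Code written against such an alias compiles on both kinds of platform, but it must not assume a particular width, which is one reason to prefer the explicitly sized types when wchar_t is not involved.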
§Nul values
This crate uses the legacy ASCII term “nul” to refer to Unicode code point U+0000 NULL
and its associated code unit representation as a zero value. This is to disambiguate this
zero value from null pointer values. C-style strings end in a nul value, while regular Rust
strings allow interior nul values and are not terminated with nul.
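The two rules for C-style strings (a trailing nul, and no interior nul) can be sketched over raw `u16` buffers using only the standard library. Both helper names are hypothetical and are not this crate's API.

```rust
/// Hypothetical helper (std only): appends a nul terminator, refusing input
/// that already contains a nul, since C code would stop reading at it.
fn to_nul_terminated(units: &[u16]) -> Option<Vec<u16>> {
    if units.contains(&0) {
        return None;
    }
    let mut out = Vec::with_capacity(units.len() + 1);
    out.extend_from_slice(units);
    out.push(0);
    Some(out)
}

/// Hypothetical helper: returns the part of `buf` before the first nul,
/// or the whole buffer if it contains no nul.
fn until_nul(buf: &[u16]) -> &[u16] {
    let end = buf.iter().position(|&u| u == 0).unwrap_or(buf.len());
    &buf[..end]
}
```

`until_nul` mirrors how C reads a wide string: everything after the first nul is silently ignored, which is why interior nul values have to be rejected or truncated before crossing the FFI boundary.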
§Examples
The following example uses U16String to get Windows error messages. FormatMessageW returns
the string length for us, and the error message is not passed into other FFI functions, so
there is no need to worry about nul values.
use winapi::um::winbase::{FormatMessageW, LocalFree, FORMAT_MESSAGE_FROM_SYSTEM,
                          FORMAT_MESSAGE_ALLOCATE_BUFFER, FORMAT_MESSAGE_IGNORE_INSERTS};
use winapi::shared::ntdef::LPWSTR;
use winapi::shared::minwindef::HLOCAL;
use std::ptr;
use widestring::U16String;
let s: U16String;
unsafe {
    // First, get a string buffer from some windows api such as FormatMessageW...
    let mut buffer: LPWSTR = ptr::null_mut();
    let strlen = FormatMessageW(FORMAT_MESSAGE_FROM_SYSTEM |
                                FORMAT_MESSAGE_ALLOCATE_BUFFER |
                                FORMAT_MESSAGE_IGNORE_INSERTS,
                                ptr::null(),
                                error_code, // error code from GetLastError()
                                0,
                                (&mut buffer as *mut LPWSTR) as LPWSTR,
                                0,
                                ptr::null_mut());
    // Get the buffer as a wide string
    s = U16String::from_ptr(buffer, strlen as usize);
    // Since U16String creates an owned copy, it's safe to free original buffer now
    // If you didn't want an owned copy, you could use &U16Str.
    LocalFree(buffer as HLOCAL);
}
// Convert to a regular Rust String and use it to your heart's desire!
let message = s.to_string_lossy();
The following example is functionally the same, only using U16CString instead.
use winapi::um::winbase::{FormatMessageW, LocalFree, FORMAT_MESSAGE_FROM_SYSTEM,
                          FORMAT_MESSAGE_ALLOCATE_BUFFER, FORMAT_MESSAGE_IGNORE_INSERTS};
use winapi::shared::ntdef::LPWSTR;
use winapi::shared::minwindef::HLOCAL;
use std::ptr;
use widestring::U16CString;
let s: U16CString;
unsafe {
    // First, get a string buffer from some windows api such as FormatMessageW...
    let mut buffer: LPWSTR = ptr::null_mut();
    FormatMessageW(FORMAT_MESSAGE_FROM_SYSTEM |
                   FORMAT_MESSAGE_ALLOCATE_BUFFER |
                   FORMAT_MESSAGE_IGNORE_INSERTS,
                   ptr::null(),
                   error_code, // error code from GetLastError()
                   0,
                   (&mut buffer as *mut LPWSTR) as LPWSTR,
                   0,
                   ptr::null_mut());
    // Get the buffer as a wide string
    s = U16CString::from_ptr_str(buffer);
    // Since U16CString creates an owned copy, it's safe to free original buffer now
    // If you didn't want an owned copy, you could use &U16CStr.
    LocalFree(buffer as HLOCAL);
}
// Convert to a regular Rust String and use it to your heart's desire!
let message = s.to_string_lossy();
Re-exports§
pub use ucstr::U16CStr;
pub use ucstr::U32CStr;
pub use ucstr::WideCStr;
pub use ucstring::U16CString;
pub use ucstring::U32CString;
pub use ucstring::WideCString;
pub use ustr::U16Str;
pub use ustr::U32Str;
pub use ustr::WideStr;
pub use ustring::U16String;
pub use ustring::U32String;
pub use ustring::WideString;
pub use utfstr::Utf16Str;
pub use utfstr::Utf32Str;
pub use utfstr::WideUtfStr;
pub use utfstring::Utf16String;
pub use utfstring::Utf32String;
pub use utfstring::WideUtfString;
Modules§
- Errors returned by functions in this crate.
- Iterators for encoding and decoding slices of string data.
- C-style wide string slices.
- C-style owned, growable wide strings.
- Wide string slices with undefined encoding.
- Owned, growable wide strings with undefined encoding.
- UTF string slices.
- Owned, growable UTF strings.
Macros§
- Includes a UTF-16 encoded file as a Utf16Str.
- Converts a string literal into a const UTF-16 string slice of type U16CStr.
- Converts a string literal into a const UTF-16 string slice of type U16Str.
- Converts a string literal into a const UTF-32 string slice of type U32CStr.
- Converts a string literal into a const UTF-32 string slice of type U32Str.
- Converts a string literal into a const UTF-16 string slice of type Utf16Str.
- Converts a string literal into a const UTF-32 string slice of type Utf32Str.
- Alias for utf16str or utf32str macros depending on platform. Intended to be used when using the WideUtfStr type alias.
Functions§
- Creates an iterator over the UTF-16 encoded code points in iter, returning unpaired surrogates as Errs.
- Creates a lossy decoder iterator over the possibly ill-formed UTF-16 encoded code points in iter.
- Creates a decoder iterator over UTF-32 encoded code points in iter, returning invalid values as Errs.
- Creates a lossy decoder iterator over the possibly ill-formed UTF-32 encoded code points in iter.
- Creates an iterator that encodes an iterator over chars into UTF-8 bytes.