String API
Recently HelenOS switched to a new string API. It is non-standard, but in many ways similar to the ANSI C string API.
- Character Repertoire and Encoding
- String Metrics
- Functions Operating on Prefixes
- Encoding and Decoding a Character
- Output Buffers
- Function Reference
Character Repertoire and Encoding
HelenOS uses the Universal Character Set (UCS, as defined by ISO/IEC 10646) to represent characters throughout the system. A single character is represented as a wchar_t (32-bit). Strings are normally encoded in UTF-8 and null-terminated, and a string is usually declared as char *. The API also has limited support for strings that are not null-terminated (or sub-strings).
There is also limited support for wide strings, usually declared as wchar_t *. These are encoded in UTF-32 and null-terminated. Wide strings can represent exactly the same characters as normal strings. The difference is in the encoding: UTF-8 encodes each character as one to four bytes, whereas UTF-32, which is used for wide strings, encodes each character as exactly four bytes. Both wide characters and wide strings use the natural byte order (little-endian on little-endian platforms, big-endian on big-endian platforms). Wide strings should never start with the "byte order mark" (BOM) character; the byte order is implicit.
Character and String Literals
In source code, non-ASCII characters should only be used in character and string literals. Keep in mind that HelenOS source files are encoded in UTF-8, too. Non-ASCII character literals need to be written as L'x'. String literals are written the usual way ("string") and wide-string literals are written as L"wide string".
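The three literal forms can be compared side by side. The sketch below is illustrative only: it writes the character U+017E with a universal character name so that the example does not depend on how the file is saved, and it assumes a compiler that uses UTF-8 for narrow strings and a 32-bit wchar_t (as described above). The variable and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* U+017E (LATIN SMALL LETTER Z WITH CARON) written in the three forms. */
static const wchar_t wide_char = L'\u017E';   /* character literal */
static const char *narrow_str = "\u017E";     /* string literal (UTF-8) */
static const wchar_t *wide_str = L"\u017E";   /* wide-string literal */

/* Bytes used by the narrow string, excluding the null terminator. */
static size_t narrow_bytes(void)
{
    return strlen(narrow_str);
}

/* Bytes used by the wide string, excluding the null terminator. */
static size_t wide_bytes(void)
{
    /* Both wide literals denote the same character. */
    assert(wide_str[0] == wide_char);
    return wcslen(wide_str) * sizeof(wchar_t);
}
```

Under the stated assumptions the narrow string occupies two bytes and the wide string four, which is the UTF-8 versus UTF-32 difference described earlier.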
String Metrics
Unlike with an 8-bit encoding, there is not a 1:1:1 mapping between bytes in memory, characters and display cells on a monospace display. Therefore, three different metrics are needed:
- Size is the number of bytes to which the string is encoded, excluding the null terminator.
- Length is the number of characters in the string (i.e. the number of times we need to call str_decode() or chr_encode()). Again the null terminator is not counted.
- Width is the number of display cells on a monospace display the string will be rendered to.
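The difference between size and length can be sketched in a few lines of C. The functions below are hypothetical stand-ins (they are not the HelenOS implementations) and assume a well-formed, null-terminated UTF-8 string. Width is left out because it needs per-character display data.

```c
#include <assert.h>
#include <stddef.h>

/* Size: number of bytes before the null terminator. */
static size_t sketch_size(const char *str)
{
    assert(str != NULL);
    size_t size = 0;
    while (str[size] != '\0')
        size++;
    return size;
}

/*
 * Length: number of characters before the null terminator.
 * In well-formed UTF-8 every character contains exactly one byte that
 * is not a continuation byte (10xxxxxx), so counting those bytes
 * counts characters.
 */
static size_t sketch_length(const char *str)
{
    assert(str != NULL);
    size_t length = 0;
    for (size_t i = 0; str[i] != '\0'; i++) {
        if (((unsigned char) str[i] & 0xC0) != 0x80)
            length++;
    }
    return length;
}
```

For the three bytes 'x', 0xC5, 0xBE (the letter x followed by U+017E), the size is 3 and the length is 2.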
Encoding and Decoding a Character
- str_decode()
- chr_encode()
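To make the idea of decoding and encoding one character at a time concrete, here is a minimal sketch of a UTF-8 decoder and encoder. The names sketch_decode() and sketch_encode(), the offset-based calling style, and the use of U+FFFD as an error value are all assumptions made for this illustration; the actual signatures of str_decode() and chr_encode() should be taken from the HelenOS sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID 0xFFFDu  /* assumed error value (U+FFFD) */

/*
 * Decode one character starting at str[*offset], never reading at or
 * past index size. Advances *offset past the bytes consumed.
 * Returns 0 when *offset has reached size.
 */
static uint32_t sketch_decode(const char *str, size_t *offset, size_t size)
{
    assert(str != NULL && offset != NULL);
    if (*offset >= size)
        return 0;

    unsigned char b0 = (unsigned char) str[(*offset)++];
    unsigned int cont;  /* continuation bytes still expected */
    uint32_t ch;        /* code point being assembled */
    uint32_t min;       /* smallest value allowed for this length */

    if (b0 < 0x80) {
        return b0;
    } else if ((b0 & 0xE0) == 0xC0) {
        cont = 1; ch = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        cont = 2; ch = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        cont = 3; ch = b0 & 0x07; min = 0x10000;
    } else {
        return SKETCH_INVALID;  /* stray continuation or bad lead byte */
    }

    while (cont-- > 0) {
        if (*offset >= size)
            return SKETCH_INVALID;  /* sequence is cut off */
        unsigned char b = (unsigned char) str[*offset];
        if ((b & 0xC0) != 0x80)
            return SKETCH_INVALID;
        (*offset)++;
        ch = (ch << 6) | (b & 0x3F);
    }

    /* Reject overlong forms, surrogates and out-of-range values. */
    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        return SKETCH_INVALID;
    return ch;
}

/*
 * Encode ch at str[*offset], never writing at or past index size.
 * Returns 0 on success, -1 if ch is invalid or does not fit; on
 * failure *offset is left unchanged.
 */
static int sketch_encode(uint32_t ch, char *str, size_t *offset, size_t size)
{
    assert(str != NULL && offset != NULL);
    static const unsigned char lead[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
    size_t n;  /* bytes needed for ch */

    if (ch < 0x80)
        n = 1;
    else if (ch < 0x800)
        n = 2;
    else if (ch < 0x10000)
        n = 3;
    else if (ch <= 0x10FFFF)
        n = 4;
    else
        return -1;
    if (ch >= 0xD800 && ch <= 0xDFFF)
        return -1;
    if (*offset > size || size - *offset < n)
        return -1;

    /* Fill continuation bytes from the last one backwards. */
    for (size_t i = n - 1; i > 0; i--) {
        str[*offset + i] = (char) (0x80 | (ch & 0x3F));
        ch >>= 6;
    }
    str[*offset] = (char) (lead[n] | ch);
    *offset += n;
    return 0;
}
```

Calling the decoder in a loop until the offset reaches the string size visits each character once, which is how the length metric above is defined.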
Well-formed Strings
A string is considered well-formed if and only if it is null-terminated and consists only of complete and valid UTF-8-encoded characters (i.e. it can be decoded with str_decode() without error). Unless stated otherwise, all strings passed to functions must be well-formed and all string functions produce well-formed strings.
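The definition above can be turned into a check. The following is a sketch of such a validator, not HelenOS code: sketch_well_formed() is a hypothetical name, and the exact set of sequences rejected (overlong forms, surrogates, values above U+10FFFF) follows the usual UTF-8 rules, which is an assumption about what str_decode() treats as an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if buf (buf_size bytes long) holds a null-terminated
 * string made only of complete, valid UTF-8 sequences.
 */
static bool sketch_well_formed(const char *buf, size_t buf_size)
{
    assert(buf != NULL);
    const unsigned char *p = (const unsigned char *) buf;
    size_t i = 0;

    while (i < buf_size) {
        unsigned char b0 = p[i];
        size_t n;           /* total bytes in this sequence */
        unsigned long ch;   /* code point being assembled */
        unsigned long min;  /* smallest value allowed for n bytes */

        if (b0 == 0)
            return true;    /* reached the terminator */
        if (b0 < 0x80) {
            i++;
            continue;
        } else if ((b0 & 0xE0) == 0xC0) {
            n = 2; ch = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            n = 3; ch = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            n = 4; ch = b0 & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation or bad lead byte */
        }

        if (buf_size - i < n)
            return false;   /* sequence runs past the buffer */
        for (size_t k = 1; k < n; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return false;
            ch = (ch << 6) | (p[i + k] & 0x3F);
        }
        if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
            return false;
        i += n;
    }
    return false;  /* no terminator inside the buffer */
}
```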
Output Buffers
Whenever the user supplies an output buffer to a string function, they must also pass the size of this buffer to the function (it is always passed in the following argument). The buffer size must be greater than zero. The function will always fill the buffer with a well-formed string. If the string produced does not fit in the buffer, the function will only store as many (complete) characters as possible and add the null terminator.
This arrangement makes it much easier to avoid accidental buffer overruns. It does not mean, however, that you do not need to check your string sizes carefully. If your string does not fit, it will not cause memory corruption; it will be cropped instead. Cropping can still lead to a serious bug in some situations, although it will hopefully be easier to detect than memory corruption.
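The cropping behavior described above can be illustrated with a small copy routine. This is a sketch under stated assumptions: sketch_cpy() is a hypothetical function, the source string is assumed to be well-formed, and the buffer size directly follows the buffer argument as the text describes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dest (dest_size bytes), storing only whole UTF-8
 * characters and always adding the null terminator.
 */
static void sketch_cpy(char *dest, size_t dest_size, const char *src)
{
    assert(dest != NULL && src != NULL);
    assert(dest_size > 0);  /* room for at least the terminator */

    size_t i = 0;
    while (src[i] != '\0') {
        /* Find the end of the character that starts at index i. */
        size_t next = i + 1;
        while (((unsigned char) src[next] & 0xC0) == 0x80)
            next++;
        /* Stop if this character plus the terminator would not fit. */
        if (next > dest_size - 1)
            break;
        for (; i < next; i++)
            dest[i] = src[i];
    }
    dest[i] = '\0';
}
```

Copying the bytes 'a', 0xC5, 0xBE into a three-byte buffer stores only "a": the two-byte character would leave no room for the terminator, so it is dropped whole rather than split. A caller that cannot tolerate cropping still has to compare sizes before or after the call.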
Function Reference
Some functions operate on string prefixes. These have a name like str_[n|l|w]op(). Such a function only uses a prefix of the string limited by a metric: n for size, l for length, w for width.
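As an illustration of the naming scheme, the sketch below shows one "l" (length-limited) and one "n" (size-limited) operation. The functions are hypothetical stand-ins written for this page, not the HelenOS implementations, and they assume a well-formed, null-terminated string.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes (10xxxxxx). */
static int is_cont(char c)
{
    return ((unsigned char) c & 0xC0) == 0x80;
}

/* "l" variant: size in bytes of the prefix of at most max_len characters. */
static size_t sketch_lsize(const char *str, size_t max_len)
{
    assert(str != NULL);
    size_t size = 0;
    size_t len = 0;
    while (str[size] != '\0' && len < max_len) {
        size++;                      /* lead byte */
        while (is_cont(str[size]))   /* its continuation bytes */
            size++;
        len++;
    }
    return size;
}

/* "n" variant: complete characters within the first max_size bytes. */
static size_t sketch_nlength(const char *str, size_t max_size)
{
    assert(str != NULL);
    size_t i = 0;
    size_t len = 0;
    while (i < max_size && str[i] != '\0') {
        size_t next = i + 1;
        while (is_cont(str[next]))
            next++;
        if (next > max_size)
            break;  /* character is cut off by the size limit */
        i = next;
        len++;
    }
    return len;
}
```

For the bytes 'a', 0xC5, 0xBE, 'z' (three characters in four bytes), the first two characters occupy three bytes, while the first two bytes contain only one complete character.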
- str_size()
- wstr_size()
- str_lsize()
- wstr_lsize()
- str_length()
- wstr_length()
- str_nlength()
- wstr_nlength()
- str_cpy()
- str_ncpy()
- str_append()
- str_dup()
- wstr_nstr()
- str_chr()
- str_rchr()