Roundhouse Consulting, Ltd.

Unicode will have much better support in the next version of the C++ standard (currently targeted for 2009) than it does now. There will be two new character types to support UTF-16 and UTF-32 character representations, along with required support for these character types in several of the current standard template classes. There will also be functions for converting between multibyte sequences and these new character types, and a template class that will simplify using codecvt facets for converting between strings. XML addicts looking for SAX and DOM won’t be satisfied with these additions, but these changes are intended to provide tools for implementing more powerful libraries like those.

The two new types are char16_t and char32_t. As their names suggest, they’re intended to represent 16-bit and 32-bit characters. Formally, they’re distinct unsigned integral types, with the same size and alignment as the types std::uint_least16_t and std::uint_least32_t, respectively (these uint types are defined in the header <cstdint>, which is also new). Of course, you need to be able to define character and string literals of these types, so the new standard will introduce the prefixes 'u' and 'U' to describe those literals. u'a' is a character literal of type char16_t, U'a' is a character literal of type char32_t, u"ab" is a string literal of type const char16_t[3], and U"ab" is a string literal of type const char32_t[3].
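The properties described above can be checked at compile time. The following is a minimal sketch, assuming a compiler that implements the standard as it was eventually published (C++11), where char16_t and char32_t are keywords and so are written without std::. The helper array_extent is a hypothetical name introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// The prefix on a character literal selects its type.
static_assert(std::is_same<decltype(u'a'), char16_t>::value, "u'a' is a char16_t");
static_assert(std::is_same<decltype(U'a'), char32_t>::value, "U'a' is a char32_t");

// A string literal is an array of const elements; the count includes the
// terminating null character, so "ab" has three elements.
static_assert(std::is_same<decltype(u"ab"), const char16_t (&)[3]>::value,
              "u\"ab\" is an array of 3 const char16_t");
static_assert(std::is_same<decltype(U"ab"), const char32_t (&)[3]>::value,
              "U\"ab\" is an array of 3 const char32_t");

// Same size as the least-width unsigned types, but still distinct types.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "distinct types");
static_assert(std::is_unsigned<char16_t>::value, "char16_t is unsigned");

// Hypothetical helper (not part of the standard library): reports the number
// of elements in an array, here used on string literals.
template <class CharT, std::size_t N>
constexpr std::size_t array_extent(const CharT (&)[N]) {
    return N;
}
```

Because the types are distinct, overloads on char16_t and std::uint_least16_t do not collide, which is what lets the library specialize templates for the new character types.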

Whenever you think of character strings you ought to think of std::basic_string. The new standard will add the template specializations std::char_traits<char16_t> and std::char_traits<char32_t>, so you can create std::basic_string<char16_t> and std::basic_string<char32_t> objects. The latter two types are expected to be used often, so the standard will also require typedef names for them: std::u16string and std::u32string.
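A short sketch of the string types in use, assuming a C++11 compiler. The function names join_words and code_unit_count are hypothetical, chosen only for this example. It also shows one thing worth keeping in mind: size() counts code units, not characters, so a character outside the Basic Multilingual Plane occupies two elements of a std::u16string and one element of a std::u32string.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// The typedef names are simply shorthand for the basic_string specializations.
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value,
              "u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value,
              "u32string is basic_string<char32_t>");

// Hypothetical helper: the usual basic_string operations (here, operator+)
// work the same way they do for std::string.
std::u16string join_words(const std::u16string& first, const std::u16string& second) {
    return first + u' ' + second;
}

// Hypothetical helper: number of code units (array elements) in a string.
template <class String>
std::size_t code_unit_count(const String& text) {
    return text.size();
}
```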

Internally, then, your program can (and should) use characters encoded in UTF-16 or UTF-32. Externally, if you're required to use a particular character encoding for your files, there are two new specializations of the locale facet codecvt to translate UTF-16 and UTF-32 to and from various encodings. Their names are std::codecvt<char16_t, char, mbstate_t> and std::codecvt<char32_t, char, mbstate_t>. You can use them with std::fstream objects by creating an std::locale object with the appropriate codecvt facet and imbuing your stream object with that locale object. More specifically, though, if you need to read and write files encoded in UTF-8 or UTF-16, you can use the facets std::codecvt_utf8 and std::codecvt_utf16, respectively.
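Here is a rough sketch of the imbue technique described above, not a definitive recipe. It assumes a library that ships the <codecvt> header (the header was deprecated in C++17, so newer compilers may warn), and it uses wchar_t streams because those are the file streams the standard library provides ready-made. The function name read_utf8_file is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads an entire UTF-8 encoded file into a wide string.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    // Imbue before opening, so the facet is in place before any bytes are read.
    // The locale object takes ownership of the facet allocated with new.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path.c_str());
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

Writing works the same way with std::wofstream: imbue the stream with a locale holding the facet, then insert wide characters as usual.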

If you need to convert between character encodings within your application, instead of on input or output, use the new templates std::wstring_convert and std::wbuffer_convert. Each can be instantiated with a codecvt facet. wstring_convert applies the conversion defined by its codecvt facet to convert between a byte string (std::string) and a std::basic_string that holds wider characters. Similarly, wbuffer_convert wraps a byte-oriented stream buffer and presents it as a stream buffer of wider characters, converting as characters are read and written.
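As an illustration of in-memory conversion, here is a minimal sketch that pairs std::wstring_convert with std::codecvt_utf8<char32_t> to go between a UTF-8 std::string and a std::u32string. It assumes a library that still provides these facilities (they were deprecated in C++17). The function names utf8_to_utf32 and utf32_to_utf8 are hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// With no error string supplied to the converter, from_bytes throws
// std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(bytes);
}

// Hypothetical helper: encodes UTF-32 code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(text);
}
```

The converter object owns its facet, so there is no locale object to set up; that is the simplification the template offers over using a codecvt facet directly.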

November 12, 2007

Exceptions have been a hot topic in newsgroup discussions over the past few days, mostly because of confusion over what the word means. The C++ language definition gives it a narrow meaning: Code in your program throws and catches exceptions, using the keywords throw, try, and catch. Specifications for floating-point math give it a narrow but different meaning: Various floating-point operations on invalid operands produce exceptions, which ordinarily result in special values such as NaN and Infinity, and sometimes cause calls to trap handlers. Informal usage gives it a rather broad meaning, including such things as dividing an integer value by 0 and dereferencing a null pointer. To further confuse things, some compilers turn coding errors in that third category into C++ exceptions.

In particular, most compilers for Windows followed Microsoft’s lead in integrating the operating system’s structured exception handling into C++ exception handling, with the result that dereferencing a null pointer results in an exception that can be caught with catch(...). Although Microsoft’s online documentation now warns you away from this technique, many Windows programmers still think that this is how C++ exceptions are supposed to work.

Code that catches exceptions that result from dereferencing a null pointer is, obviously, not portable. But there’s a more serious problem: when an exception can come from just about any point in your code, it’s much more difficult to write your code so that it works correctly when an exception occurs. Looking through your source code for throw statements is no longer sufficient. You have to assume that just about any statement might throw an exception, and that means you can’t separate code that throws exceptions from code that doesn’t.

Of course, a program that uses code that dereferences a null pointer probably won’t give correct results, so maybe it doesn’t really matter whether data structures get corrupted because of the null pointer itself or because the null pointer causes an exception that your code can’t deal with correctly. Still, dereferencing a null pointer is the result of a coding error. Most applications should react to coding errors by shutting down as gracefully as possible -- it rarely makes sense to try to continue execution after such an error. While it’s possible to limit your use of exceptions to forcing a program to shut down, by doing that you lose much of the power of exceptions.

In general, don’t use exceptions to manage flow control, and don't use them to signal coding errors. There are better ways of doing those things. Use exceptions for what they were designed to do: recovering from invalid data, signalling the absence of a critical resource, and reporting any other failure that’s beyond your control.
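One way to picture that division of labor is sketched below. It is an illustration under assumptions, not a prescription: the function names parse_port and parse_port_or are hypothetical, and using assert for the coding-error case is just one of the possible "better ways". Invalid data from outside the program is reported with an exception; a null pointer passed by the caller is treated as a bug and stops the program in a debug build.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts user-supplied text to a TCP port number.
// Bad input is beyond the program's control, so it is reported by throwing.
int parse_port(const std::string& text) {
    if (text.empty() || text.size() > 5)
        throw std::runtime_error("port must have 1 to 5 digits");
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9')
            throw std::runtime_error("port must contain only digits");
        value = value * 10 + (c - '0');
    }
    if (value == 0 || value > 65535)
        throw std::runtime_error("port must be in the range 1 to 65535");
    return value;
}

// Hypothetical helper: recovers from invalid data by using a fallback value.
int parse_port_or(const char* text, int fallback) {
    // A null pointer here is a coding error in the caller, not bad input,
    // so it is checked with assert rather than reported with an exception.
    assert(text != nullptr);
    try {
        return parse_port(text);
    } catch (const std::runtime_error&) {
        return fallback;
    }
}
```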

January 15, 2007

Sequences are the base that the STL is built on. A sequence is simply an ordered set of zero or more elements. Thus, in a non-empty sequence, every element except the first has a predecessor, and every element except the last has a successor. To examine all the elements in a non-empty sequence, start at the first element. If you're not at the last element, move to the next element and go back to the beginning of this sentence. When you get here, you’re finished.

When we write code, a sequence is represented by a pair of iterators. An iterator usually points to an element in a sequence or to an imaginary element that’s just past the end of a sequence. In the pair of iterators that designates a sequence, the first iterator points to the first element in the sequence and the second iterator points to the first element (real or imaginary) after the end of the sequence. For example:

 sequence: 1 1 2 3 5 8 13 21 34 55
           |   |       |  |     |  |
iterators: a   b       c  d     e  f

The iterator a points to the first element in the sequence, and the iterator f points to an imaginary element past the end of the sequence. The rest of the iterators point to elements inside the sequence. But now let’s look at it the other way around: The pair of iterators a and f represents the entire sequence. Further, any two iterators that point into this sequence (in the right order, of course) represent a subsequence of the entire sequence. Thus, the pair of iterators b and d represents the subsequence 2 3 5 8 13, and the pair of iterators e and f represents the subsequence 55.
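The diagram translates directly into code. The sketch below stores the sequence in a std::vector (an assumption; any container would do) and uses hypothetical helper names. The iterator offsets follow the diagram: b points to 2, d points to 21, and e points to 55.

```cpp
#include <vector>

typedef std::vector<int>::const_iterator Iter;

// Hypothetical helper: the sequence from the diagram.
std::vector<int> diagram_sequence() {
    return std::vector<int>{1, 1, 2, 3, 5, 8, 13, 21, 34, 55};
}

// Hypothetical helper: copies the elements designated by a pair of iterators.
std::vector<int> elements_of(Iter first, Iter last) {
    return std::vector<int>(first, last);
}

// The pair (b, d): starts at 2 and stops just before 21.
std::vector<int> subsequence_b_d() {
    const std::vector<int> seq = diagram_sequence();
    Iter b = seq.begin() + 2;
    Iter d = seq.begin() + 7;
    return elements_of(b, d);
}

// The pair (e, f): starts at 55 and stops at the past-the-end position.
std::vector<int> subsequence_e_f() {
    const std::vector<int> seq = diagram_sequence();
    Iter e = seq.begin() + 9;
    Iter f = seq.end();
    return elements_of(e, f);
}
```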

If we have a pair of iterators named first and last, we can talk about the half-open range [first, last), where first points to the first element in a sequence and last points to the first element after the end of that sequence. To examine all the iterators in the half-open range [first, last), start at first. If first and last are equal, you’re done; otherwise, increment first and go back to the beginning of this sentence. When you get here, you’re finished. That's what the usual code for looping through an iterator range does:

    while (first != last)
        ++first;

Of course, walking through a range of iterators doesn’t accomplish much. You also need to do something with the elements of the underlying sequence. You do that by dereferencing the iterator, to see the object that it points to:

    while (first != last)
        cout << *first++ << '\n';

If the two iterators first and last are equal on entry to either of the code snippets above, the code does nothing. Thus, the sequence described by a pair of iterators that are equal is empty.
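The same loop can be wrapped in a function template so that it works with any iterator type, including plain pointers. This is a small sketch; the name range_to_text is hypothetical, and it writes to a string instead of cout so that the result is easy to inspect.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: writes each element of [first, last) on its own line.
// When first == last on entry, the loop body never runs and the result is
// an empty string.
template <class Iterator>
std::string range_to_text(Iterator first, Iterator last) {
    std::ostringstream out;
    while (first != last)
        out << *first++ << '\n';
    return out.str();
}
```

For an array int values[] = {1, 1, 2}, the call range_to_text(values, values + 3) visits all three elements, while range_to_text(values, values) designates an empty sequence and visits none.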

November 15, 2006

Portability is not an absolute -- it’s a measure of how easy it is to move code from one platform to another. So it doesn’t mean much to say that code is or is not portable. Instead, we should talk about how to make code more portable. In general, this means:

favoring standard library utilities over non-standard ones
when using non-standard libraries, favoring those that are well documented and reliably maintained
favoring language facilities that are known to work well over a broad range of compilers
isolating code whose behavior could vary on different platforms

But writing code that’s easy to port takes more than just a list of good ideas. You have to develop work habits that help you apply those ideas. That means occasionally trying to compile your code with compilers other than the one you use for development, so that you can find patterns in the way you write code that don’t work well with other compilers. It also means studying the language definition, the standard library documentation, and the documentation of any non-standard libraries that you are using, so that you can recognize things that won’t be easy to port and either change them or isolate them. And it means writing comprehensive tests to detect changes in behavior that may occur when you port the code. If you don’t know that something changed, you can’t fix it.
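To make the last item in the list concrete, here is a small sketch of isolating platform-dependent behavior. The function names path_separator and join_path are hypothetical, and the choice of _WIN32 as the test macro is an assumption about the compilers involved. Only one function knows about the platform; everything else is written in terms of it, so a port touches one place.

```cpp
#include <string>

// The only function that depends on the platform.
char path_separator() {
#ifdef _WIN32
    return '\\';
#else
    return '/';
#endif
}

// Portable code built on top of the isolated piece.
std::string join_path(const std::string& directory, const std::string& file) {
    if (directory.empty())
        return file;
    if (directory[directory.size() - 1] == path_separator())
        return directory + file;  // separator already present
    return directory + path_separator() + file;
}
```

A test for join_path can be written in terms of path_separator as well, so the same test detects a change in behavior on every platform the code is ported to.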

November 7, 2006

Pete Becker’s

Bits and Pieces