Unicode will have much better support in the
next version of the C++ standard (currently targeted for 2009) than it does
now. There will be two new character types to support UTF-16 and UTF-32
character representations, along with required support for these character
types in several of the current standard template classes. There will also be
functions for converting between multibyte sequences and these new character
types, and a template class that will simplify using codecvt
facets for converting between strings. XML addicts looking for SAX and DOM
won’t be satisfied with these additions, but these changes are intended
to provide tools for implementing more powerful libraries like those.
The two new types are char16_t and char32_t. As their names suggest, they’re
intended to represent 16-bit and 32-bit characters. Formally, they’re distinct
unsigned integral types, with the same size and alignment as the types
std::uint_least16_t and std::uint_least32_t, respectively (these uint types
are defined in the header <cstdint>, which is also new). Of course, you need
to be able to define character and string literals of these types, so the new
standard will introduce the prefixes u and U to describe those literals. u'a'
is a character literal of type char16_t, U'a' is a character literal of type
char32_t, u"ab" is a string literal of type const char16_t[3], and U"ab" is a
string literal of type const char32_t[3].
Whenever you think of character strings you ought to think of
std::basic_string. The new standard will add the template specializations
std::char_traits<char16_t> and std::char_traits<char32_t>, so you can create
std::basic_string<char16_t> and std::basic_string<char32_t> objects. The
latter two types are expected to be used often, so the standard will also
require typedef names for them: std::u16string and std::u32string.
Internally, then, your program can (and should) use characters encoded in
UTF-16 or UTF-32. Externally, if you're required to use a particular character
encoding for your files, there are two new specializations of the locale facet
codecvt to translate UTF-16 and UTF-32 to and from various encodings. Their
names are std::codecvt<char16_t, char, mbstate_t> and
std::codecvt<char32_t, char, mbstate_t>. You can use them with std::fstream
objects by creating an std::locale object with the appropriate codecvt facet
and imbuing your stream object with that locale object. More specifically,
though, if you need to read and write files encoded in UTF-8 or UTF-16, you
can use the facets std::codecvt_utf8 and std::codecvt_utf16, respectively.
If you need to convert between character encodings within your application,
instead of on input or output, use the new templates std::wstring_convert and
std::wbuffer_convert. Each can be instantiated with a codecvt facet.
wstring_convert applies the conversion defined by its codecvt facet to convert
between a byte string and a std::basic_string object holding wide characters.
Similarly, wbuffer_convert is a stream buffer that applies the conversion to
characters as they pass to and from an underlying byte stream buffer.
November 12, 2007
Exceptions have been a hot topic in newsgroup
discussions over the past few days, mostly because of confusion over what the
word means. The C++ language definition gives it a narrow meaning: Code in your
program throws and catches exceptions, using the keywords throw, try, and
catch. Specifications for floating-point
math give it a narrow but different meaning: Various floating-point operations
on invalid operands produce exceptions, which ordinarily result in special
values such as NaN and Infinity, and sometimes cause calls to trap handlers.
Informal usage gives it a rather broad meaning, including such things as dividing
an integer value by 0 and dereferencing a null pointer. To further confuse things,
some compilers turn coding errors in that third category into C++ exceptions.
In particular, most compilers for Windows followed Microsoft’s lead in
integrating the operating system’s structured exception handling into
C++ exception handling, with the result that dereferencing a null pointer
results in an exception that can be caught with catch(...).
Although Microsoft’s online documentation now warns you away from this
technique, many Windows programmers still think that this is how C++ exceptions
are supposed to work.
Code that catches exceptions that result from dereferencing a null pointer
is, obviously, not portable. But there’s a more serious problem: when
an exception can come from just about any point in your code, it’s much
more difficult to write your code so that it works correctly when an exception
occurs. Looking through your source code for throw statements is
no longer sufficient. You have to assume that just about any statement might
throw an exception, and that means you can’t separate code that throws
exceptions from code that doesn’t.
Of course, a program that uses code that dereferences a null pointer probably won’t give correct results, so maybe it doesn’t really matter whether data structures get corrupted because of the null pointer itself or because the null pointer causes an exception that your code can’t deal with correctly. Still, dereferencing a null pointer is the result of a coding error. Most applications should react to coding errors by shutting down as gracefully as possible -- it rarely makes sense to try to continue execution after such an error. While it’s possible to limit your use of exceptions to forcing a program to shut down, by doing that you lose much of the power of exceptions.
In general, don’t use exceptions to manage flow control, and don't use them to signal coding errors. There are better ways of doing those things. Use exceptions for what they were designed to do: recovering from invalid data, signalling the absence of a critical resource, and reporting any other failure that’s beyond your control.
January 15, 2007
Sequences are the base that the STL is built on. A sequence is simply an ordered set of zero or more elements. Thus, in a non-empty sequence, every element except the first has a predecessor, and every element except the last has a successor. To examine all the elements in a non-empty sequence, start at the first element. If you're not at the last element, move to the next element and go back to the beginning of this sentence. When you get here, you’re finished.
When we write code, a sequence is represented by a pair of iterators. An iterator usually points to an element in a sequence or to an imaginary element that’s just past the end of a sequence. In the pair of iterators that designates a sequence, the first iterator points to the first element in the sequence and the second iterator points to the first element (real or imaginary) after the end of the sequence. For example:
sequence:   1   1   2   3   5   8   13  21  34  55
            |       |       |           |       |   |
iterators:  a       b       c           d       e   f
The iterator a points to the first element in the sequence, and the iterator
f points to an imaginary element past the end of the sequence. The rest of
the iterators point to elements inside the sequence. But now let's look at it
the other way around: The pair of iterators a and f represents the entire
sequence. Further, any two iterators that point into this sequence (in the
right order, of course) represent a subsequence of the entire sequence. Thus,
the pair of iterators b and d represents the subsequence 2 3 5 8 13, and the
pair of iterators e and f represents the subsequence 55.
If we have a pair of iterators named first and last, we can talk about the
half-open range [first, last), where first points to the first element in a
sequence and last points to the first element after the end of that sequence.
To examine all the iterators in the half-open range [first, last), start at
first. If first and last are equal, you’re done; otherwise, increment first
and go back to the beginning of this sentence. When you get here, you’re
finished.
That's what the usual code for looping through an iterator range does:
    while (first != last)
        ++first;
Of course, walking through a range of iterators doesn’t accomplish much. You also need to do something with the elements of the underlying sequence. You do that by dereferencing the iterator, to see the object that it points to:
    while (first != last)
        cout << *first++ << '\n';
If the two iterators first and last are equal on entry to either of the code
snippets above, the code does nothing. Thus, the sequence described by a pair
of iterators that are equal is empty.
November 15, 2006
Portability is not an absolute -- it’s a measure of how easy it is to move code from one platform to another. So it doesn’t mean much to say that code is or is not portable. Instead, we should talk about how to make code more portable. In general, this means:
But writing code that’s easy to port takes more than just a list of good ideas. You have to develop work habits that help you apply those ideas. That means occasionally trying to compile your code with compilers other than the one you use for development, so that you can find patterns in the way you write code that don’t work well with other compilers. It also means studying the language definition, the standard library documentation, and the documentation of any non-standard libraries that you are using, so that you can recognize things that won’t be easy to port and either change them or isolate them. And it means writing comprehensive tests to detect changes in behavior that may occur when you port the code. If you don’t know that something changed, you can’t fix it.
November 7, 2006
Copyright © 2007 by Roundhouse Consulting, Ltd. All rights reserved.