These days we have to provide internationalization support in our programs. Among other things, we should be ready to accept text in extended character sets. C++ provides the wchar_t data type for this purpose. I'd like to sum up what should be taken into account when using standard wide-character strings.
I often see people get confused about wchar_t. Some think that wchar_t is “Unicode” and hope they don’t have to care about encodings at all if they use wstring. As it turns out, by “Unicode” they usually mean the UTF-16 encoding.
If you are one of them, please don’t make assumptions like this. Treating wchar_t as Unicode is not portable.
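wchar_t is not UTF-16. It’s true that under MSVC on Win32 sizeof(wchar_t) is equal to 2. But in standard C++ strings each element is supposed to hold a whole character. Strictly by the standard, then, a surrogate pair does not fit the model of a wide string, and a 2-byte wchar_t can portably represent only UCS-2, which covers just the Basic Multilingual Plane, a small subset of the whole Unicode character set. (In practice the Windows APIs do put UTF-16 surrogate pairs into wchar_t strings, but then one character spans two elements, and the standard library treats each half as a separate character.)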
Under GCC on Linux x86, wchar_t is 4 bytes wide and holds UCS-4, not UTF-16. The endianness is not defined either. How will you port your software if all of your code assumes that wchar_t strings are UTF-16?
Definitely, wchar_t is not UTF-16. wchar_t is not even Unicode. The C++ standard only guarantees that wchar_t is wide enough to hold any character from the largest character set supported by the implementation. It says nothing about a concrete encoding. It can be ANSI, or UCS-2, or UCS-4, or even that SCU-128 encoding from the far future.
OK, I give up. wchar_t characters are usually Unicode in some UCS encoding, with different width and endianness on different systems. So in general we can’t make any assumptions about them if we care about portability. They are just “wide enough” characters.
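Since the width differs between implementations, it is safer to inspect it than to assume it. Here is a minimal sketch; the helper names wchar_bits and wchar_holds_any_code_point are made up for this post and are not part of any library. Note that it only tells you about the width of wchar_t, not about the encoding the implementation actually uses.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <cwchar>    // WCHAR_MAX

// Number of bits in one wchar_t on this implementation
// (typically 16 under MSVC and 32 under GCC on Linux).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t is wide enough for every Unicode code point
// (up to U+10FFFF). This says nothing about the encoding actually in use.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFull;
}

// The only thing we can rely on everywhere: wchar_t is at least one byte wide.
static_assert(wchar_bits() >= CHAR_BIT, "wchar_t is at least one byte wide");
```

Code that really needs one code point per element could, for instance, refuse to compile with static_assert(wchar_holds_any_code_point(), "...") instead of silently misbehaving on a 2-byte wchar_t.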
I think it’s good to use std::wstring to store textual data internally. I like the fact that some frameworks (wxWidgets, for example) accept wstring and handle it correctly. But when we need to communicate with the outside world, we need to be able to convert a wstring to something more precisely specified.
We need a portable conversion from the internal wide-character representation to a specified “real-life” encoding. There are lots of APIs that accept strings in a specific encoding rather than the implementation’s wchar_t. My recent example is the sqlite3 library. It supports true UTF-16 on all supported platforms. But only a two-byte UCS-2 wstring can be passed to it directly. On other implementations a conversion must be applied (for example, a UCS-4 wstring under GCC has to be converted to UTF-16).
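To show what such a conversion involves, here is a rough sketch of a hand-written one. The function name wide_to_utf16 is hypothetical, and the code assumes that every wchar_t holds one Unicode code point (UCS-4 on a 4-byte wchar_t) or one UTF-16/UCS-2 code unit (on a 2-byte wchar_t). As discussed above, the standard does not guarantee either, so treat it as an illustration rather than a portable solution.

```cpp
#include <string>

// Hypothetical helper: encodes a wide string as UTF-16 code units.
// Assumption: wchar_t elements are Unicode code points (4-byte wchar_t)
// or UTF-16/UCS-2 code units (2-byte wchar_t).
std::u16string wide_to_utf16(const std::wstring& in) {
    const bool two_byte = sizeof(wchar_t) == 2;
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        char32_t cp = static_cast<char32_t>(wc);
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        // Out-of-range values and stray surrogates in a code-point string
        // are invalid; replace them with U+FFFD. On a 2-byte wchar_t the
        // elements are passed through unchanged.
        if (cp > 0x10FFFF || (surrogate && !two_byte)) {
            cp = 0xFFFD;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Split a supplementary code point into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result is in native byte order, so its data() and size() * sizeof(char16_t) could be handed to a function such as sqlite3_bind_text16, which expects UTF-16 text and a length in bytes.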
The libiconv library, for example, is a good fit for this purpose. It converts sequences of characters from one encoding into another. What is more, it defines the UCS-2-INTERNAL and UCS-4-INTERNAL encodings, which use the machine’s native byte order, so we don’t even need to care about endianness.
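A conversion through the iconv interface might look roughly like the sketch below. The function name wide_to_utf16le is hypothetical. Instead of the -INTERNAL names, this sketch uses the "WCHAR_T" encoding name, which GNU libiconv and glibc accept as the implementation's wchar_t encoding; that is an assumption about your iconv, and other implementations may not know the name. Depending on the platform you may also need to link with -liconv.

```cpp
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a std::wstring to UTF-16LE bytes via iconv.
// Assumption: this iconv implementation understands the "WCHAR_T" name.
std::string wide_to_utf16le(const std::wstring& in) {
    iconv_t cd = iconv_open("UTF-16LE", "WCHAR_T");  // (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open failed");
    }

    // iconv works on raw bytes and does not modify the input buffer.
    char* src = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    std::size_t src_left = in.size() * sizeof(wchar_t);

    // UTF-16 needs at most 4 bytes per input character.
    std::string out(in.size() * 4, '\0');
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv conversion failed");
    }

    out.resize(out.size() - dst_left);  // keep only the bytes actually written
    return out;
}
```

The explicit UTF-16LE target fixes the byte order of the output and does not emit a byte order mark; for an API that wants native-endian UTF-16 a different target name would be needed.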
I often think about writing a C++ wrapper over libiconv. It would be good to have a library that supports converting between std::wstring, std::string and arrays of characters in a custom encoding. It should know which encoding to use for wchar_t strings on each supported implementation. If you have heard of a library like this, please let me know.
Comments
I think it's not true.
Take a look at the following page.
http://msdn.microsoft.com/en-us/library/2dax2h36.aspx
It says that characters which cannot be represented in a single wide character can be represented as a Unicode pair, using Unicode's surrogate feature. So each such character takes more than one wide character to store.