[ACCEPTED]-does (w)ifstream support different encodings-wifstream

Accepted answer
Score: 24

C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture, or from (human) language to language. These aspects are broken down into facets, which are classes (often templates) that define how localization-dependent objects (including I/O streams) behave. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
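To make the facet idea concrete, here is a minimal sketch using a facet other than codecvt. The names comma_grouping, format_number and make_grouping_locale are made up for illustration; only std::numpunct, std::locale and imbue() come from the standard library.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: group digits in threes with ','.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Format a number through a stream imbued with the given locale.
std::string format_number(long value, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);   // the stream now consults loc's facets when formatting
    out << value;
    return out.str();
}

// Copy the classic "C" locale, replacing only its numpunct facet.
// The locale takes ownership of the facet allocated with new.
std::locale make_grouping_locale() {
    return std::locale(std::locale::classic(), new comma_grouping);
}
```

With the classic locale, format_number(1234567, std::locale::classic()) gives "1234567"; with make_grouping_locale() the same call gives "1,234,567". A codecvt facet is plugged into a file stream in the same way, except that it changes how bytes are converted rather than how numbers are punctuated.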

However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should read into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers (e.g. Visual C++ and GNU C++) differ on how big their implementation is. So you generally need to find external libraries to do the actual encoding.
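A quick way to see the wchar_t size problem on your own compiler is sketched below. The helper names wchar_bits and wchar_holds_any_code_point are made up for illustration.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <cwchar>    // WCHAR_MAX

// Width of wchar_t in bits on the current platform.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when one wchar_t can hold any Unicode code point (up to U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}

// char32_t, by contrast, is guaranteed to be at least 32 bits wide.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t is too narrow");
```

Visual C++ uses a 16-bit wchar_t, so wchar_holds_any_code_point() is false there, while GCC on Linux uses a 32-bit wchar_t and the function returns true. Code that assumes either size will misbehave on the other.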

  • iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
  • jla3ep mentions libICU, which is very thorough, but the C++ API does not try to play nicely with the standard. (As far as I can tell; you can scan the examples to see if you can do better.)

The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to encode UTF-8 (UCS4) for use by IO streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):

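// Treat wchar_t as the UCS-4 character type.
typedef wchar_t ucs4_t;

// Copy the current global locale, adding the UTF-8 conversion facet.
std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

...

// Imbue the stream so bytes are decoded from UTF-8 as they are read.
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);
ucs4_t item = 0;
while (input_file >> item) { ... }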
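If you cannot pull in Boost, the same pattern can be sketched with the standard library's own std::codecvt_utf8 from <codecvt>. Be aware that this header has been deprecated since C++17, so treat this as an illustration of the mechanism rather than a recommendation. The function name read_utf8_file is made up for the example.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Read a whole UTF-8 encoded file into a wide string.
// Assumes every character in the file fits into one wchar_t on this platform.
std::wstring read_utf8_file(const char* path) {
    std::wifstream in;
    // Imbue before opening so the conversion facet is in place for the first read.
    in.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

For a file containing the three bytes 'h', 0xC3, 0xA9 (the UTF-8 encoding of "hé"), this returns a wide string of length 2 whose second element is U+00E9. Without the imbued facet, the stream would not decode the two-byte sequence as one character.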

To understand more about locales, and how they use facets (including codecvt), take a look at the following:

Score: 3

ifstream does not care about the encoding of a file. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still doesn't know anything about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode (each character represented with two bytes).
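A small sketch of what "just reads bytes" means in practice. The function names read_bytes and count_utf8_code_points are made up for illustration, and the counting function assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read the raw bytes of a file; ifstream performs no decoding here.
std::string read_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes valid UTF-8; no validation is attempted.
std::size_t count_utf8_code_points(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For a file holding "hé" encoded as UTF-8, read_bytes returns a string of size 3, while count_utf8_code_points reports 2. The stream hands you bytes; interpreting them as characters is up to you (or to a codecvt facet).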

You could use IBM's ICU library to deal with Unicode files.

The International Components for Unicode (ICU) is a mature, portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N), giving applications the same results on all platforms.

ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

Score: 0

The design of wide character strings and wide character streams pre-dates UTF-8, UTF-16 and Unicode. If you want to get technical, the standard string and the standard stream don't necessarily operate on ASCII (it's just that basically all computers out there use ASCII; you could potentially have an EBCDIC machine).

Raymond Chen once wrote a series illustrating how to work with different wide character stream/string types.
