Does (w)ifstream support different encodings?
C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are classes installed in a locale that define how localization-dependent objects (including I/O streams) behave. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only encoding of Unicode types but such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
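To make the "filtered through the locale's facets" part concrete, here is a minimal sketch that installs a custom number-punctuation facet on a stream. The names comma_grouping and format_with_commas are made up for this illustration; only std::numpunct, std::locale and imbue come from the standard library.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: groups digits in threes with commas.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a stream whose locale carries the facet above.
std::string format_with_commas(long value) {
    std::ostringstream out;
    // The new locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(out.getloc(), new comma_grouping));
    out << value;  // operator<< consults the numpunct facet of the imbued locale
    return out.str();
}
```

With this in place, format_with_commas(1234567) yields "1,234,567", even though the code doing the output never mentions commas. A codecvt facet plugs into a file stream in the same way, except that it transforms the bytes rather than the digit grouping.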
However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even such a basic thing as the size of the character you should be reading into (let alone the encoding part) is difficult, as wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers (e.g. Visual C++ and GNU C++) differ on how big their implementation is. So you generally need to find external libraries to do the actual encoding.
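If you want to see what your own compiler does, a small sketch like the following reports the width of wchar_t and whether one wchar_t can hold every Unicode code point. The helper names are hypothetical; the typical values in the comments (16 bits with Visual C++, 32 bits with GCC on Linux) are the usual ones, not something the standard guarantees.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Number of bits in wchar_t on this compiler
// (typically 16 with Visual C++, 32 with GCC on Linux).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t can represent any code point up to U+10FFFF.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
}
```

Where wchar_holds_any_code_point() is false, characters outside the Basic Multilingual Plane need two wchar_t units (surrogate pairs), which is exactly the kind of difference that makes portable code awkward.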
- iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
- jla3ep mentions libICU, which is very thorough, but its C++ API does not try to play nicely with the standard (as far as I can tell; you can scan the examples to see if you can do better).
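For what it's worth, calling iconv directly (without binding it to a stream or a codecvt facet) is not much code. This is only a sketch under a few assumptions: a POSIX system whose iconv accepts the encoding names "UTF-8" and "UTF-32LE" (glibc does) and declares the input buffer as char**. The function name utf8_to_utf32 is made up for the example.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 byte string into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::u32string result;
    if (utf8.empty()) return result;

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");  // (to-encoding, from-encoding)
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::vector<char> in(utf8.begin(), utf8.end());
    // Each input byte produces at most one 4-byte code point.
    std::vector<char> out(utf8.size() * 4);

    char* in_ptr = in.data();
    std::size_t in_left = in.size();
    char* out_ptr = out.data();
    std::size_t out_left = out.size();

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) throw std::runtime_error("invalid or incomplete UTF-8");

    // Reassemble the little-endian bytes into code points.
    std::size_t produced = out.size() - out_left;
    for (std::size_t i = 0; i + 4 <= produced; i += 4) {
        char32_t cp = 0;
        for (int b = 3; b >= 0; --b)
            cp = (cp << 8) | static_cast<unsigned char>(out[i + b]);
        result.push_back(cp);
    }
    return result;
}
```

This converts a whole buffer at once. Wiring the same calls into a stream so that wifstream does it transparently is the part that is hard to find examples for.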
The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to encode UTF-8 (UCS4) for use by IO streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):
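typedef wchar_t ucs4_t;  // assumes wchar_t is wide enough for UCS-4 (it is only 16 bits with Visual C++)
std::locale old_locale;
// The new locale takes ownership of the facet, so there is no matching delete.
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
...
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);  // imbue before the first read
ucs4_t item = 0;
while (input_file >> item) { ... }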
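If you'd rather not pull in Boost, roughly the same thing can be sketched with the standard library's own std::codecvt_utf8 from <codecvt>. Note that this is a different facet from the Boost one above, that it was deprecated in C++17 (so expect compiler warnings), and that the function name read_utf8_file is just for illustration.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads a UTF-8 encoded file into a wide string.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    // Install the UTF-8 conversion facet before the file is opened or read;
    // the locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    // Each wchar_t read here has already been decoded from UTF-8 by the facet.
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

Where wchar_t is 16 bits, this facet can only produce characters that fit in a single unit, so code points above U+FFFF will not survive the trip; that is the size problem mentioned earlier showing up in practice.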
To understand more about locales, and how they use facets (including codecvt), take a look at the following:
- Nathan Myers has a thorough explanation of locales and facets. Myers was one of the designers of the locale concept. He has more formal documentation if you want to wade through it.
- Apache's Standard Library implementation (formerly RogueWave's) has a full list of facets.
- Nicolai Josuttis' The C++ Standard Library Chapter 14 is devoted to the subject.
- Angelika Langer and Klaus Kreft's Standard C++ IOStreams and Locales devotes a whole book.
ifstream does not care about the encoding of the file. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still doesn't know anything about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode (each character is represented with two bytes).
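A quick way to convince yourself that ifstream only sees bytes is to read a file containing a non-ASCII character and count what comes back. A minimal sketch (read_raw_bytes is a made-up name):

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's contents as raw bytes; nothing is decoded.
std::string read_raw_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

For a UTF-8 file that contains only the character é, this returns a string of length 2 holding the bytes 0xC3 and 0xA9. Turning those two bytes into one character is left entirely to you.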
You could use IBM's ICU library to deal with Unicode files.
The International Components for Unicode (ICU) is a mature, portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N), giving applications the same results on all platforms.
ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.
The design of wide character strings and wide character streams pre-dates UTF-8, UTF-16 and Unicode. If you want to get technical, the standard string and the standard stream don't necessarily operate on ASCII (it's just that basically all computers out there use ASCII; you could potentially have an EBCDIC machine).
Raymond Chen once wrote a series illustrating how to work with different wide character stream/string types.