Character size in Java vs. C
In Java characters are 16-bit and in C they are 8-bit.
A more general question is: why is this so?
To find out why, you need to look at history and come to your own conclusions/opinions on the subject.
When C was developed in the USA, ASCII was pretty standard there and you only really needed 7 bits, but with 8 you could handle some non-ASCII characters as well. That might have seemed more than enough. Many text-based protocols, such as SMTP (email), XML and FIX, still use only ASCII characters; email and XML encode non-ASCII characters using ASCII escapes. Binary files, sockets and streams are still natively based on 8-bit bytes.
BTW: C can support wider characters, but that is not plain char.
When Java was developed, 16 bits seemed like enough to support most languages. Since then Unicode has been extended to characters above 65535, and Java has had to add support for code points, which are encoded in UTF-16 and can take one or two 16-bit chars.
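To make the "one or two 16-bit chars" point concrete, here is a minimal sketch that compares the number of chars in a String with the number of code points. The class and method names are made up for this illustration.

```java
final class CodePointDemo {
    // Number of 16-bit chars (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "A"; // U+0041, fits in one char
        String supplementary = new String(Character.toChars(0x1F600)); // U+1F600, above 65535

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));                     // 1 1
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // 2 1
    }
}
```

So String.length() counts chars, not characters as a user would see them.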
So making a byte a byte and char an unsigned 16-bit value made sense at the time.
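A small sketch of what those two types look like in practice: byte is a signed 8-bit value, while char is an unsigned 16-bit one. The class name PrimitiveSizes and its helper methods are invented for this example.

```java
final class PrimitiveSizes {
    // Widen a char to int; char is unsigned, so the result is never negative.
    static int charToInt(char c) {
        return c;
    }

    // Widen a byte to int; byte is signed, so bit patterns above 0x7F come out negative.
    static int byteToInt(byte b) {
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);                 // 16
        System.out.println(Byte.SIZE);                      // 8
        System.out.println(charToInt(Character.MAX_VALUE)); // 65535
        System.out.println(byteToInt((byte) 0xFF));         // -1
    }
}
```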
BTW: If your JVM supports -XX:+UseCompressedStrings, it can use bytes instead of chars for Strings which only use 8-bit characters.
Because Java uses Unicode, whereas C generally uses ASCII by default.
There are various flavours of Unicode encoding, but Java uses UTF-16, which uses either one or two 16-bit code units per character. ASCII always uses one byte per character.
The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.
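As a rough illustration of the size difference, the sketch below encodes the same ASCII-only text both ways and counts the bytes. The class and method names are hypothetical; UTF_16BE is used here because that charset does not prepend a byte-order mark, which would otherwise add two bytes.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLength {
    // Bytes needed to store the text as ASCII (one byte per character).
    static int asciiLength(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length;
    }

    // Bytes needed to store the text as UTF-16 (two bytes per 16-bit code unit).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(asciiLength("hello")); // 5
        System.out.println(utf16Length("hello")); // 10
    }
}
```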
In contrast, C is an "ancient" language that was invented decades before Java, when Unicode was far from a thing. That was the age of 7-bit ASCII and 8-bit EBCDIC, so C uses an 8-bit char¹, as that's enough for a char variable to contain all basic characters. When the Unicode era arrived, to refrain from breaking old code they decided to introduce a different character type in C90, which is wchar_t. Again, this was the 90s, when Unicode began its life. In any case, char must keep its old size because you still need to access individual bytes even if you use wider characters (Java has a separate byte type for this purpose).
Of course, the Unicode Consortium later realized that 16 bits were not enough and had to fix it somehow. They widened the code-point range by changing UCS-2 to UTF-16, to avoid breaking old code that uses wide chars, which makes Unicode a 21-bit charset (actually only up to U+10FFFF instead of U+1FFFFF, because of UTF-16). Unfortunately it was too late, and the old implementations that use a 16-bit char got stuck.
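The U+10FFFF limit follows from how UTF-16 builds a surrogate pair: each half carries 10 bits, giving a 20-bit offset on top of 0x10000. Here is a minimal sketch of that arithmetic; the class and method names are invented for illustration, and the JDK already provides Character.highSurrogate and Character.lowSurrogate for real use.

```java
final class SurrogateMath {
    // High (leading) surrogate: 0xD800 plus the top 10 bits of the offset.
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of the offset.
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    // Largest code point a surrogate pair can express: both 10-bit halves all ones.
    static int maxEncodable() {
        return 0x10000 + (0x3FF << 10) + 0x3FF;
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println(Integer.toHexString(highSurrogate(cp))); // d83d
        System.out.println(Integer.toHexString(lowSurrogate(cp)));  // de00
        System.out.println(Integer.toHexString(maxEncodable()));    // 10ffff
    }
}
```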
Later we saw the advent of UTF-8, which proved to be far superior to UTF-16 because it's independent of endianness, generally takes up less space, and most importantly requires no changes to the standard C string functions. Most user functions that receive a char* will continue to work without special Unicode support.
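These properties can be observed from Java by encoding one string several ways. The sketch below is only an illustration; the class name and the hasZeroByte helper are made up. It shows that UTF-16 output depends on byte order and contains zero bytes for ASCII letters, which a C function like strlen would treat as the end of the string, while UTF-8 output has neither problem.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8VsUtf16 {
    // True if the encoded form contains a zero byte, i.e. a C string terminator.
    static boolean hasZeroByte(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // four characters, the last one is non-ASCII
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16be = s.getBytes(StandardCharsets.UTF_16BE);
        byte[] utf16le = s.getBytes(StandardCharsets.UTF_16LE);

        System.out.println(utf8.length);                      // 5
        System.out.println(utf16be.length);                   // 8
        System.out.println(Arrays.equals(utf16be, utf16le));  // false
        System.out.println(hasZeroByte(utf8));                // false
        System.out.println(hasZeroByte(utf16be));             // true
    }
}
```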
Unix systems were lucky because they migrated to Unicode later, when UTF-8 had already been introduced, and therefore continue to use an 8-bit char. OTOH, all modern Win32 APIs work on 16-bit wchar_t by default because Windows was also an early adopter of Unicode. As a result, the .NET Framework and C# went the same way by having char as a 16-bit type.
Talking about wchar_t, it was so unportable that both the C and C++ standards needed to introduce the new character types char16_t and char32_t in their 2011 revisions:
Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined.
That said, most implementations are working on improving the wide string situation. Java experimented with compressed strings in Java 6 and introduced compact strings in Java 9. Python moved to a more flexible internal representation in 3.3, replacing the wchar_t* used before. Firefox and Chrome have separate internal 8-bit char representations for simple strings. There are also discussions about that for the .NET Framework. And more recently, Windows has been gradually introducing UTF-8 support for the old ANSI APIs.
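The common idea behind these optimizations is to store one byte per character when every character fits, and fall back to 16-bit units otherwise. The following is only a sketch of that idea under my own assumptions, not the actual JVM, Python or browser implementation; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CompactSketch {
    // True if every char fits in 8 bits (the Latin-1 range U+0000..U+00FF).
    static boolean fitsInBytes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // One byte per char when possible, otherwise two bytes per char (UTF-16).
    static byte[] store(String s) {
        return fitsInBytes(s)
                ? s.getBytes(StandardCharsets.ISO_8859_1)
                : s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(store("hello").length);      // 5
        System.out.println(store("h\u20ACllo").length); // 10, U+20AC does not fit in a byte
    }
}
```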
¹ Strictly speaking, char in C is only required to have at least 8 bits. See "What platforms have something other than 8-bit char?"
A Java char is a UTF-16 code unit (not necessarily a whole Unicode code point), while C uses ASCII encoding in most cases.