[ACCEPTED]-Encode byte[] as String-byte
You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact but harder for humans to read.)
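As a minimal sketch of both options, the class below round-trips a byte[] through base64 (using the JDK's `java.util.Base64`, available since Java 8) and through hex. The class and method names (`BinaryToText`, `toHex`, and so on) are made up for illustration and are not part of any library.

```java
import java.util.Base64;

final class BinaryToText {
    // Base64: 4 output characters for every 3 input bytes.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Hex: 2 output characters for every input byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }

    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at pair " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0, (byte) 0xFF, (byte) 0x80, 0x7F};
        System.out.println(toBase64(data)); // AP+Afw==
        System.out.println(toHex(data));    // 00ff807f
    }
}
```

Both outputs contain only printable ASCII, which is what makes them safe to pass through text-only channels.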
You claim "both variants work perfectly" but that's actually not true. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You're not trying to convert UTF-8-encoded text into a String, so don't write code which tries to do so.
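To make the data loss concrete, here is a small sketch (the class name `Utf8LossDemo` is hypothetical) that decodes bytes as UTF-8 and encodes them again. Valid UTF-8 text survives; an arbitrary byte such as 0xFF is replaced by U+FFFD during decoding, so the original value cannot be recovered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8LossDemo {
    // Decode as UTF-8, then encode back to bytes.
    static byte[] roundTripUtf8(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Genuine UTF-8 text round-trips unchanged.
        byte[] text = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(text, roundTripUtf8(text)));     // true

        // 0xFF can never appear in valid UTF-8; the decoder substitutes U+FFFD,
        // which re-encodes as the three bytes EF BF BD.
        byte[] binary = {(byte) 0xFF, 0x41};
        System.out.println(Arrays.equals(binary, roundTripUtf8(binary))); // false
    }
}
```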
Using ISO-8859-1 as an encoding will preserve all the data - but in very many cases the string that is returned will not be easily transported across other protocols. It may very well contain unprintable control characters, for example.
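The following sketch illustrates both halves of that statement, assuming nothing beyond the JDK. The class name `Latin1Demo` and its helpers are invented for this example. Every byte value survives the ISO-8859-1 round trip, yet a string built from all 256 values contains 65 ISO control characters (U+0000 to U+001F and U+007F to U+009F).

```java
import java.nio.charset.StandardCharsets;

final class Latin1Demo {
    static String toLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Counts characters that Character.isISOControl reports as control codes.
    static int countControlChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isISOControl(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Returns an array holding every possible byte value once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        String s = toLatin1(allByteValues());
        System.out.println(s.length());           // 256
        System.out.println(countControlChars(s)); // 65
    }
}
```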
Only use the String(byte[], String) constructor when you've got inherently textual data, which you happen to have in an encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, just for example - you should use an approach which treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
You can store a byte in a String, though it's not a good idea. You can't use UTF-8, as this will mangle the bytes, but a faster and more efficient way is to use ISO-8859-1 encoding or plain 8-bit. The simplest way to do this is to use either of the following (the first is a deprecated constructor that sets the high byte of every char to 0):
String s1 = new String(data, 0);
or
String s1 = new String(data, "ISO-8859-1");
From the UTF-8 article on Wikipedia: as Jon Skeet notes, the encodings below are not valid under the standard. Their behaviour in Java varies. DataInputStream treats the first three versions as the same, and the next two throw an exception. The Charset decoder silently treats them as separate characters.
00000000 is \0
11000000 10000000 is \0
11100000 10000000 10000000 is \0
11110000 10000000 10000000 10000000 is \0
11111000 10000000 10000000 10000000 10000000 is \0
11111100 10000000 10000000 10000000 10000000 10000000 is \0
This means if you see \0 in your String, you have no way of knowing for sure what the original byte[] values were. DataOutputStream uses the second option for compatibility with C, which sees \0 as a terminator.
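The two-byte form of \0 can be observed directly. The sketch below (class and method names are hypothetical) shows that `DataOutputStream.writeUTF` emits C0 80 for \0, that `DataInputStream.readUTF` reads it back as \0, and that a strict UTF-8 `CharsetDecoder` configured with `CodingErrorAction.REPORT` rejects the same bytes. The lenient `new String(bytes, UTF_8)` path does not produce \0 either; on the JDKs I am aware of it substitutes replacement characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class OverlongNulDemo {
    // The overlong two-byte encoding of U+0000.
    static final byte[] OVERLONG_NUL = {(byte) 0xC0, (byte) 0x80};

    // writeUTF output: 2-byte length prefix followed by modified UTF-8.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static String readUtf(byte[] framed) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True only if the bytes are well-formed standard UTF-8.
    static boolean strictUtf8Accepts(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] framed = writeUtf("\0");
        // Prints "0 2 c0 80": length prefix 0x0002, then the overlong NUL.
        for (byte b : framed) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
        System.out.println();
        System.out.println(readUtf(framed).equals("\0"));    // true
        System.out.println(strictUtf8Accepts(OVERLONG_NUL)); // false
    }
}
```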
BTW, DataOutputStream is not aware of code points, so it writes high code point characters as UTF-16 surrogate pairs and then encodes each surrogate in UTF-8 style.
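A quick way to see this, sketched with a hypothetical `SurrogateDemo` class: U+1F600 takes four bytes in standard UTF-8, but `writeUTF` encodes each of its two surrogate chars as a separate three-byte sequence, giving six bytes of payload after the two-byte length prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SurrogateDemo {
    // Returns the bytes writeUTF produces, without the 2-byte length prefix.
    static byte[] writeUtfPayload(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = buf.toByteArray();
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    public static void main(String[] args) {
        // One supplementary code point, stored as two chars in a Java String.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(writeUtfPayload(s).length);                 // 6
    }
}
```

The consequence is that bytes written by `writeUTF` should be read back with `readUTF`, not with a standard UTF-8 decoder.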
0xFE and 0xFF are not valid anywhere in a character. Byte values of 11000000 (0xC0) and above can only appear at the start of a character, not inside a multi-byte character.
Confirmed the accepted answer with Java. To repeat, UTF-8 and UTF-16 do not preserve all the byte values. ISO-8859-1 does preserve all the byte values. But if the encoded bytes are to be transported beyond the JVM, use Base64.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Base64 here is the Apache Commons Codec class, not java.util.Base64.
import org.apache.commons.codec.binary.Base64;
import org.junit.Test;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

    @Test
    public void testBase64() {
        final byte[] original = enumerate();
        final String encoded = Base64.encodeBase64String( original );
        final byte[] decoded = Base64.decodeBase64( encoded );
        assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testIso8859() {
        final byte[] original = enumerate();
        String s = new String( original, StandardCharsets.ISO_8859_1 );
        final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
        assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testUtf16() {
        final byte[] original = enumerate();
        String s = new String( original, StandardCharsets.UTF_16 );
        final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
        assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testUtf8() {
        final byte[] original = enumerate();
        String s = new String( original, StandardCharsets.UTF_8 );
        final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
        assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testEnumerate() {
        final Set<Byte> byteSet = new HashSet<>();
        final byte[] bytes = enumerate();
        for ( byte b : bytes ) {
            byteSet.add( b );
        }
        assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
    }

    /**
     * Enumerates all the byte values.
     */
    private byte[] enumerate() {
        final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
        final byte[] bytes = new byte[length];
        for ( int i = 0; i < length; i++ ) {
            bytes[i] = (byte)(i + Byte.MIN_VALUE);
        }
        return bytes;
    }