How to convert UTF-8 to US-ASCII in Java
You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):
    public static String decompose(String s) {
        // Decompose accented characters (NFD), then drop the combining marks.
        return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).
Be aware that there are no universally accepted approximations: Germans want you to replace Ä by AE, Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might like the historically more correct AA better.
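To make the point about locale-dependent approximations concrete, here is a minimal sketch of a table-driven approach. The class name LocaleAwareFolding, the three tiny tables, and the choice to emit '?' for unmapped characters are all assumptions made for illustration; they are not taken from uni2ascii, and a real table would need far more entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: folds characters to ASCII using a caller-chosen table.
final class LocaleAwareFolding {
    static final Map<Character, String> GERMAN = new HashMap<>();
    static final Map<Character, String> SWEDISH = new HashMap<>();
    static final Map<Character, String> DANISH = new HashMap<>();

    static {
        // German convention: umlauts become the base vowel plus E.
        GERMAN.put('\u00c4', "AE"); // A with diaeresis
        GERMAN.put('\u00e4', "ae");
        GERMAN.put('\u00df', "ss"); // sharp s

        // Swedish convention (as described above): just drop the mark.
        SWEDISH.put('\u00c4', "A");
        SWEDISH.put('\u00e4', "a");
        SWEDISH.put('\u00c5', "A"); // A with ring
        SWEDISH.put('\u00e5', "a");

        // Danish/Norwegian convention: A with ring becomes AA.
        DANISH.put('\u00c5', "AA");
        DANISH.put('\u00e5', "aa");
    }

    static String fold(String s, Map<Character, String> table) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = table.get(c);
            if (replacement != null) {
                out.append(replacement);   // table hit
            } else if (c < 128) {
                out.append(c);             // already ASCII
            } else {
                out.append('?');           // assumption: mark unmapped chars
            }
        }
        return out.toString();
    }
}
```

The same input then gives different results depending on the table passed in, which is exactly why a single universal table is hard to agree on.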
Instead of creating your own table, you could instead convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.
The tables still exist, but are now the ones from the Unicode standard.
You could also try NFKD instead of NFD, to catch even more cases.
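The normalize-then-strip idea above can be sketched as follows. The class and method names are hypothetical, and this version removes every non-ASCII character after decomposition (keeping ASCII digits, spaces and punctuation) rather than keeping letters only; adjust the regular expression if you want the stricter behaviour.

```java
import java.text.Normalizer;

// Hypothetical helper illustrating NFD/NFKD followed by stripping.
final class AsciiApproximation {
    static String toAscii(String s, Normalizer.Form form) {
        // Split each character into base character + combining marks
        // (NFD), or additionally unfold compatibility forms (NFKD).
        String decomposed = Normalizer.normalize(s, form);
        // Drop whatever is still outside the ASCII range.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }
}
```

With NFD, "\u00e1" (a with acute) becomes "a". The ligature "\ufb01" (fi) has only a compatibility decomposition, so NFD drops it entirely while NFKD yields "fi". Characters with no decomposition at all, such as "\u00df" (sharp s), are simply removed in both forms, so a small hand-written table may still be needed for those.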
In response to the answer given by Joe Liversedge: the Lucene ISOLatin1AccentFilter it references no longer exists.
It has been replaced by org.apache.lucene.analysis.ASCIIFoldingFilter:
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted.
FYI: this is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. It isn't really designed for plugging into a random local implementation, but it does the trick.
There are some built-in functions to do this. The main class involved is CharsetEncoder, which is part of the nio package. A simpler way is String.getBytes(Charset), whose result can be written to a ByteArrayOutputStream.
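A rough sketch of both built-in routes follows. The class and method names are hypothetical. Note that neither route approximates characters: anything outside US-ASCII is replaced (by '?' with getBytes, or by a byte you choose with CharsetEncoder), so this is only a fit when losing those characters is acceptable.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the JDK's own encoding facilities.
final class AsciiEncoding {
    // Simple route: unmappable characters become the charset's
    // default replacement byte, which is '?' for US-ASCII.
    static byte[] simple(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Explicit route: configure what happens to unmappable characters.
    static byte[] withReplacement(String s, byte replacement) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; wrapped to stay unchecked.
            throw new UncheckedIOException(e);
        }
    }

    // Cheap check for whether any replacement would happen at all.
    static boolean isPureAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }
}
```

Using CodingErrorAction.REPORT instead of REPLACE would make encode throw a CharacterCodingException on the first non-ASCII character, which can be preferable when silent data loss is not acceptable.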