Find non-ASCII characters in a text file and convert them to their Unicode equivalent

Accepted answer
Score: 13

Assuming your script knows the correct encoding of your text snippet, a regular expression is enough to find all non-ASCII characters; see https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966.
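As a rough illustration of the regular-expression approach, here is a minimal Python sketch. The pattern `[^\x00-\x7F]` (anything outside the 7-bit ASCII range) is an assumption about what such a regex typically looks like, and the helper names `find_non_ascii` and `escape_non_ascii` are made up for this example. It assumes the text has already been decoded with the correct encoding.

```python
import re

# Assumed pattern: any character outside the 7-bit ASCII range U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text):
    """Return (index, character, code point label) for each non-ASCII character."""
    return [
        (m.start(), m.group(), f"U+{ord(m.group()):04X}")
        for m in NON_ASCII.finditer(text)
    ]


def escape_non_ascii(text):
    """Replace each non-ASCII character with a <U+XXXX> label (hypothetical format)."""
    return NON_ASCII.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)


# "Ulleråkersvägen", written with escapes so the source file itself stays ASCII.
sample = "Uller\u00e5kersv\u00e4gen"
print(find_non_ascii(sample))
print(escape_non_ascii(sample))
```

The label format is only one possible "Unicode equivalent"; swapping the lambda for another formatter would produce, say, `\uXXXX` escapes instead.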

Also, the base-R tools package provides two functions to detect non-ASCII characters.

Score: 4

You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try to find information about the origin and format of the text file and make sure that you read the file properly in your software.

For example, “Ullerهkersvنgen” looks like a Scandinavian name with Scandinavian letters in it, misinterpreted under a wrong character encoding assumption or munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex); in windows-1256, it is E5. Since Scandinavian texts are normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن”, U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
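The byte-level detective work above can be reproduced with Python's built-in codecs. This is only a sketch: the helpers `byte_in` and `char_in` are hypothetical names introduced here, and they assume single-byte encodings.

```python
def byte_in(ch, encoding):
    """Return the byte value that encodes ch in a single-byte encoding."""
    return ch.encode(encoding)[0]


def char_in(byte, encoding):
    """Return the character that a byte value stands for in a single-byte encoding."""
    return bytes([byte]).decode(encoding)


heh, noon = "\u0647", "\u0646"  # the two Arabic letters in the garbled name

# Which bytes could have produced the Arabic letters?
print(hex(byte_in(heh, "iso-8859-6")))     # candidate byte under ISO-8859-6
print(hex(byte_in(heh, "windows-1256")))   # candidate byte under windows-1256
print(hex(byte_in(noon, "windows-1256")))

# What do those bytes mean in the encodings typical for Scandinavian text?
for b in (0xE7, 0xE5, 0xE4):
    print(hex(b), char_in(b, "iso-8859-1"), char_in(b, "windows-1252"))
```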

Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
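To make that conclusion concrete, here is a small Python sketch under the same assumption (windows-1252 bytes misread as windows-1256). The function name `reinterpret` is invented for this example; when you still have the original file, simply opening it with the right encoding is the cleaner route.

```python
def reinterpret(text, wrong_encoding, right_encoding):
    """Undo a wrong decoding: recover the original bytes, then decode them correctly."""
    return text.encode(wrong_encoding).decode(right_encoding)


# The same bytes read two ways: wrongly as Windows Arabic, correctly as Windows Latin 1.
raw = "Uller\u00e5kersv\u00e4gen".encode("windows-1252")
garbled = raw.decode("windows-1256")
correct = raw.decode("windows-1252")
print(garbled)
print(correct)

# If only the garbled string is left, round-trip it through the wrong encoding.
print(reinterpret(garbled, "windows-1256", "windows-1252"))

# After reading correctly, the text can be converted to any other encoding.
utf8_bytes = correct.encode("utf-8")
```

For a file on disk, the equivalent of the first step would be `open(path, encoding="windows-1252")`, where `path` stands for your own file.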
