[ACCEPTED]-How to detect UTF-8 characters in a Latin1 encoded column - MySQL-latin1

Accepted answer
Score: 56

Character encoding, like time zones, is 13 a constant source of problems.

What you can 12 do is look for any "high-ASCII" characters 11 as these are either LATIN1 accented characters 10 or symbols, or the first of a UTF-8 multi-byte 9 character. Telling the difference isn't 8 going to be easy unless you cheat a bit.

To 7 figure out what encoding is correct, you 6 just SELECT two different versions and compare 5 visually. Here's an example:

FROM users 

This is made 4 unusually complicated because the MySQL 3 regexp engine seems to ignore things like 2 \x80 and makes it necessary to use the UNHEX() method 1 instead.

This produces results like this:

latin1                utf8
Björn                Björn
Score: 10

Since your question is not completely clear, let's 1 assume some scenarios:

  1. Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This will translate the incorrectly stored data into incorrectly correctly stored UTF-8, the way it has worked so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
  2. Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any non-latin1 character should be replaced by a ?.
  3. Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
Score: 2

There is a script on github to help with this sort of a thing.


Score: 0

I would create a dump of the database and 12 grep for all valid UTF8 sequences. Where 11 to take it from there depends on what you 10 get. There are multiple questions on SO 9 about identifying invalid UTF8; you can 8 basically just reverse the logic.

Edit: So basically, any 7 field consisting entirely of 7-bit ASCII 6 is safe, and any field containing an invalid 5 UTF-8 sequence can be assumed to be Latin-1. The 4 remaining data should be inspected - if 3 you are lucky, a handful of obvious substitutions 2 will fix the absolute majority (replace 1 ö with Latin-1 ö, etc).

More Related questions