UTF8 Encoding problem - With good examples

Accepted answer
Score: 30

This may be a job for the mb_detect_encoding() function.

In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer" - it checks for the presence of certain characters and byte values to make an educated guess - but in this narrow case (it'll need to distinguish just between UTF-8 and ISO-8859-1) it should work.

<?php
$text = $entity['Entity']['title'];

echo 'Original : ', $text."<br />";
// Strict mode (third argument): only report UTF-8 when the whole string is valid UTF-8.
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1", true);

echo 'Detected encoding '.$enc."<br />";

echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";

?>

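You may get incorrect results for strings that do not contain special characters, but that is not a problem: plain ASCII bytes are identical in UTF-8 and ISO-8859-1, so the conversion leaves them unchanged either way.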
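
A stricter variant of the same idea, as a rough sketch: check whether the bytes are already valid UTF-8 with mb_check_encoding() and convert only when they are not. The helper name ensure_utf8() is made up for this example, and it assumes the only alternative to UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of the answer above): returns $text as UTF-8,
// assuming it is either already UTF-8 or ISO-8859-1.
function ensure_utf8(string $text): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Anything else is treated as ISO-8859-1, where every byte is one character.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Cond\xE9 Nast";      // "Condé Nast" as ISO-8859-1 bytes
$utf8   = "Cond\xC3\xA9 Nast";  // the same text as UTF-8 bytes

var_dump(ensure_utf8($latin1) === $utf8); // the Latin1 input is converted
var_dump(ensure_utf8($utf8) === $utf8);   // the UTF-8 input is left alone
```

The catch is the same as with any sniffing approach: an ISO-8859-1 string whose bytes happen to form valid UTF-8 will be left as is.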

Score: 9

I made a function that addresses all these issues. It's called Encoding::toUTF8().

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>

Output:

Original : France Télécom
Encoding::toUTF8 : France Télécom

Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications

You don't need to know what the encoding of your strings is, as long as you know it is either Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can have a mix of them too.

Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip
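
For readers who just want to see how a mixed string can be handled at all, here is a rough sketch of one possible approach. It is not the code of the Encoding class: the function name mixed_to_utf8() is invented for this example, and it assumes every non-ASCII byte is either part of a valid UTF-8 sequence or a single Windows-1252 character.

```php
<?php
// Independent sketch, NOT the implementation of Encoding::toUTF8().
// Assumption: each non-ASCII byte is either part of a valid UTF-8 sequence
// or a single Windows-1252 character.
function mixed_to_utf8(string $text): string
{
    $pattern = '/
          [\xC2-\xDF][\x80-\xBF]             # valid 2-byte UTF-8
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
        | [\x80-\xFF]                        # stray byte from a legacy encoding
    /x';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Multi-byte matches are already UTF-8; single bytes get converted.
        return strlen($m[0]) > 1
            ? $m[0]
            : mb_convert_encoding($m[0], 'UTF-8', 'Windows-1252');
    }, $text);

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $text;
}

// A UTF-8 "é" followed by a Latin1 "é" in the same string:
echo bin2hex(mixed_to_utf8("T\xC3\xA9l\xE9com")), "\n";
```

Legacy text whose bytes happen to look like a valid UTF-8 sequence cannot be told apart from real UTF-8 this way, so treat it as a best-effort cleanup.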

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
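
The kind of garbling this targets usually comes from UTF-8 bytes being converted to UTF-8 a second (or third) time as if they were Latin1. A minimal sketch of undoing that, assuming ISO-8859-1 was the wrongly assumed source encoding; it is not the code of Encoding::fixUTF8(), and undo_double_utf8() is a made-up name:

```php
<?php
// Independent sketch, NOT the implementation of Encoding::fixUTF8().
// Assumption: the garbling came from UTF-8 bytes being re-encoded as if they
// were ISO-8859-1, possibly several times.
function undo_double_utf8(string $text, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Reverse one round of "ISO-8859-1 to UTF-8" conversion.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

        $changed   = $candidate !== $text;
        $validUtf8 = mb_check_encoding($candidate, 'UTF-8');
        // The step is lossless only if converting back reproduces the input.
        $reversible = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') === $text;

        if (!$changed || !$validUtf8 || !$reversible) {
            break; // nothing left to undo, or undoing would damage the text
        }
        $text = $candidate;
    }
    return $text;
}

// "Fédération" whose UTF-8 bytes were encoded to UTF-8 a second time:
$garbled = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration";
var_dump(undo_double_utf8($garbled) === "F\xC3\xA9d\xC3\xA9ration");
```

A correctly encoded string stops at the first round, because undoing one more level would no longer give valid UTF-8. Text that legitimately contains sequences such as "Ã" followed by "©" would be altered, so this is a heuristic.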

Score: 6

Another way, maybe faster and less unreliable:

echo (strlen($str)!==strlen(utf8_decode($str)))
  ? $str                //is multibyte, leave as is
  : utf8_encode($str);  //encode

It compares the length of the original string and the utf8_decoded string. A string that contains a multibyte character has a strlen which differs from the strlen of the same text in a single-byte encoding.

For example:

strlen('Télécom') 

should return 7 in Latin1 and 9 in UTF8
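
A quick way to check those numbers without depending on how the source file itself is saved is to spell the bytes out with hex escapes. This sketch uses mb_convert_encoding() instead of utf8_decode(), since utf8_encode() and utf8_decode() are deprecated as of PHP 8.2:

```php
<?php
$utf8   = "T\xC3\xA9l\xC3\xA9com";                           // "Télécom" as UTF-8 bytes
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // same text, one byte per character

echo strlen($latin1), "\n";            // 7 bytes
echo strlen($utf8), "\n";              // 9 bytes: each "é" takes two
echo mb_strlen($utf8, 'UTF-8'), "\n";  // 7 characters
```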

Score: 1

I made these 2 little functions that work well with UTF-8 and ISO-8859-1 detection / conversion...

function detect_encoding($string)
{
    //http://w3.org/International/questions/qa-forms-utf-8.html
    if (preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string))
        return 'UTF-8';

    //If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
    //if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
    return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}

function convert_encoding($string, $to_encoding, $from_encoding = '')
{
    if ($from_encoding == '')
        $from_encoding = detect_encoding($string);

    if ($from_encoding == $to_encoding)
        return $string;

    return mb_convert_encoding($string, $to_encoding, $from_encoding);
}

If your database contains strings in 2 different charsets, what I would do, instead of plaguing all your application code with charset detection / conversion, is write a "one shot" script that reads all of your tables' records and updates their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.

Just loop over the records in every table of your database and convert strings like this:

//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
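
A rough sketch of what such a one-shot pass could look like with PDO. The table and column names in the example call are placeholders, and normalize_to_utf8() is a simplified stand-in that assumes every value is either UTF-8 or ISO-8859-1:

```php
<?php
// Sketch of a one-shot conversion pass. Table name, column names and the
// ISO-8859-1 fallback are assumptions for illustration only.

// Converts one value to UTF-8, assuming it is either UTF-8 or ISO-8859-1.
function normalize_to_utf8(string $value): string
{
    return mb_check_encoding($value, 'UTF-8')
        ? $value
        : mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Rewrites one text column of one table; only rows that change are updated.
function convert_column(PDO $pdo, string $table, string $idColumn, string $textColumn): int
{
    // Identifiers cannot be bound as parameters, so they must come from trusted code.
    $select = $pdo->query("SELECT $idColumn, $textColumn FROM $table");
    $update = $pdo->prepare("UPDATE $table SET $textColumn = :value WHERE $idColumn = :id");

    $updated = 0;
    foreach ($select->fetchAll(PDO::FETCH_NUM) as [$id, $old]) {
        if ($old === null) {
            continue; // nothing to convert
        }
        $new = normalize_to_utf8((string) $old);
        if ($new !== $old) {
            $update->execute([':value' => $new, ':id' => $id]);
            $updated++;
        }
    }
    return $updated;
}

// Example call with hypothetical names:
// $pdo = new PDO($dsn, $user, $password);
// echo convert_column($pdo, 'entities', 'id', 'title'), " rows updated\n";
```

The bytes you read back depend on the charset of the database connection, so it is worth trying this on a copy of the data first.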

Score: 0

I didn't try your samples here, but from past experience, there is a quick fix for this. Right after the database connection, execute the following query BEFORE running any other queries:

SET NAMES UTF8;

This is SQL Standard compliant, and works well with other databases, like Firebird and PostgreSQL.

But remember, you need to ensure UTF-8 declarations in other spots too in order to make your application work fine. Here is a quick checklist.

  • All files should be saved as UTF-8 (preferably without BOM [Byte Order Mark])
  • Your HTTP server should send the UTF-8 encoding header. Use Firebug or Live HTTP Headers to inspect.
  • If your server compresses and/or tokenizes the response, you may see header content as chunked or gzipped. This is not a problem if you save your files as UTF-8 and
  • Declare the encoding in the HTML header, using the proper meta tag.
  • Throughout the application (sockets, file system, databases...), do not forget to flag UTF-8 every time you can. Doing this when opening a database connection and so on saves you from having to encode/decode/debug all the time. Grab 'em by the root.
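
In PHP, the checklist above might translate to something like the following sketch. Host, database name and credentials are placeholders, both helper names are invented for this example, and the MySQL DSN with utf8mb4 (MySQL's full UTF-8 charset) is an assumption about the setup:

```php
<?php
// Sketch of the checklist in PHP. Connection details are placeholders.

// Builds the Content-Type header value that declares UTF-8 to the browser.
function utf8_content_type(string $mime = 'text/html'): string
{
    return "Content-Type: $mime; charset=UTF-8";
}

// Opens a MySQL connection whose client charset is set from the start.
function open_utf8_connection(string $host, string $db, string $user, string $pass): PDO
{
    // "charset" in the DSN takes the place of running SET NAMES after connecting.
    return new PDO("mysql:host=$host;dbname=$db;charset=utf8mb4", $user, $pass);
}

if (!headers_sent()) {
    header(utf8_content_type());       // HTTP header
}
mb_internal_encoding('UTF-8');         // default encoding for mb_* functions
echo '<meta charset="UTF-8">', "\n";   // declaration inside the HTML head
```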
