UTF8 Encoding problem - With good examples
This may be a job for the mb_detect_encoding() function.
In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer" (it checks for the presence of certain characters and byte values to make an educated guess), but in this narrow case (it'll need to distinguish just between UTF-8 and ISO-8859-1) it should work.
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text, "<br />";
// Strict mode (third argument): bytes that are not valid UTF-8 fall through to ISO-8859-1.
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1", true);
echo 'Detected encoding ', $enc, "<br />";
echo 'Fixed result: ', iconv($enc, "UTF-8", $text), "<br />";
?>
You may get incorrect results for strings that do not contain special characters, but that is not a problem: plain ASCII bytes are identical in UTF-8 and ISO-8859-1, so the conversion leaves them unchanged.
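As a deterministic variation on the snippet above, the sketch below checks whether the bytes are valid UTF-8 with mb_check_encoding() instead of asking mb_detect_encoding() to guess, and converts from ISO-8859-1 only when they are not. The function name ensure_utf8 and the sample strings are assumptions made for illustration. The byte escapes keep the example independent of how the file itself is saved.

```php
<?php
// Hypothetical helper (not from the original answer).
// Assumes the input is either valid UTF-8 or ISO-8859-1, nothing else.
function ensure_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (this includes plain ASCII)
    }
    // Not valid UTF-8: treat every byte as ISO-8859-1 and re-encode.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Cond\xE9 Nast";      // "é" stored as the single ISO-8859-1 byte 0xE9
$utf8   = "Cond\xC3\xA9 Nast";  // "é" stored as the two UTF-8 bytes 0xC3 0xA9

var_dump(ensure_utf8($latin1) === $utf8); // the Latin1 input is converted
var_dump(ensure_utf8($utf8) === $utf8);   // the UTF-8 input is left alone
```

A Latin1 string whose bytes happen to form a valid UTF-8 sequence would be left untouched by this check, so it is only a reasonable heuristic for ordinary text.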
I made a function that addresses all these issues. It's called Encoding::toUTF8().
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>
Output:
Original : France Télécom
Encoding::toUTF8 : France Télécom
Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications
You don't need to know what the encoding of your strings is, as long as you know it is either Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can have a mix of them too.
Encoding::toUTF8() will convert everything to UTF8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
Usage:
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip
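To illustrate how a string mixing UTF8 and Latin1 can be normalized, here is a minimal sketch. It is not the code of the Encoding class: the function name mixed_to_utf8 is hypothetical, and it only handles ISO-8859-1 for the stray bytes, not Windows-1252. The idea is to scan the bytes, keep every well-formed UTF-8 sequence as is, and re-encode any other high byte as Latin1.

```php
<?php
// Hypothetical sketch, not the Encoding::toUTF8() implementation.
function mixed_to_utf8(string $s): string
{
    // Byte-wise pattern (no /u flag): group 1 matches one well-formed
    // UTF-8 multibyte sequence; the last alternative matches a stray high byte.
    $pattern = '/(
          [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        ) | [\x80-\xFF]/x';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is present only when a valid UTF-8 sequence matched: keep it.
        if (isset($m[1]) && $m[1] !== '') {
            return $m[1];
        }
        // Otherwise treat the lone byte as ISO-8859-1 and re-encode it.
        return mb_convert_encoding($m[0], 'UTF-8', 'ISO-8859-1');
    }, $s);

    return $result ?? $s; // fall back to the input if the regex engine fails
}

// First "é" is UTF-8 (0xC3 0xA9), second "é" is Latin1 (0xE9).
$mixed = "T\xC3\xA9l\xE9com";
var_dump(mixed_to_utf8($mixed) === "T\xC3\xA9l\xC3\xA9com");
```

Latin1 text whose bytes happen to look like a valid UTF-8 sequence is ambiguous and is kept unchanged by this sketch.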
I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
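Garbled strings like these typically come from UTF8 text being encoded to UTF8 again, once or several times. The sketch below shows one way such layers could be peeled off. It is an assumption about the general technique, not the code of Encoding::fixUTF8(): the function name undo_double_utf8 and the round limit are hypothetical, and it only considers ISO-8859-1 as the intermediate encoding.

```php
<?php
// Hypothetical sketch, not the Encoding::fixUTF8() implementation.
function undo_double_utf8(string $s, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        if (!mb_check_encoding($s, 'UTF-8')) {
            break; // not UTF-8 at all: nothing to undo here
        }
        // Undo one layer: map each code point back to a single byte.
        $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        // The step must be lossless (no character replaced by "?").
        $lossless = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $s;
        // Stop when nothing changes or the result is no longer valid UTF-8.
        if ($decoded === $s || !$lossless || !mb_check_encoding($decoded, 'UTF-8')) {
            break;
        }
        $s = $decoded;
    }
    return $s;
}

$good   = "F\xC3\xA9d\xC3\xA9ration";                          // correct UTF-8
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');   // encoded twice
$triple = mb_convert_encoding($double, 'UTF-8', 'ISO-8859-1'); // encoded three times

var_dump(undo_double_utf8($good) === $good);
var_dump(undo_double_utf8($double) === $good);
var_dump(undo_double_utf8($triple) === $good);
```

Text that legitimately contains a sequence such as "Ã" followed by "©" cannot be told apart from double-encoded text, so this kind of repair is a heuristic.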
Another way, maybe faster and less unreliable:
echo (strlen($str)!==strlen(utf8_decode($str)))
? $str //is multibyte, leave as is
: utf8_encode($str); //encode
It compares the length of the original string and the utf8_decoded string. A string that contains a multibyte character has a strlen which differs from the strlen of the equivalent single-byte-encoded string.
For example:
strlen('Télécom')
should return 7 in Latin1 and 9 in UTF8
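The byte counts can be checked with a small script. This is only an illustration: it builds the two versions of the string from byte escapes, and it uses mb_convert_encoding() rather than utf8_decode(), because utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.

```php
<?php
// "Télécom" in UTF-8: each "é" takes two bytes (0xC3 0xA9).
$utf8 = "T\xC3\xA9l\xC3\xA9com";
// The same text in ISO-8859-1: each "é" takes one byte (0xE9).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo strlen($latin1), "\n";           // 7 (bytes)
echo strlen($utf8), "\n";             // 9 (bytes)
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 7 (characters)
```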
I made these two little functions that work well with UTF-8 and ISO-8859-1 detection / conversion...
function detect_encoding($string)
{
    // http://w3.org/International/questions/qa-forms-utf-8.html
    if (preg_match('%^(?: [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*$%xs', $string)) {
        return 'UTF-8';
    }
    // If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
    // If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
    return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}
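function convert_encoding($string, $to_encoding, $from_encoding = '')
{
    // Detect the source encoding when the caller does not supply one.
    if ($from_encoding == '') {
        $from_encoding = detect_encoding($string);
    }
    // Nothing to do if the string is already in the target encoding.
    if ($from_encoding == $to_encoding) {
        return $string;
    }
    return mb_convert_encoding($string, $to_encoding, $from_encoding);
}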
If your database contains strings in 2 different charsets, what I would do, instead of plaguing all your application code with charset detection / conversion, is to write a "one shot" script that will read all of your table records and update their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.
Just loop over the records in every table of your database and convert strings like this:
//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
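The per-record step of such a one-shot script might look like the sketch below. The database access is left out so the example can run on its own; the function name row_to_utf8, the column names and the sample rows are all hypothetical, and the sketch assumes every non-UTF-8 value is ISO-8859-1.

```php
<?php
// Hypothetical helper: returns the row with every string column in UTF-8.
// Assumes values that are not valid UTF-8 are ISO-8859-1.
function row_to_utf8(array $row): array
{
    foreach ($row as $column => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $row[$column] = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
        }
    }
    return $row;
}

// Sample rows standing in for the result of a SELECT.
$rows = [
    ['id' => 1, 'title' => "France T\xC3\xA9l\xC3\xA9com"],  // already UTF-8
    ['id' => 2, 'title' => "Cond\xE9 Nast Publications"],    // ISO-8859-1
];

foreach ($rows as $row) {
    $fixed = row_to_utf8($row);
    // A real script would issue an UPDATE for $fixed['id'] here.
    echo $fixed['id'], ': ', $fixed['title'], "\n";
}
```

Running the script once and then declaring UTF-8 everywhere keeps the detection logic out of the application code, as the answer suggests.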
I didn't try your samples here, but from past experience, there is a quick fix for this. Right after connecting to the database, execute the following query BEFORE running any other queries:
SET NAMES UTF8;
This is SQL standard compliant and works well with other databases, like Firebird and PostgreSQL.
But remember, you need to ensure UTF-8 is declared in other places too in order for your application to work correctly. Here is a quick checklist:
- All files should be saved as UTF-8 (preferably without BOM [Byte Order Mark])
- Your HTTP server should send a Content-Type header that declares the UTF-8 charset. Use Firebug or Live HTTP Headers to inspect.
- If your server compresses and/or tokenizes the response, you may see the header content as chunked or gzipped. This is not a problem if you save your files as UTF-8 and
- Declare the encoding in the HTML head, using the proper meta tag.
- Across the whole application (sockets, file system, databases...), do not forget to flag UTF-8 every time you can. Doing this when opening a database connection, for example, saves you from having to encode/decode/debug all the time. Fix the problem at the root.
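In PHP, a few of the checklist items could be covered as in the sketch below. It is only an illustration under assumptions: the page content is made up, and the PDO connection line is left as a comment because the host, database name and credentials are hypothetical.

```php
<?php
// Tell the mbstring functions to assume UTF-8 by default.
mb_internal_encoding('UTF-8');

// Send the charset in the HTTP header, before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// For the database, declare the charset when connecting, for example with
// PDO and MySQL (hypothetical host, database name and credentials):
// $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', $user, $pass);

// Repeat the declaration in the HTML head with a meta tag.
echo '<!DOCTYPE html><html><head><meta charset="UTF-8"><title>Test</title></head>';
echo "<body>France T\xC3\xA9l\xC3\xA9com</body></html>\n";
```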