Handle wrongly encoded character in Python unicode string

Accepted answer
Score: 12

You have to convert your unicode string into a standard string using some encoding, e.g. utf-8:

some_unicode_string.encode('utf-8')
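To make the direction of the conversion concrete, here is a minimal sketch. The sample value and the variable names other than some_unicode_string are illustrative assumptions, not taken from the question.

```python
# Illustrative value; the u"" prefix keeps this valid on Python 2.7 and 3.3+.
some_unicode_string = u"Gl\xfcck"

# encode(): text (code points) -> bytes in a chosen encoding
utf8_bytes = some_unicode_string.encode("utf-8")

# decode(): bytes -> text, using the same encoding
round_tripped = utf8_bytes.decode("utf-8")

# U+00FC is a single code point but takes two bytes in UTF-8,
# so the byte string is one element longer than the text.
text_length = len(some_unicode_string)
byte_length = len(utf8_bytes)
```

The point is that encode() goes from text to bytes and decode() goes back; mixing up the two directions is a common source of this kind of error.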

Apart from that: this is a dupe of

BeautifulSoup findall with class attribute- unicode encode error

and at least ten other related questions on SO. Research first.

Score: 8

Your unicode string is fine:

>>> import unicodedata
>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'

The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec, but that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me, likely because my environment variable settings differ from yours):

>>> print u'Gl\xfcck'
Glück
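
To check which case applies on a given machine, something like the following sketch may help. The helper can_encode is a hypothetical name introduced here for illustration; only sys.stdout.encoding comes from the answer above.

```python
import sys

def can_encode(text, encoding):
    """Return True if `text` is representable in `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

word = u"Gl\xfcck"

# What the interpreter believes the output stream expects.
# This can be None, for example when output is piped on Python 2.
stdout_encoding = getattr(sys.stdout, "encoding", None)
print("stdout encoding: %s" % stdout_encoding)

# The "ascii" codec cannot represent U+00FC, which is what triggers the error.
ascii_ok = can_encode(word, "ascii")
utf8_ok = can_encode(word, "utf-8")
```

If the reported encoding is None or an ASCII variant, the locale environment variables (or PYTHONIOENCODING) are a likely place to look.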

Score: 4

At the beginning of your code, just after the imports, add these 3 lines.

import sys  # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')

It will override the system default encoding (ascii) for the duration of your program.

Edit: You shouldn't do this unless you are sure of the consequences, see the comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')
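
A less invasive option, not part of this answer, is to leave the process-wide default alone and wrap only the stream you write to. The sketch below assumes the standard codecs.getwriter API; utf8_writer is a hypothetical helper name, and an in-memory buffer stands in for the real output stream.

```python
import codecs
import io

def utf8_writer(byte_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8.

    Hypothetical helper; only this one stream is affected, not the
    interpreter's default encoding.
    """
    return codecs.getwriter("utf-8")(byte_stream)

# In-memory stand-in for a real binary stream.
buffer = io.BytesIO()
out = utf8_writer(buffer)

# Text goes in, UTF-8 bytes come out.
out.write(u"Gl\xfcck\n")
written = buffer.getvalue()
```

On Python 2 the same wrapper can be applied to sys.stdout itself, which confines the change to output and avoids the side effects of changing the default encoding.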

Score: 0

Do not cast what you get from model fields to a string with str() if it is already a unicode string. (Oops, I totally missed that this is not Django-related.)

Score: 0

I stumbled upon this bug myself while processing a file containing German words that, unknown to me, had been encoded in UTF-8. The problem manifested itself when I started processing the words and some of them raised the encoding error shown below.

# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40) 
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

I solved the error by calling the encode method on the string:

>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück
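
A variation on the same idea, sketched under the assumption that the goal is to keep everything as text until the output boundary: use a unicode template so that format() never has to convert the word to a byte string, then encode the finished message once. The names message, encoded_message and sink are illustrative.

```python
import io

word = u"Gl\xfcck"  # same word as in the session above

# A unicode template keeps the whole message as text, so format()
# does not attempt an implicit ASCII conversion on Python 2.
message = u"Word read was: {}".format(word)

# Encode once, at the point where bytes are actually needed.
encoded_message = message.encode("utf-8")

# Writing through a binary stream makes the encoding explicit.
sink = io.BytesIO()  # stand-in for a file or socket
sink.write(encoded_message)
```

Encoding in one place, rather than on every value passed to format(), makes it harder to mix byte strings and unicode strings by accident.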
