Is there a Python library function which attempts to guess the character-encoding of some bytes?
+1 for the chardet module (suggested by @insin).
It is not in the standard library, but you can easily install it with the following command:
$ pip install chardet
>>> import chardet
>>> import urllib  # Python 2; on Python 3, import urllib.request and call urllib.request.urlopen
>>> detect = lambda url: chardet.detect(urllib.urlopen(url).read())
>>> detect('http://stackoverflow.com')
{'confidence': 0.85663169917190185, 'encoding': 'ISO-8859-2'}
>>> detect('https://stackoverflow.com/questions/269060/is-there-a-python-lib')
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
See Installing Pip if you don't have it.
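For bytes that are already in memory, the detector's guess can be fed straight into a decode call. The following is a minimal sketch, not part of chardet itself: decode_with_guess, min_confidence, and fallback are hypothetical names, and the detector is passed in as an argument (chardet.detect in practice) so the snippet runs without the package installed. It assumes the detector returns a dict with 'encoding' (possibly None) and 'confidence' keys, as in the session above.

```python
def decode_with_guess(data, detector, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using a detector's guess, falling back when unsure.

    `detector` is any callable returning a dict with 'encoding' and
    'confidence' keys, e.g. chardet.detect. All names here are illustrative.
    """
    guess = detector(data)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # The guess was wrong, or names a codec Python does not know.
            pass
    # Last resort: never raise; undecodable bytes become U+FFFD.
    return data.decode(fallback, errors="replace")


# Typical use, assuming chardet is installed:
#   import chardet
#   text = decode_with_guess(raw_bytes, chardet.detect)
```

The 0.5 threshold is arbitrary; a stricter value trades fewer wrong guesses for more fallbacks.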
As far as I can tell, the standard library doesn't have a function, though it's not too difficult to write one as suggested above. I think the real thing I was looking for was a way to decode a string and guarantee that it wouldn't throw an exception. The errors parameter to str.decode (bytes.decode in Python 3) does that.
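To illustrate the errors parameter: in Python 3 the method lives on bytes, and the standard handlers include 'strict' (the default), 'ignore', and 'replace'. A small sketch with a made-up sample value:

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes; \xe9 is not valid UTF-8 here

# 'strict' is the default and raises on the first bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops bytes that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")    # 'caf au lait'

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")  # 'caf\ufffd au lait'
```

Neither handler recovers the original character, so the guarantee of no exception comes at the cost of possible data loss.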
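def decode(s, encodings=('ascii', 'utf8', 'latin1')):
    # Try each candidate encoding in order and return the first clean decode.
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Nothing matched: decode as ASCII and drop any bytes that do not fit.
    return s.decode('ascii', 'ignore')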
The best way to do this that I've found is to iteratively try decoding a prospective string with each of the most common encodings inside a try/except block.
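A variation on this approach, sketched under the assumption that it is useful to know which encoding matched. The name decode_with_report and the default encoding list are illustrative. Note that latin-1 maps every byte to a character, so it never fails and only makes sense as the last entry.

```python
def decode_with_report(data, encodings=("ascii", "utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only if the list has no catch-all such as latin-1.
    return data.decode("ascii", errors="replace"), None
```

A clean decode is not proof that the guess is right: many byte sequences are valid in several encodings, so the order of the list acts as a priority.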