Python string decode

5/8/2023

In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. To interpret a byte sequence as text, you have to know the corresponding character encoding:

    unicode_text = bytestring.decode(character_encoding)

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

    >>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. If you use a wrong, incompatible encoding, the decoding may fail silently and produce mojibake:

    >>> '\u2014'.encode('utf-8').decode('cp1252')
    'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.

If you don't know the encoding, then to read binary input into a string in a way that works on both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

    import sys
    PY3K = sys.version_info >= (3, 0)

    lines = []
    for line in stream:  # stream: any iterable of byte strings
        if not PY3K:
            lines.append(line)
        else:
            lines.append(line.decode('cp437'))

Because the encoding is unknown, expect non-English symbols to translate to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

    >>> b'\x00\x01\xffsd'.decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

A similar failure, the infamous "ordinal not in range(128)", comes from the ascii codec, which was the default in Python 2. latin-1 was also popular with Python 2. The ISO 8859-1 code page has missing points (see Codepage Layout), although Python's own latin-1 codec maps all 256 byte values and does not raise on decode.

UPDATE 20150604: Python 3 has the surrogateescape error strategy for carrying binary data through strings without data loss and crashes, but it needs round-trip conversion tests (binary -> str -> binary) to validate both performance and reliability.

UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. For decoding, that works only on Python 3 (3.5 or later), so even with this workaround you will still get inconsistent output from different Python versions:

    import sys
    PY3K = sys.version_info >= (3, 0)

    lines = []
    for line in stream:
        if not PY3K:
            lines.append(line)
        else:
            lines.append(line.decode('utf-8', 'backslashreplace'))

See Python’s Unicode Support for details.

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

    # --- preparation
    import codecs

    def slashescape(err):
        """Codecs error handler. err is a UnicodeDecodeError instance.
        Return a tuple with a replacement for the undecodable part of the input
        and a position where decoding should continue."""
        # print(err, dir(err), err.start, err.end, err.object[:err.start])
        # bytearray yields integers on both Python 2 and Python 3,
        # and the failing span may be longer than one byte.
        thebytes = bytearray(err.object[err.start:err.end])
        repl = u''.join(u'\\x%02x' % b for b in thebytes)
        return (repl, err.end)

    codecs.register_error('slashescape', slashescape)

    # --- processing
    stream = [b'\x80abc']

    lines = []
    for line in stream:
        lines.append(line.decode('utf-8', 'slashescape'))
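The round-trip conversion test that the 20150604 update asks for can be sketched as follows. This is a minimal Python 3 sketch under stated assumptions, not a benchmark: `round_trips`, `every_byte` and `results` are hypothetical names introduced here for illustration, and the check covers only reliability (whether bytes -> str -> bytes gives back the input), not performance.

```python
def round_trips(data, encoding, errors="strict"):
    """Return True if decoding data and re-encoding the text reproduces data."""
    try:
        text = data.decode(encoding, errors)
        return text.encode(encoding, errors) == data
    except UnicodeError:
        # Covers both UnicodeDecodeError and UnicodeEncodeError.
        return False


# Every possible byte value once, as a worst-case "byte soup".
every_byte = bytes(range(256))

# Hypothetical summary table: strategy name -> lossless or not.
results = {
    "cp437": round_trips(every_byte, "cp437"),
    "utf-8 strict": round_trips(every_byte, "utf-8"),
    "utf-8 surrogateescape": round_trips(every_byte, "utf-8", "surrogateescape"),
    "utf-8 backslashreplace": round_trips(every_byte, "utf-8", "backslashreplace"),
}

for name, lossless in results.items():
    print(name, "lossless" if lossless else "not lossless")
```

On Python 3.5 or later, cp437 and utf-8 with surrogateescape should come out lossless. Strict utf-8 raises on the first invalid byte, and backslashreplace decodes without error but does not give the original bytes back, because the escapes are re-encoded as literal backslash sequences.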
