Unicode
As we could see, the main problem with those 8-bit encodings is their limitation and incompatibility with each other.
Also, what about Chinese/Japanese characters (a few thousand of each) or emojis like ☕, ⚜, 🌠 or ⛄?
That's why Unicode was born!
Unicode is an international standard to encode the characters of all the world’s languages, plus symbols from mathematics and other fields.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Unicode is a registry of all known characters. The current version, Unicode 14.0 (September 2021), contains 144,697 characters:

- 144,469 graphic characters
- 163 format characters
- 65 control characters

Unicode covers 159 modern and historic scripts, as well as multiple symbol sets and emoji.
Each Unicode symbol has a unique name and a code point (its number). A code point identifies a plane (a set of characters) and the position of the character within that plane.
Some examples: let's check our ґ character, or ł, or ☕:
A string in Python 3 is a sequence of code points.
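For example, ord() and chr() let us move between a character and its code point (an illustrative sketch):

```python
# A str is a sequence of code points; ord() and chr() convert
# between a character and its number.
s = "Złoto"
print(len(s))         # 5 code points
print(ord("☕"))       # 9749 (U+2615)
print(hex(ord("ł")))  # 0x142
print(chr(0x141))     # Ł
```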
Via the builtin module unicodedata
it is possible to get the standardized name of a Unicode character or resolve that name into a character:
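A quick sketch of both directions, using the characters from earlier:

```python
import unicodedata

# Get the standardized name of a character...
print(unicodedata.name("ł"))  # LATIN SMALL LETTER L WITH STROKE
print(unicodedata.name("☕"))  # HOT BEVERAGE

# ...and resolve a name back into a character
print(unicodedata.lookup("LATIN SMALL LETTER L WITH STROKE"))  # ł
```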
Encoding and Decoding
So, let's summarize the previous section:

- Unicode is the registry of all known characters
- A Unicode character has a code point and a generalized name
- A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal)
So, Unicode is a database of all characters known to human beings. To write them to a file we use encodings.
Unicode string needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding. The opposite procedure is called decoding.
An encoding is a mapping from the Unicode database to the exact bytes used to represent characters.
Python supports a lot of different encodings: see the official doc.
Note: in Python 3 "utf8" is the default encoding, so we can skip it in the encode and decode methods.
- encoding (transform Unicode into bytes): 'Złoto dla tego Wiedźmina, szybko!'.encode('utf8')
- decoding (transform bytes into Unicode): b'Wied\xc5\xbamin'.decode('utf8')
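Running the same round trip in an interpreter (illustrative):

```python
s = 'Złoto dla tego Wiedźmina, szybko!'
data = s.encode('utf8')        # str -> bytes
print(data)
print(data.decode('utf8'))     # bytes -> str
assert data.decode('utf8') == s  # the round trip is lossless
```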
As we can see, characters present in ASCII are shown as text right away, while absent characters are shown as the hex form of the bytes used to represent them.
Encoding:
str.encode(encoding='utf-8', errors='strict')
Encode the string using the codec registered for encoding.
errors is 'strict' by default, meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace' and 'backslashreplace'.
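For instance, encoding a string containing ź (absent from ASCII) shows how each handler behaves (an illustrative sketch):

```python
s = "Wiedźmin"
# errors="strict" (the default) would raise UnicodeEncodeError here
print(s.encode("ascii", errors="ignore"))             # b'Wiedmin'
print(s.encode("ascii", errors="replace"))            # b'Wied?min'
print(s.encode("ascii", errors="xmlcharrefreplace"))  # b'Wied&#378;min'
print(s.encode("ascii", errors="backslashreplace"))   # b'Wied\\u017amin'
```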
Decoding:
bytes.decode(encoding='utf-8', errors='strict')
Decode the bytes using the codec registered for encoding.
errors is 'strict' by default, meaning that decoding errors raise a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'.
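A short sketch of decoding the same UTF-8 bytes with a wrong codec and different handlers:

```python
data = b"Wied\xc5\xbamin"
# errors="strict" (the default) would raise UnicodeDecodeError here
print(data.decode("ascii", errors="ignore"))   # 'Wiedmin' - bad bytes dropped
print(data.decode("ascii", errors="replace"))  # 'Wied\ufffd\ufffdmin' - U+FFFD markers
```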
UTF-8
UTF-8 is a dynamically sized Unicode Transformation Format.
UTF-8 is one of the best encodings made specifically to cover Unicode characters. It is the standard encoding of Unicode. That's why UTF-8 is the standard text encoding in Python, Linux, macOS, modern Windows and the Web.
UTF-8 is dynamic: it uses not a fixed 1 byte (8 bits) but from 1 to 4 bytes per character:

- 1 byte for ASCII
- 2 bytes for most Latin-derived and Cyrillic languages
- 3 bytes for the rest of the basic multilingual plane
- 4 bytes for Asian languages, symbols and emojis
A few words on the hex format:

- 0x** - representation of a number
- \x** - representation in a string
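A tiny sketch showing both forms side by side:

```python
print(0x61)           # 97  - hex literal for a number
print("\x61")         # a   - hex escape inside a string
print(hex(ord("a")))  # 0x61
```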
Let's check again our popular ł character on the Unicode Table.
We can see that this character can be represented in various Unicode-friendly encodings:
Notice that in UTF-8 it is C5 82, meaning it should be encoded as \xc5\x82.
Let's check the encoded strings with this character to verify the bytes we expect to find:
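An illustrative check of the three Polish characters:

```python
for ch in "żŁł":
    print(ch, ch.encode("utf8"))
# ż b'\xc5\xbc'
# Ł b'\xc5\x81'
# ł b'\xc5\x82'
```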
Here we see that:

- ż was encoded with 2 bytes: \xc5\xbc
- Ł was encoded with 2 bytes: \xc5\x81
- ł was encoded with 2 bytes: \xc5\x82
Ukrainian characters also take 2 bytes each:
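For example, encoding the word ґанок ("porch", my pick to reuse the ґ character from earlier):

```python
s = "ґанок"
data = s.encode("utf8")
print(data)               # b'\xd2\x91\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xba'
print(len(s), len(data))  # 5 characters, 10 bytes
```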
Japanese katakana (the word means "Python" - paisonu); as we can see, each character takes 3
bytes:
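A sketch, assuming the word in question is パイソン:

```python
s = "パイソン"                 # katakana for "Python"
print(len(s))                 # 4 characters
print(len(s.encode("utf8")))  # 12 bytes - 3 per character
print(s.encode("utf8"))
```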
And lastly, emojis take 3-4
bytes:
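Checking the emojis from the beginning of this section:

```python
for e in "☕⚜🌠⛄":
    print(e, len(e.encode("utf8")))
# ☕ 3
# ⚜ 3
# 🌠 4
# ⛄ 3
```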