Python Book — 2. Strings and numbers
Unicode

As we have seen, the main problem with those 8-bit encodings is their limited size and their incompatibility with each other.

Also, what about Chinese/Japanese characters (a few thousand of each) or emojis like ☕, ⚜, 🌠 or ⛄?

That's why Unicode was born!

Unicode is an international standard to encode the characters of all the world's languages, plus symbols from mathematics and other fields.

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unicode is a registry of all known characters. The current version at the time of writing, Unicode 14.0 (September 2021), contains 144,697 characters (graphic plus format):

  • 144,532 graphic characters

  • 165 format characters

  • 65 control characters (counted separately)

Unicode covers 159 modern and historic scripts, as well as multiple symbol sets and emoji.

Each Unicode character has a unique name and a codepoint (its number). The codepoint space is divided into planes (sets of characters); a codepoint identifies the plane and the position of the character within that plane.

Some examples:

0061   'a'; LATIN SMALL LETTER A
...
007B   '{'; LEFT CURLY BRACKET
...
265F   '♟'; BLACK CHESS PAWN
...
2615   '☕'; HOT BEVERAGE
...
0491   'ґ'; CYRILLIC SMALL LETTER GHE WITH UPTURN
...
0142   'ł'; LATIN SMALL LETTER L WITH STROKE
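From Python, a character's codepoint can be inspected with the builtins ord() and chr() (a minimal sketch):

```python
# ord() gives the codepoint of a character, chr() goes the other way
print(hex(ord('a')))   # 0x61
print(hex(ord('☕')))  # 0x2615
print(chr(0x2615))     # ☕
print(chr(0x1F40D))    # 🐍 (a codepoint outside the Basic Multilingual Plane)
```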

Let's check our ґ character:

| Parameter | Value |
| --- | --- |
| Value | ґ |
| Name | Cyrillic Small Letter Ghe with Upturn |
| Codepoint | 0491 |
| In Python | '\u0491' |
| Link | https://unicode-table.com/en/0491/ |

or ł:

| Parameter | Value |
| --- | --- |
| Value | ł |
| Name | Latin Small Letter L with Stroke |
| Codepoint | 0142 |
| In Python | '\u0142' |
| Link | https://unicode-table.com/en/0142/ |

or ☕:

| Parameter | Value |
| --- | --- |
| Value | ☕ |
| Name | Hot Beverage |
| Codepoint | 2615 |
| In Python | '\u2615' |
| Link | https://unicode-table.com/en/2615/ |

A string in Python 3 is a sequence of code points.

🪄 Code:

s = 'ґуґл 💝 ☕'
s.upper()

📟 Output:

'ҐУҐЛ 💝 ☕'

🪄 Code:

s[0]

📟 Output:

'ґ'

🪄 Code:

'\u0491 and \u2615'

📟 Output:

'ґ and ☕'

🪄 Code:

w = 'Z\u0142oto daj wiedźminowi'
w[1], w[-7]

📟 Output:

('ł', 'ź')

Via the builtin module unicodedata it is possible to get the standardized name of a Unicode character or resolve such a name into a character:

🪄 Code:

import unicodedata
print(unicodedata.name("ґ"))
print(unicodedata.name("ł"))
print(unicodedata.name("☕"))
print(unicodedata.name("🟥"))

📟 Output:

CYRILLIC SMALL LETTER GHE WITH UPTURN
LATIN SMALL LETTER L WITH STROKE
HOT BEVERAGE
LARGE RED SQUARE

🪄 Code:

cap_name = unicodedata.name("ł").replace("SMALL", "CAPITAL")
print(cap_name)
print(unicodedata.lookup(cap_name))

📟 Output:

LATIN CAPITAL LETTER L WITH STROKE
Ł

Encoding and Decoding

So, let's summarize the previous section:

  • Unicode is the registry of all known characters

  • Unicode character has a codepoint and generalized name

  • Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal).

So, Unicode is a database of all characters known to human beings. To write them to a file we use encodings.

A Unicode string needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding. The opposite procedure is called decoding.

An encoding is a mapping from the Unicode database to the exact bytes used to represent characters.

Note: In Python 3 "utf8" is the default encoding, so we can skip it in encode and decode methods.

  • encoding (transform Unicode into bytes)

    • 'Złoto dla tego Wiedźmina, szybko!'.encode('utf8')

  • decoding (transform bytes into Unicode)

    • b'Wied\xc5\xbamin'.decode('utf8')
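Putting both directions together, a small round-trip sketch (since "utf8" is the default, the argument may be omitted):

```python
# encode: str -> bytes
data = 'Złoto dla tego Wiedźmina, szybko!'.encode('utf8')
print(data)

# decode: bytes -> str
text = b'Wied\xc5\xbamin'.decode('utf8')
print(text)  # Wiedźmin

# a round trip restores the original string; 'utf8' is the default codec
assert data.decode() == 'Złoto dla tego Wiedźmina, szybko!'
```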

As we can see, characters present in ASCII are shown as text right away; absent characters are shown as the hex form of the bytes used to represent them.

Encoding:

str.encode(encoding='utf-8', errors='strict')

Encode the string using the codec registered for encoding.

errors by default is 'strict', meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace' and 'backslashreplace'.

🪄 Code:

'Bo w każdym z nas jest Chaos i Ład, Dobro i Zło.'.encode('ascii', 'ignore')

📟 Output:

b'Bo w kadym z nas jest Chaos i ad, Dobro i Zo.'

🪄 Code:

'Bo w każdym z nas jest Chaos i Ład, Dobro i Zło.'.encode('ascii', 'replace')

📟 Output:

b'Bo w ka?dym z nas jest Chaos i ?ad, Dobro i Z?o.'

🪄 Code:

'Bo w każdym z nas jest Chaos i Ład, Dobro i Zło.'.encode('ascii', 'xmlcharrefreplace')

📟 Output:

b'Bo w każdym z nas jest Chaos i Ład, Dobro i Zło.'

🪄 Code:

'Bo w każdym z nas jest Chaos i Ład, Dobro i Zło.'.encode('ascii', 'backslashreplace')

📟 Output:

b'Bo w ka\\u017cdym z nas jest Chaos i \\u0141ad, Dobro i Z\\u0142o.'

Decoding:

bytes.decode(encoding='utf-8', errors='strict')

Decode the bytes using the codec registered for encoding.

errors by default is 'strict', meaning that decoding errors raise a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'.
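A small sketch of the decode error handlers on a deliberately broken byte sequence:

```python
good = b'Z\xc5\x82oto'   # valid UTF-8 for 'Złoto'
bad = b'Z\xc5oto'        # \xc5 starts a 2-byte sequence, but 'o' is not a valid continuation byte

print(good.decode('utf8'))            # Złoto
print(bad.decode('utf8', 'replace'))  # Z�oto (U+FFFD replacement character)
print(bad.decode('utf8', 'ignore'))   # Zoto

try:
    bad.decode('utf8')  # errors='strict' is the default
except UnicodeDecodeError as e:
    print(e)
```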

UTF-8

UTF-8 is a dynamically sized Unicode Transformation Format.

UTF-8 does not use a fixed 1 byte (8 bits) per character; it uses from 1 to 4 bytes:

  • 1 byte for ASCII

  • 2 bytes for most Latin-derived and Cyrillic languages

  • 3 bytes for the rest of the Basic Multilingual Plane (including most CJK characters)

  • 4 bytes for characters outside the BMP: rare ideographs, many symbols and emojis
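These byte counts are easy to observe by encoding one sample character from each group (a quick sketch):

```python
# one sample character per UTF-8 length class:
# ASCII, Latin with diacritic, Cyrillic, katakana, emoji
for ch in ('a', 'ł', 'ґ', 'パ', '🐍'):
    print(ch, len(ch.encode('utf8')))  # 1, 2, 2, 3, 4 bytes respectively
```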

A few words on hex notation:

  • 0x** - an integer literal written in hex

  • \x** - a character escape inside a string

🪄 Code:

0x41

📟 Output:

65

🪄 Code:

"\x41"

📟 Output:

'A'

🪄 Code:

"\x01" # a non printable character

📟 Output:

'\x01'

The character ł can be represented in various Unicode-friendly encodings:

| Encoding | hex | dec (bytes) | dec | binary |
| --- | --- | --- | --- | --- |
| UTF-8 | C5 82 | 197 130 | 50562 | 11000101 10000010 |
| UTF-16BE | 01 42 | 1 66 | 322 | 00000001 01000010 |
| UTF-16LE | 42 01 | 66 1 | 16897 | 01000010 00000001 |
| UTF-32BE | 00 00 01 42 | 0 0 1 66 | 322 | 00000000 00000000 00000001 01000010 |
| UTF-32LE | 42 01 00 00 | 66 1 0 0 | 1107361792 | 01000010 00000001 00000000 00000000 |

Notice that in UTF-8 it is C5 82, meaning it should be encoded as \xc5\x82.
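The table rows can be verified by encoding ł with each codec (a quick check; codec names follow Python's spelling):

```python
ch = 'ł'
print(ch.encode('utf-8'))      # b'\xc5\x82'
print(ch.encode('utf-16-be'))  # b'\x01B'          (01 42)
print(ch.encode('utf-16-le'))  # b'B\x01'          (42 01)
print(ch.encode('utf-32-be'))  # b'\x00\x00\x01B'  (00 00 01 42)
print(ch.encode('utf-32-le'))  # b'B\x01\x00\x00'  (42 01 00 00)
```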

Let's encode a string containing this character to verify that we find the expected bytes:

🪄 Code:

'Bo w każdym z nas jest Chaos i Ład, Dobro i Zło.'.encode('utf8')

📟 Output:

b'Bo w ka\xc5\xbcdym z nas jest Chaos i \xc5\x81ad, Dobro i Z\xc5\x82o.'

Here we see that:

  • ż was encoded with 2 bytes: \xc5\xbc

  • Ł was encoded with 2 bytes: \xc5\x81

  • ł was encoded with 2 bytes: \xc5\x82

Ukrainian characters also take 2 bytes each:

🪄 Code:

"Сію-вію, сію-вію конопелечки...".encode("utf8")

📟 Output:

b'\xd0\xa1\xd1\x96\xd1\x8e-\xd0\xb2\xd1\x96\xd1\x8e, \xd1\x81\xd1\x96\xd1\x8e-\xd0\xb2\xd1\x96\xd1\x8e \xd0\xba\xd0\xbe\xd0\xbd\xd0\xbe\xd0\xbf\xd0\xb5\xd0\xbb\xd0\xb5\xd1\x87\xd0\xba\xd0\xb8...'

Japanese katakana (the word means "Python" - paisonu); as we can see, each character takes 3 bytes:

🪄 Code:

python_in_japanese = 'パイソン'

print(python_in_japanese.encode('utf8'))
print(int(len(python_in_japanese.encode('utf8')) / len(python_in_japanese)))

📟 Output:

b'\xe3\x83\x91\xe3\x82\xa4\xe3\x82\xbd\xe3\x83\xb3'
3

And lastly, emojis take 3-4 bytes:

🪄 Code:

print('☕'.encode('utf8'))
print('🐍'.encode('utf8'))

📟 Output:

b'\xe2\x98\x95'
b'\xf0\x9f\x90\x8d'
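Note that len() of a string counts code points, while len() of its encoded form counts bytes (a quick illustration):

```python
s = '🐍'
print(len(s))                  # 1 code point
print(len(s.encode('utf8')))   # 4 bytes
```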

Python supports a lot of different encodings; see the list of standard encodings in the codecs module documentation.

UTF-8 is one of the best encodings made specifically to cover Unicode characters, and the dominant encoding for Unicode text. That's why UTF-8 is the standard text encoding in Python, Linux, macOS, modern Windows and the Web.

Let's check our popular ł character again on Unicode Table: https://unicode-table.com/en/0142/
