A string is an immutable iterable (a sequence) consisting of Unicode characters. To Python it is almost like any other sequence (most like a tuple, which is the immutable counterpart of a list).
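As a minimal sketch of what "immutable sequence" means in practice (the variable names here are only illustrative), the snippet below uses a string like a tuple and then shows that item assignment is rejected:

```python
text = "Hello"

# Sequence behaviour: length, iteration and membership work as for a tuple.
length = len(text)          # 5
chars = list(text)          # ['H', 'e', 'l', 'l', 'o']
has_ell = "ell" in text     # True (for strings, "in" checks substrings)

# Immutability: item assignment raises TypeError.
try:
    text[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string always means building a new one.
new_text = "J" + text[1:]   # 'Jello'

print(length, chars, has_ell, mutated, new_text)
```

The original object is never modified: text is still 'Hello' after the snippet runs.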
String literals are written in a variety of ways:
Single quotes: 'allows embedded "double" quotes'
Double quotes: "allows embedded 'single' quotes"
Triple quoted: '''Three single quotes''', """Three double quotes"""
🪄 Code:
s1 = "Hello, I'm nice little string"
s2 = 'Hello, I\'m nice little string'  # escaping '
print(s1)
print(s2)
📟 Output:
Hello, I'm nice little string
Hello, I'm nice little string
Multiline strings (the quoting style is only a matter of syntax; for Python all of these are the same kind of object):
🪄 Code:
big_string = """She’s fifteen, sells flowers at the train station.
Sun and berries sweeten the oxygen beyond the mines.
Trains stop for a moment, move further on.
Soldiers go to the East, soldiers go to the West.
(c) Serhiy Zhadan"""
print(big_string)
📟 Output:
She’s fifteen, sells flowers at the train station.
Sun and berries sweeten the oxygen beyond the mines.
Trains stop for a moment, move further on.
Soldiers go to the East, soldiers go to the West.
(c) Serhiy Zhadan
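To illustrate that the quoting style really is only syntax, here is a small sketch (with made-up sample text) comparing a triple-quoted string with an ordinary one that uses the \n escape:

```python
triple = """first line
second line"""
escaped = "first line\nsecond line"

same_value = triple == escaped                      # True
same_type = type(triple) is type(escaped) is str    # True
line_count = len(triple.splitlines())               # 2

print(same_value, same_type, line_count)
print(repr(triple))  # the line break is stored as a plain \n character
```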
Main methods of strings
🪄 Code:
print(dir("some_string"))  # Emm... actually all methods...
phone_num = "066-749-99-99"
name = "Johnny"
name_with_spaces = name + " Walker"
# Note that whitespaces are not alpha!
phone_num.isalpha(), name.isalpha(), name_with_spaces.isalpha()
📟 Output:
(False, True, False)
Checks of numeric/digits
The difference between the three related methods isdecimal(), isdigit() and isnumeric() can be hard to grasp at first, so the following tables show how they differ:
isdecimal()
Accepts:
decimal digits only: 0123456789 (and their equivalents in other scripts, Unicode category Nd)
Test 🡒 Result
'123'.isdecimal() 🡒 True
'1₂34⁵'.isdecimal() 🡒 False
'½¾⅚'.isdecimal() 🡒 False
'一二三四五'.isdecimal() 🡒 False
isdigit()
Accepts:
decimal digits: 0123456789
super- and subscripts
Test 🡒 Result
'123'.isdigit() 🡒 True
'1₂34⁵'.isdigit() 🡒 True
'½¾⅚'.isdigit() 🡒 False
'一二三四五'.isdigit() 🡒 False
isnumeric()
Accepts:
decimal digits: 0123456789
super- and subscripts
vulgar fractions
numeric Unicode characters from other languages
Test 🡒 Result
'123'.isnumeric() 🡒 True
'1₂34⁵'.isnumeric() 🡒 True
'½¾⅚'.isnumeric() 🡒 True
'一二三四五'.isnumeric() 🡒 True
The playground code to test these methods on those strings:
🪄 Code:
METHODS = "isdecimal", "isdigit", "isnumeric"
TEST_STRINGS = '123', '1₂34⁵', '½¾⅚', '一二三'
for str_ in TEST_STRINGS:
    for method in METHODS:
        method_str = f'{repr(str_)}.{method}()'
        print(f'{method_str:23} 🡒 {eval(method_str)}')
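The playground builds each call as text and runs it with eval(). As an alternative sketch (not the original author's approach), the same checks can be made with getattr(), which looks a method up by name without evaluating any source text. The results dictionary is an illustrative addition:

```python
METHODS = "isdecimal", "isdigit", "isnumeric"
TEST_STRINGS = "123", "1₂34⁵", "½¾⅚", "一二三"

results = {}
for str_ in TEST_STRINGS:
    for method in METHODS:
        # getattr returns the bound method; the trailing () calls it
        results[(str_, method)] = getattr(str_, method)()
        print(f"{str_!r}.{method}() -> {results[(str_, method)]}")
```

Avoiding eval() matters mostly when the tested strings come from user input, where evaluating assembled code would be unsafe.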
A slightly hackish way to check that a string contains only letters and spaces is to remove the spaces with .replace() first:
🪄 Code:
"Hello World".replace(" ", "").isalpha()
📟 Output:
True
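A more explicit variant of the same check, sketched here with a hypothetical helper named is_alpha_with_spaces, tests each character instead of editing the string. It also treats tabs and newlines as whitespace, which the .replace(" ", "") trick does not:

```python
def is_alpha_with_spaces(text):
    """Return True if text has at least one letter and only letters/whitespace."""
    has_letter = any(ch.isalpha() for ch in text)
    only_allowed = all(ch.isalpha() or ch.isspace() for ch in text)
    return has_letter and only_allowed


print(is_alpha_with_spaces("Hello World"))    # True
print(is_alpha_with_spaces("Hello\tWorld"))   # True
print(is_alpha_with_spaces("Hello World 2"))  # False
print(is_alpha_with_spaces("   "))            # False
```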
Stripping - removing whitespace
.strip() - remove from both the beginning and the end
.rstrip() - remove only from the end
.lstrip() - remove only from the beginning
🪄 Code:
whitespaces_str = " Some text goes and goes... "
print("Initial string: >>>" + whitespaces_str + "<<<")
print("Stripped string: >>>" + whitespaces_str.strip() + "<<<")
print("R-stripped string: >>>" + whitespaces_str.rstrip() + "<<<")
print("L-stripped string: >>>" + whitespaces_str.lstrip() + "<<<")
📟 Output:
Initial string: >>> Some text goes and goes... <<<
Stripped string: >>>Some text goes and goes...<<<
R-stripped string: >>> Some text goes and goes...<<<
L-stripped string: >>>Some text goes and goes... <<<
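The three methods also accept an optional argument naming the characters to remove instead of whitespace. A short sketch with made-up sample strings:

```python
decorated = "***Some text***"

both = decorated.strip("*")     # 'Some text'
left = decorated.lstrip("*")    # 'Some text***'
right = decorated.rstrip("*")   # '***Some text'

# The argument is a set of characters, not a prefix or suffix to match.
char_set = "abcHELLOcba".strip("abc")   # 'HELLO'

# Without an argument, tabs and newlines are stripped as well as spaces.
default = "\t text \n".strip()          # 'text'

print(both, left, right, char_set, default, sep=" | ")
```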
Get a character by index (this is possible because a string is a sequence)
Indexing starts from 0. Negative indexing means counting from the end, so -1 is the last item.
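A minimal sketch of both indexing directions, using an arbitrary sample word:

```python
word = "Python"

first = word[0]          # 'P'
last = word[-1]          # 'n'
second_last = word[-2]   # 'o'

# The last valid positive index is len(word) - 1; going past it raises IndexError.
try:
    word[len(word)]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, second_last, out_of_range)
```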
A rare feature: string literals are merged into one string when they are separated only by whitespace. This works only for literals written directly in the code, not for variables or function call results.
🪄 Code:
"Hello"" World"
📟 Output:
'Hello World'
🪄 Code:
some_big_string = "Everything started with music, " \
                  "with scars left by songs " \
                  "heard at fall weddings with other kids my age."
print(some_big_string)
📟 Output:
Everything started with music, with scars left by songs heard at fall weddings with other kids my age.
🪄 Code:
some_big_string = (
    "You will reply today, touching warm letters, "
    "leafing through them in the dark, confusing vowels with consonants, "
    "like a typewriter in an old Warsaw office. "
)
print(some_big_string)
📟 Output:
You will reply today, touching warm letters, leafing through them in the dark, confusing vowels with consonants, like a typewriter in an old Warsaw office.
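The restriction to literals can be checked directly. In this sketch compile() is used only so that the expected SyntaxError can be caught without breaking the script. The variable names are illustrative:

```python
greeting = "Hello"

# Adjacent literals are joined into one string.
joined_literals = "Hello" " World"

# A variable placed next to a literal is a syntax error.
try:
    compile('greeting " World"', "<demo>", "eval")
    variable_works = True
except SyntaxError:
    variable_works = False

# With variables, use the + operator (or an f-string) instead.
joined_with_plus = greeting + " World"
joined_with_fstring = f"{greeting} World"

print(joined_literals, variable_works, joined_with_plus, joined_with_fstring)
```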
Unicode
ASCII
In the past, the characters available for text data were limited by an encoding standard called ASCII (American Standard Code for Information Interchange). The key word is "American": everything outside the unaccented Latin alphabet was missing from that table.
The basics of ASCII were simple - a single byte of data (8 bits) was used per character. The first 7 bits encoded the identifier of the character, so ASCII had 128 characters in total (2^7). This was done because at that time 128 characters were thought to be enough, and the last bit was used either for error checking or for enabling italics, or was simply set to 0.
Anyway, the 128 ASCII characters were: 26 uppercase letters, 26 lowercase letters, 10 digits, punctuation symbols, some spacing characters, and some nonprintable control codes like \n (line feed), \r (carriage return), \a (bell), \b (backspace) etc.
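The arithmetic above can be verified with the standard string module. This is only a sketch: treating code points 0-31 plus 127 as the control codes and counting the space separately is a grouping chosen here for illustration.

```python
import string

ascii_size = 2 ** 7   # 7 bits give 128 code points

counts = {
    "uppercase": len(string.ascii_uppercase),    # 26
    "lowercase": len(string.ascii_lowercase),    # 26
    "digits": len(string.digits),                # 10
    "punctuation": len(string.punctuation),      # 32
    "space": 1,
    # code points 0-31 plus 127 (DEL) are nonprintable control codes
    "control": sum(1 for code in range(128) if code < 32 or code == 127),
}
total = sum(counts.values())

# The code of 'A' fits in 7 bits; the leading eighth bit stays 0.
bits_A = format(ord("A"), "08b")   # '01000001'

print(counts, total, bits_A)
```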
So, ASCII is great, but with it we can write neither Cyrillic text like ґуґл з'їв яйко-сподівайко nor Swedish like Surströmming. The reason is simple - the table just doesn't have the code points needed for these characters.
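A quick way to see this limitation is to try encoding the sample words as ASCII. The sketch below records which ones succeed. The ascii_ok dictionary is just an illustrative helper:

```python
samples = ["Hello", "Surströmming", "ґуґл"]

ascii_ok = {}
for sample in samples:
    try:
        sample.encode("ascii")
        ascii_ok[sample] = True
    except UnicodeEncodeError:
        # at least one character has no code point in the ASCII table
        ascii_ok[sample] = False

print(ascii_ok)

# str.isascii() (Python 3.7+) answers the same question without encoding.
print([sample.isascii() for sample in samples])
```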
That's why, at some point, other encodings (tables of code points) that use all 8 bits were created:
latin-1, windows-1252 - cover most Western European languages
latin-2 - Central and Eastern European languages
windows-1251, koi8 - cover most Cyrillic languages
big5 - Traditional Chinese
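Because each of these tables assigns its own meaning to the upper 128 values, the same byte can stand for different characters depending on the encoding. A small sketch using byte 0xF6:

```python
raw = b"\xf6"

as_latin1 = raw.decode("latin-1")        # 'ö'
as_cp1251 = raw.decode("windows-1251")   # 'ц'

# In single-byte encodings every character takes exactly one byte.
word = "Surströmming"
one_byte_each = len(word.encode("latin-1")) == len(word)

print(as_latin1, as_cp1251, one_byte_each)
```

This is why bytes alone are not enough to recover text: the reader must also know which encoding produced them.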
To encode a Python string into some encoding, the string method encode(encoding) is used:
🪄 Code:
print("Surströmming".encode("latin_1"))
# print("Surströmming".encode("windows-1251")) # WON'T WORK
print("Piękna jest taka pewność, ale niepewność jest piękniejsza.".encode("latin2"))
print("Ґуґл з'їв яйко-сподівайко".encode("windows-1251"))
print("Ґуґл з'їв яйко-сподівайко".encode("latin1"))  # WON'T WORK: raises UnicodeEncodeError
📟 Output:
b'Surstr\xf6mming'
b'Pi\xeakna jest taka pewno\xb6\xe6, ale niepewno\xb6\xe6 jest pi\xeakniejsza.'
b"\xa5\xf3\xb4\xeb \xe7'\xbf\xe2 \xff\xe9\xea\xee-\xf1\xef\xee\xe4\xb3\xe2\xe0\xe9\xea\xee"
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
Input In [6], in <cell line: 7>()
4 print("Piękna jest taka pewność, ale niepewność jest piękniejsza.".encode("latin2"))
6 print("Ґуґл з'їв яйко-сподівайко".encode("windows-1251"))
----> 7 print("Ґуґл з'їв яйко-сподівайко".encode("latin1"))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
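When such a failure is expected, it can be caught, or encode() can be given its optional errors argument. The sketch below shows a few of the standard error handlers. Which one is appropriate depends on the use case, since all of them lose or alter information:

```python
text = "Surströmming"

# Default behaviour (errors="strict"): raise UnicodeEncodeError.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

replaced = text.encode("ascii", errors="replace")           # b'Surstr?mming'
ignored = text.encode("ascii", errors="ignore")             # b'Surstrmming'
escaped = text.encode("ascii", errors="backslashreplace")   # b'Surstr\\xf6mming'

# With a suitable encoding nothing is lost: decode() restores the original.
round_trip = text.encode("latin-1").decode("latin-1")

print(failed, replaced, ignored, escaped, round_trip == text)
```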