Strings
A string is an immutable sequence of Unicode characters. Python treats it almost like any other sequence; it is closest to a tuple, which is the immutable counterpart of a list.
String literals are written in a variety of ways:
Single quotes:
'allows embedded "double" quotes'
Double quotes:
"allows embedded 'single' quotes"
Triple quoted:
'''Three single quotes''', """Three double quotes"""
🪄 Code:
📟 Output:
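A minimal sketch of the quoting styles listed above. The variable names are only illustrative:

```python
# Equivalent ways to write a string literal.
single = 'allows embedded "double" quotes'
double = "allows embedded 'single' quotes"
triple = '''Three single quotes'''
also_triple = """Three double quotes"""

# The quoting style is not stored anywhere: all four are plain str objects.
print(type(single), type(double), type(triple), type(also_triple))
print(single)
print(double)
```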
Multiline strings (the way they are written is only a matter of syntax; for Python they are all the same type):
🪄 Code:
📟 Output:
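As an illustration, the same two-line value can be written in several ways. This is a sketch with made-up variable names:

```python
# Triple quotes keep the line breaks typed in the source.
triple_quoted = """first line
second line"""

# The same value written with an explicit \n escape.
escaped = "first line\nsecond line"

# Adjacent literals inside parentheses are joined; no newline is added implicitly.
joined = ("first line\n"
          "second line")

print(triple_quoted == escaped == joined)  # True
```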
Main methods of strings
🪄 Code:
📟 Output:
Cosmetic methods:
lower(), upper()
Return a new string in lowercase / uppercase
title(), capitalize()
Return a new string with every word capitalized / with only the first character capitalized
center(w), ljust(w), rjust(w)
Return a new string of length w with the original text centered / left-justified / right-justified
strip(), rstrip(), lstrip()
Return a new string with whitespace removed from both ends / from the end only / from the beginning only
replace(s, r[, count])
Return a new string with all occurrences of sub-string s replaced by string r (at most count occurrences if count is given)
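A short sketch of the justification methods and replace(). The fill characters and sample strings are arbitrary choices for the example:

```python
centered = "hi".center(6, "*")   # optional second argument is the fill character
left = "hi".ljust(6, ".")
right = "hi".rjust(6)            # pads with spaces by default

print(centered)     # **hi**
print(left)         # hi....
print(repr(right))  # '    hi'

# replace() changes every occurrence unless count limits it.
print("a-b-c".replace("-", "+"))     # a+b+c
print("a-b-c".replace("-", "+", 1))  # a+b-c
```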
Checks:
s in some_string
Return True/False: whether sub-string s is part of some_string
islower(), isupper()
Return True/False: whether all cased characters are lowercase / uppercase (False if there are no cased characters)
startswith(s), endswith(s)
Return True/False: whether the string starts / ends with sub-string s
isalpha(), isalnum()
Return True/False: whether all characters are alphabetic / alphanumeric
isdecimal(), isdigit(), isnumeric()
Return True/False: whether all characters are regular decimal digits / digits including superscripts and subscripts / any numeric Unicode characters
isspace()
Return True/False: whether all characters are whitespace (" ", "\n", "\t", etc.)
isprintable()
Return True/False: whether all characters are printable
Searching:
count(s)
Return the number of non-overlapping occurrences of sub-string s in the string
index(s)
Return the index of the first occurrence of sub-string s, or raise ValueError if it is not found
find(s)
Return the index of the first occurrence of sub-string s, or -1 if it is not found
len(some_string)
Return an int: the length of the string
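The difference between count(), find() and index() can be sketched on a small sample string (the word is arbitrary):

```python
text = "banana"

print(text.count("an"))   # 2
print(text.count("ana"))  # 1, because occurrences must not overlap
print(text.find("na"))    # 2
print(text.find("x"))     # -1 when nothing is found
print(text.index("na"))   # 2

# index() raises instead of returning -1.
try:
    text.index("x")
except ValueError as exc:
    print("ValueError:", exc)
```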
Split, join, obtaining parts of string:
split(s)
Return a list of the parts of the string separated by delimiter s (runs of whitespace by default)
splitlines()
Return a list of the lines of the string, split at line endings
s.join(str_iterable)
Return a new string: all strings from the iterable merged, with s as the delimiter between them
some_string[i]
Return a new string: the single character at index i
some_string[n1:n2:step]
Return a new string: the sub-string from n1 up to n2 (non-inclusive), taking every step-th character
Some examples
Adding, multiplying(!) strings
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
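A minimal sketch of both operators, with arbitrary sample values:

```python
greeting = "Hello" + ", " + "world"  # + concatenates strings
line = "-" * 10                      # * repeats a string
echo = 3 * "ab"                      # the int may also come first

print(greeting)  # Hello, world
print(line)      # ----------
print(echo)      # ababab
```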
Get length
🪄 Code:
📟 Output:
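For instance, with an arbitrary sample string:

```python
greeting = "Hello, world"

print(len(greeting))  # 12: letters, the comma and the space all count
print(len(""))        # 0
```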
Cosmetic/styling methods:
lower, upper, title, capitalize
🪄 Code:
📟 Output:
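A sketch of the four methods on one mixed-case sample. Each call returns a new string and leaves the original untouched:

```python
text = "hello pYTHON world"

print(text.lower())       # hello python world
print(text.upper())       # HELLO PYTHON WORLD
print(text.title())       # Hello Python World
print(text.capitalize())  # Hello python world
print(text)               # hello pYTHON world (unchanged)
```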
Various checks: lower/upper case, all digits, all letters. Each returns True or False.
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
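A few of these checks, sketched on small sample strings:

```python
print("python".islower())   # True
print("Python".islower())   # False
print("PYTHON".isupper())   # True
print("12345".isdigit())    # True
print("abc".isalpha())      # True
print("abc123".isalnum())   # True
print("abc 123".isalnum())  # False: the space is neither a letter nor a digit
print("Python".startswith("Py"), "Python".endswith("on"))  # True True
print("tho" in "Python")    # True
```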
Nice examples regarding checks:
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
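Some edge cases that tend to surprise, shown as a sketch. These samples are chosen for illustration and are not necessarily the ones from the original listings:

```python
print("".isalpha())             # False: an empty string contains no letters
print("".isdigit())             # False
print("abc123".islower())       # True: digits are uncased and are ignored
print("123".islower())          # False: there is no cased character at all
print("Hello World".isalpha())  # False: the space is not a letter
print("-5".isdigit(), "3.14".isdigit())  # False False: sign and dot are not digits
```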
Checks of numeric/digits
These can be hard to tell apart at first, so the following comparison shows the difference between three related methods (isdecimal(), isdigit(), isnumeric()):
isdecimal()
Accepts: decimal digits only (0123456789)
'123'.isdecimal(): True
'1₂34⁵'.isdecimal(): False
'½¾⅚'.isdecimal(): False
'一二三四五'.isdecimal(): False
isdigit()
Accepts: decimal digits (0123456789), super- and subscripts
'123'.isdigit(): True
'1₂34⁵'.isdigit(): True
'½¾⅚'.isdigit(): False
'一二三四五'.isdigit(): False
isnumeric()
Accepts: decimal digits (0123456789), super- and subscripts, vulgar fractions, numeric Unicode characters from other languages
'123'.isnumeric(): True
'1₂34⁵'.isnumeric(): True
'½¾⅚'.isnumeric(): True
'一二三四五'.isnumeric(): True
The playground code to test these methods on those strings:
🪄 Code:
📟 Output:
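One possible playground, sketched here with a hypothetical `results` dictionary that maps each sample to its three answers:

```python
samples = ["123", "1₂34⁵", "½¾⅚", "一二三四五"]

# For each sample: (isdecimal, isdigit, isnumeric)
results = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}

for sample, (dec, dig, num) in results.items():
    print(f"{sample!r}: isdecimal={dec}, isdigit={dig}, isnumeric={num}")
```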
Checking for space-containing strings
To check whether a string consists only of whitespace, use .isspace()
🪄 Code:
📟 Output:
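A quick sketch with a few sample strings:

```python
print(" ".isspace())      # True
print(" \t\n".isspace())  # True: tabs and newlines count as whitespace
print("".isspace())       # False: an empty string has no characters to check
print(" a ".isspace())    # False
```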
A slightly hackish way to check that a string contains only letters and spaces is to remove the spaces with .replace() first
🪄 Code:
📟 Output:
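The trick might look like this. The sample phrase is arbitrary:

```python
phrase = "Monty Python"

print(phrase.isalpha())                   # False: the space is not a letter
print(phrase.replace(" ", "").isalpha())  # True once the spaces are removed
```

Note that this only strips plain spaces; tabs or newlines would still make the check fail.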
Stripping - removing whitespaces
.strip() - remove from both the beginning and the end
.rstrip() - remove only from the end
.lstrip() - remove only from the beginning
🪄 Code:
📟 Output:
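A sketch on an arbitrary padded sample; repr() is used so the remaining whitespace stays visible:

```python
raw = "  \t padded text \n"

print(repr(raw.strip()))   # 'padded text'
print(repr(raw.rstrip()))  # '  \t padded text'
print(repr(raw.lstrip()))  # 'padded text \n'
```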
Get a character by index (possible because a string is a sequence)
Indexing starts from 0. Negative indexing means counting from the end so -1 is the last item.
🪄 Code:
📟 Output:
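A minimal sketch of positive and negative indexing, plus the two errors one is likely to meet:

```python
word = "Python"
first = word[0]
last = word[-1]

print(first, last, word[-2])  # P n o

# An index past the end raises IndexError.
try:
    word[6]
except IndexError as exc:
    print("IndexError:", exc)

# Strings are immutable, so item assignment raises TypeError.
try:
    word[0] = "J"
except TypeError as exc:
    print("TypeError:", exc)
```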
Slicing
some_str[start:stop[:step]]
(again, sequence-like syntax) returns part of the sequence.
It returns the items from index start up to, but not including, index stop.
An omitted start or stop defaults to the beginning or the end of the string (the two are swapped when step is negative).
The optional third argument, step, sets the stride between the selected items.
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
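Several typical slices, sketched on one sample word:

```python
word = "Python"

print(word[0:2])    # Py
print(word[2:])     # thon   (stop omitted: run to the end)
print(word[:2])     # Py     (start omitted: begin at 0)
print(word[-3:])    # hon    (negative start counts from the end)
print(word[::2])    # Pto    (every second character)
print(word[1:5:2])  # yh
print(word[::-1])   # nohtyP (negative step reverses the string)
```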
Splitting/joining
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
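A sketch of split(), splitlines() and join() with made-up sample data:

```python
words = "a b  c".split()           # no argument: split on runs of whitespace
fields = "a,b,,c".split(",")       # explicit delimiter keeps empty fields
lines = "one\ntwo\r\nthree".splitlines()
date = "-".join(["2024", "01", "31"])

print(words)   # ['a', 'b', 'c']
print(fields)  # ['a', 'b', '', 'c']
print(lines)   # ['one', 'two', 'three']
print(date)    # 2024-01-31
```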
Concatenation
Adjacent string literals are merged into a single string, even when they are separated by any amount of whitespace. This works only for literals written directly in the source, not for variables or function call results.
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
🪄 Code:
📟 Output:
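A sketch of the behaviour. The `compile()` call is used here only to demonstrate the syntax error without stopping the script:

```python
merged = "spam" "eggs"      # two literals side by side become one string
spaced = "spam"      "eggs" # the amount of whitespace does not matter
print(merged, spaced)       # spameggs spameggs

prefix = "spam"
# A variable next to a literal is not merged; it is a syntax error.
try:
    compile('prefix "eggs"', "<example>", "eval")
except SyntaxError:
    print("SyntaxError: use prefix + 'eggs' instead")

print(prefix + "eggs")      # spameggs
```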
Unicode
ASCII
Originally, the characters available in text data were limited by an encoding standard called ASCII (American Standard Code for Information Interchange). The key word is "American": characters outside the basic Latin alphabet were missing from that table.
ASCII was simple: each character took a single byte (8 bits) of data. Only 7 of those bits encoded the identifier of the character, so ASCII had 128 characters in total (2^7). At the time, 128 characters were thought to be enough, and the remaining bit was used for error checking, for enabling italics, or was simply set to 0.
The 128 ASCII characters are: 26 uppercase letters, 26 lowercase letters, 10 digits, punctuation symbols, some spacing characters, and some nonprintable control codes such as \n (line feed), \r (carriage return), \a (bell), \b (backspace), etc.:
🪄 Code:
📟 Output:
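One way to look at these groups is through the constants of the standard `string` module. This is a sketch and may differ from the original listing:

```python
import string

print(string.ascii_uppercase)  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.ascii_lowercase)  # abcdefghijklmnopqrstuvwxyz
print(string.digits)           # 0123456789
print(string.punctuation)

# Control codes are nonprintable, so show them via repr().
print(repr("\n\r\a\b"))

# Count the nonprintable characters among the 128 ASCII code points.
control_codes = [chr(code) for code in range(128) if not chr(code).isprintable()]
print(len(control_codes))      # 33
```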
In Python we can get the ASCII "index" of a character with the built-in function ord and get the character for that index with the function chr.
🪄 Code:
📟 Output:
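A minimal sketch of the round trip between characters and their codes:

```python
print(ord("A"))            # 65
print(chr(65))             # A
print(ord("a"))            # 97
print(chr(ord("a") + 1))   # b
print(ord("\n"))           # 10 (line feed)
print(chr(ord("z")) == "z")  # True: chr() and ord() are inverses
```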
Note that these functions actually work with the Unicode table (which we will cover in a minute), but the Unicode table begins with ASCII and extends it.
We can get all 128 characters of ASCII:
🪄 Code:
📟 Output:
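A sketch that builds the whole table with chr(). Only the printable part is printed directly, since control codes would garble the output:

```python
ascii_table = [chr(code) for code in range(128)]
print(len(ascii_table))  # 128

# Codes 32..126 are printable; the rest are control codes.
printable = "".join(ch for ch in ascii_table if ch.isprintable())
print(len(printable))    # 95
print(printable)
```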
So, ASCII is great, but with it we can write neither Cyrillic text like ґуґл з'їв яйко-сподівайко nor Swedish words like Surströmming. The reason is simple: the table just doesn't have the code points needed for these characters.
That's why, at some point, other encodings (tables of code points) that use all 8 bits were created:
latin-1, windows-1252
These cover the main Western European languages
latin-2
Central and Eastern European languages
windows-1251, koi8
These cover most Cyrillic-script languages
big5
Traditional Chinese
To encode a Python string into some encoding, use the string method encode(encoding):
🪄 Code:
📟 Output:
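A sketch of encode() with a few of the encodings named above. The sample words are taken from the earlier text and the variable names are illustrative:

```python
text = "Surströmming"

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # "ö" takes two bytes here
print(latin1_bytes)
print(utf8_bytes)
print(len(latin1_bytes), len(utf8_bytes))  # 12 13

# ASCII has no code point for "ö", so encoding fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("ASCII cannot represent this text:", exc.reason)

cyrillic = "яйко"
print(len(cyrillic.encode("windows-1251")))  # 4: one byte per letter
print(len(cyrillic.encode("utf-8")))         # 8: two bytes per letter
```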