Discover Python, Part 3: Explore the Python type hierarchy

Using strings

Unlike many other programming languages, Python does not include a special data type to handle a single character, such as "a" or "z." Instead, it uses a class designed especially for holding sequences of characters. This article introduces the string class and demonstrates different ways in which you can use a string within Python.

Robert Brunner (rb@ncsa.uiuc.edu), Research Scientist, National Center for Supercomputing Applications

Robert J. Brunner is a research scientist at the National Center for Supercomputing Applications and an assistant professor of astronomy at the University of Illinois, Urbana-Champaign. He has published several books, as well as numerous articles and tutorials, on a range of topics.



02 August 2005

In the first article in this series, Discover Python, Part 1: Python's built-in numerical types, I introduced Python's simple built-in numerical data types. If you have ever used another programming language, these data types probably seemed familiar. While I didn't mention it in that article, one obvious difference between Python and many other programming languages, like C or the Java™ programming language, is the absence of a built-in character data type. Because working with text-based data is a common practice, you might be wondering how Python deals with character-based data. Simply put, Python provides an elegant solution by including an immutable collection-based class designed to deal exclusively with sequences of characters.

The string

Creating a string object in Python is easy. You simply place the desired text inside a pair of quotation marks and voila: a new string (see Listing 1). If you're paying attention, you might be confused. After all, there are two types of quotations you can use: single quotation marks (') and double quotation marks ("). Fortunately, Python makes things easy once again. You can use either type of quotation mark to indicate a string in Python, as long as you're consistent. If you start a string with a single quotation mark, you must end with a single quotation mark, and vice versa. If you don't follow this rule, you will get a SyntaxError exception.

Listing 1. Creating a string in Python
>>> sr="Discover Python"
>>> type(sr)
<type 'str'>
>>> sr='Discover Python'
>>> type(sr)
<type 'str'>
>>> sr="Discover Python: It's Wonderful!"       
>>> sr='Discover Python"
  File "<stdin>", line 1
    sr='Discover Python"
                        ^
SyntaxError: EOL while scanning single-quoted string
>>> sr="Discover Python: \
... It's Wonderful!"
>>> print sr
Discover Python: It's Wonderful!

Notice a couple of other important points from Listing 1, in addition to the proper quoting of strings. First, you can mix single and double quotation marks when creating a string, as long as the string uses the same type of quotation mark at the beginning and end. This flexibility allows Python to easily hold normal textual data, which might need to use the single quotation mark for a contracted verb form or to indicate possession, as well as double quotation marks to indicate spoken text.
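As a minimal sketch of this first point, the following lines mix the two kinds of quotation marks. The variable names are made up for illustration, and the backslash escape on the third string is an addition that Listing 1 does not show. The parenthesized print form is used so the sketch also runs under Python 3.

```python
# Double quotes outside let an apostrophe appear unescaped inside.
possessive = "Python's strings"

# Single quotes outside let double quotes appear unescaped inside.
spoken = 'She said "WooHoo" out loud.'

# When both kinds are needed, a backslash escapes the enclosing quote.
both = 'It\'s a "Yowza" moment.'

print(possessive)
print(spoken)
print(both)
```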

Second, if a string is too long for a single line, you can wrap the string using the Python continuation character: the backslash (\). Internally, the newline character is ignored when creating the string, as is shown when the string is printed. You can combine these two features to create strings that contain long passages, as shown in Listing 2.

Listing 2. Creating a long string
>>> passage = 'When using the Python programming language, one must proceed \
... with caution. This is because Python is so easy to use, and can be so \
... much fun. Failure to follow this warning may lead to shouts of \
... "WooHoo" or "Yowza".'
>>> print passage
When using the Python programming language, one must proceed with caution. 
This is because Python is so easy to use, and can be so much fun. 
Failure to follow this warning may lead to shouts of "WooHoo" or "Yowza".

Editor's note: The above example was wrapped to make the page layout properly. Trust us, it appeared originally on one long line.

Notice, however, that when I printed the passage string, the line breaks I typed were gone, making for one very long string. Typically, you use control characters to indicate simple formatting within a string. For example, to indicate that a new line should be started, you can use the newline control character (\n); to indicate that a tab (usually displayed as a jump to the next tab stop) should be inserted, you can use the tab control character (\t), as shown in Listing 3.

Listing 3. Using control characters in a string
>>> passage='\tWhen using the Python programming language, one must proceed\n\
... \twith caution. This is because Python is so easy to use, and\n\
... \tcan be so much fun. Failure to follow this warning may lead\n\
... \tto shouts of "WooHoo" or "Yowza".'
>>> print passage
        When using the Python programming language, one must proceed
        with caution. This is because Python is so easy to use, and
        can be so much fun. Failure to follow this warning may lead
        to shouts of "WooHoo" or "Yowza".
>>> passage=r'\tWhen using the Python programming language, one must proceed\n\
... \twith caution. This is because Python is so easy to use, and\n\
... \tcan be so much fun. Failure to follow this warning may lead\n\
... \tto shouts of "WooHoo" or "Yowza".'
>>> print passage
\tWhen using the Python programming language, one must proceed\n\
\twith caution. This is because Python is so easy to use, and\n\
\tcan be so much fun. Failure to follow this warning may lead\n\
\tto shouts of "WooHoo" or "Yowza".

The first passage in Listing 3 used control characters in the way you would expect. The passage was formatted nicely and is easy to read. The second example, however, was not formatted, because it introduced what is known as a raw string, in which the control characters are not applied. You can always spot a raw string because the starting quotation mark for the string is preceded by an r, which is short for raw.
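The difference between an ordinary string and a raw string can also be seen by comparing lengths. This is a small sketch with made-up variable names (cooked and raw), written so that it runs under Python 3 as well as the Python 2 interpreter used in the listings.

```python
cooked = 'a\tb\nc'    # \t and \n are each a single control character
raw = r'a\tb\nc'      # the backslashes and letters are kept literally

print(cooked)
print(raw)
print(len(cooked))    # 5
print(len(raw))       # 7
```

The raw string is two characters longer because each backslash is stored as a character in its own right.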

I don't know about you, but while workable, creating the passage string this way seemed rather difficult. Surely there must be a better way. True to form, Python provides a much simpler way to create long strings that preserves the formatting you use when creating the string. This technique uses three double quotation marks (or three single quotation marks) to begin and end the long string. Within the string, you can use as many single and double quotation marks as you like (see Listing 4).

Listing 4. Using a triple-quoted string
>>> passage = """
...         When using the Python programming language, one must proceed
...         with caution. This is because Python is so easy to use, and
...         can be so much fun. Failure to follow this warning may lead
...         to shouts of "WooHoo" or "Yowza".
... """
>>> print passage
                
        When using the Python programming language, one must proceed
        with caution. This is because Python is so easy to use, and
        can be so much fun. Failure to follow this warning may lead
        to shouts of "WooHoo" or "Yowza".
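To confirm that a triple-quoted string really stores the line breaks and indentation you type, the following sketch compares one against the same text written with explicit control characters. The names verse and explicit are illustrative only, and the parenthesized print form keeps the sketch runnable under Python 3.

```python
verse = """First line
    second line, indented with "double" and 'single' quotes
third line"""

# The same text written with explicit newline characters.
explicit = 'First line\n    second line, indented with "double" and \'single\' quotes\nthird line'

print(verse)
print(verse == explicit)     # True
print(verse.count('\n'))     # 2
```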

The string as an object

After reading either of the first two articles in this series, one statement should be popping into your head right now: In Python, everything is an object. So far, I've said nothing about the object nature of strings in Python. But true to form, strings in Python are objects. In fact, a string object is an instance of the str class. As you saw in Discover Python, Part 2, the Python interpreter includes a built-in help facility, which, as shown in Listing 5, can provide information on the str class.

Listing 5. Getting help on strings
>>> help(str)
         
Help on class str in module __builtin__:
                    
class str(basestring)
|  str(object) -> string
|  
|  Return a nice string representation of the object.
|  If the argument is a string, the return value is the same object.
|  
|  Method resolution order:
|      str
|      basestring
|      object
|  
|  Methods defined here:
|  
|  __add__(...)
|      x.__add__(y) <==> x+y
|  
...
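The object nature of a string can also be checked directly from code. The sketch below is only an illustration (the variable s is arbitrary); it relies on the __add__ equivalence that the help output above reports and runs under Python 3 as well.

```python
s = "Discover Python"

print(type(s) is str)               # True: a literal is an instance of str
print(isinstance(s, object))        # True: like everything else, an object
print(s.__add__("!") == s + "!")    # True: the + operator calls __add__
print(hasattr(str, "upper"))        # True: methods are defined on the class
```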

The strings I've been creating using the single, double, or triple quotation mark syntax are still string objects. But you can also explicitly create a string object by using the str class constructor, as shown in Listing 6. The constructor can take a simple built-in numerical type or character data. Either way, the input is changed into a new string object.

Listing 6. Creating strings
>>> str("Discover python")
'Discover python'
>>> str(12345)
'12345'
>>> str(123.45)
'123.45'
>>> "Wow," + " that " + "was awesome."
'Wow, that was awesome.'
>>> "Wow,"" that ""was Awesome"
'Wow, that was Awesome'
>>> "Wow! "*5
'Wow! Wow! Wow! Wow! Wow! '
>>> sr = str("Hello ")
>>> id(sr)
5560608
>>> sr += "World"
>>> sr
'Hello World'
>>> id(sr)
3708752

The examples in Listing 6 also demonstrate several other important points regarding Python strings. First, you can create a new string by adding other strings together, either using the + operator or by just placing string literals next to each other, each with its own matching quotes. Second, if you need to repeat a small string to create a bigger string, you can use the * operator, which repeats a string a set number of times. At the start of this article, I said that in Python, a string is an immutable sequence of characters. The last few lines of the listing demonstrate this: I first create a string and then appear to modify it by appending additional characters. As you can see from the output of the two calls to the built-in id function, a new string object was created to hold the result of adding text to the original string.
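Because id values differ from one run to the next, another way to see the same effect is to keep a second name bound to the original object. The following is a minimal sketch with illustrative names (greeting and alias). It assumes Python 3 or a late Python 2 release, since it uses the "except ... as" form.

```python
greeting = "Hello "
alias = greeting              # a second name bound to the same object

greeting += "World"           # builds a new string and rebinds greeting

print(greeting)               # Hello World
print(alias)                  # Hello  (the original object is unchanged)
print(greeting is alias)      # False

# Item assignment is rejected because a string cannot change in place.
try:
    alias[0] = "J"
except TypeError as exc:
    print("TypeError: %s" % exc)

print("Wow, " "that " "works")   # adjacent literals are concatenated
print("ab" * 3)                  # ababab
```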

The str class contains a large number of useful methods for manipulating strings. Discussing all of them here would quickly become rather tedious; besides, you can always use the built-in help facility for that. Instead, let's look at four methods that are useful in their own right and demonstrate the utility of the rest of the str class methods. Listing 7 demonstrates the upper, lower, split, and join methods.

Listing 7. String methods
>>> sr = "Discover Python!"
>>> sr.upper()
'DISCOVER PYTHON!'
>>> sr.lower()
'discover python!'
>>> sr = "This is a test!"
>>> sr.split()
['This', 'is', 'a', 'test!']
>>> sr = '0:1:2:3:4:5:6:7:8:9'
>>> sr.split(':')
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>> sr=":"
>>> tp = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
>>> sr.join(tp)
'0:1:2:3:4:5:6:7:8:9'

The first two methods -- upper and lower -- are easy to understand. They simply return a copy of the string converted to all uppercase or all lowercase letters, respectively. The split method is useful because it splits a string into a list of smaller strings, using a separator string as an indicator of where to chop. So, the first split method example splits the string "This is a test!" using the default separator, which is any run of whitespace characters (spaces, tabs, and newline characters). The second split method example demonstrates using a different separator -- in this case, a colon -- to split a string into a list of strings. The last example shows how to use the join method, which is the opposite of the split method, to make a big string from a sequence of smaller strings. In this case, I join together a sequence of single-character strings contained in a tuple using the colon character.
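The relationship between split and join can be sketched as a round trip: splitting on a colon and joining with a colon gives back the original text. The names record, fields, rebuilt, and words are illustrative, and the parenthesized print form keeps the sketch runnable under Python 3.

```python
record = "0:1:2:3:4"

fields = record.split(":")         # ['0', '1', '2', '3', '4']
rebuilt = ":".join(fields)         # back to the original text

print(fields)
print(rebuilt == record)           # True

# With no argument, split breaks the string on whitespace.
words = "This  is\ta\ntest!".split()
print(words)                       # ['This', 'is', 'a', 'test!']
print(" ".join(words).upper())     # THIS IS A TEST!
```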


The string as a container for characters

At the beginning of this article, I said (in a rather long-winded manner) that a string in Python is an immutable sequence of characters. Part 2 of this series introduced the tuple, which is also an immutable sequence. The tuple supports accessing elements in the sequence using index notation, chopping out elements from the sequence using slices, and creating new tuples using a specific slice or by adding together different slices. Given that background, you might wonder if the same tricks can be applied to the Python string. As shown in Listing 8, the answer is an obvious "Yes."

Listing 8. Treating a string as a sequence
>>> sr="0123456789"
>>> sr[0]
'0'
>>> sr[1] + sr[0]    
'10'
>>> sr[4:8]     # Give me elements four through seven, inclusive
'4567'
>>> sr[:-1]     # Give me all elements but the last one
'012345678'
>>> sr[1:12]    # Slice more than you can chew, no problem
'123456789'
>>> sr[:-20]    # Go before the start?
''
>>> sr[12:]     # Go past the end?
''
>>> sr[0] + sr[1:5] + sr[5:9] + sr[9]
'0123456789'
>>> sr[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>> len(sr)     # Sequences share common operations, like getting the length
10

Treating a string as a sequence of characters in Python is simple. You can grab a single element, add different elements together, slice out several elements, and even add together different slices. One very useful feature of slicing is that slicing too much, going before the start or past the end, doesn't raise an exception but simply clamps to the start or end of the sequence, as appropriate. In contrast, if you try to access a single element with an index outside the allowed range, you get an IndexError exception. This behavior demonstrates why the built-in len function is so important.
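One way to put that observation to work is to check an index against the string's length before using it. The helper below, char_at, is hypothetical and not part of Python or of the original listings. It is only a sketch of the idea, written so that it runs under Python 3 as well.

```python
digits = "0123456789"

print(digits[4:8])      # 4567
print(digits[1:12])     # 123456789  (the stop is clamped to the end)
print(digits[12:])      # empty string, no exception


def char_at(text, index, default=""):
    """Return text[index], or default when index is out of range."""
    # Hypothetical helper: uses len() to guard the single-element lookup.
    if -len(text) <= index < len(text):
        return text[index]
    return default


print(char_at(digits, 9))         # 9
print(char_at(digits, 10, "?"))   # ?
```

Slices need no such guard, which is why the first three lines run without any checks.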


The string: A powerful tool

In this article, I introduced the Python string, which is an immutable sequence of characters. You can easily create strings in Python by using several techniques, including using single or double quotation marks or, for more flexibility, using a set of three quotation marks (the triple quote). Given that everything in Python is an object, you can use the underlying str class methods to gain additional power or use the string's sequence functionality directly.
