mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 15:00:18 +00:00
933dc9459a
<!DOCTYPE html>
|
|
<head>
|
|
<meta charset=utf-8>
|
|
<title>Strings - Dive into Python 3</title>
|
|
<link rel=stylesheet type=text/css href=dip3.css>
|
|
<style>
|
|
body{counter-reset:h1 3}
|
|
.s{text-decoration:line-through}
|
|
</style>
|
|
</head>
|
|
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
|
|
|
<h1>Strings</h1>
|
|
<blockquote class=q>
|
|
<p><span>❝</span> I’m telling you this ’cause you’re one of my friends.<br>
|
|
My alphabet starts where your alphabet ends! <span>❞</span><br>— Dr. Seuss, On Beyond Zebra!
|
|
</blockquote>
|
|
<p id=toc>
|
|
<h2 id=boring-stuff>Some boring stuff you need to understand before you can dive in</h2>
|
|
<p class=f>Did you know that the people of <a href="http://en.wikipedia.org/wiki/Bougainville_Province">Bougainville</a> have the smallest alphabet in the world? Their <a href="http://en.wikipedia.org/wiki/Rotokas_alphabet">Rotokas alphabet</a> is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of <i class=baa>!@#$%&</i> punctuation marks.
|
|
|
|
<p>When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.
|
|
|
|
<p>In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish.
|
|
|
|
<p>Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.
|
|
|
|
<p>There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the <abbr>ASCII</abbr> encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, <i class=baa>&</i>c.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.
|
|
|
|
<p>Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks, like the <code>ñ</code> character in Spanish. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with <abbr>ASCII</abbr> in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), <i class=baa>&</i>c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
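<p>If you want to check those numbers yourself, here is a minimal sketch using Python 3's built-in <code>ord()</code> function and the <code>encode()</code> method. The variable names are made up for illustration, and the accented letters are written as escapes so the snippet does not depend on how your editor saves the file.

```python
# ASCII assigns each English letter a number below 128.
capital_a = ord("A")      # 65
lowercase_a = ord("a")    # 97

# CP-1252 keeps the ASCII values and adds accented letters in 128-255.
# "\u00f1" is n-with-a-tilde, "\u00fc" is u-with-two-dots.
n_tilde = "\u00f1".encode("cp1252")
u_umlaut = "\u00fc".encode("cp1252")

# Each accented letter is stored as exactly one byte.
print(capital_a, lowercase_a, n_tilde[0], u_umlaut[0])
```

<p>The last line prints the four numbers mentioned in the text: 65, 97, 241, and 252.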
|
|
|
|
<p>Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each “character” is represented by a two-byte number from 0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It’s just that the range of numbers is broader, because there are many more characters to represent.
|
|
|
|
<p>That was mostly OK in a non-networked world, where “text” was something you typed yourself and occasionally printed. There wasn’t much “plain text”. Source code was <abbr>ASCII</abbr>, and everyone else used word processors, which defined their own (non-text) formats that tracked character encoding information along with rich styling, <i class=baa>&</i>c. People read these documents with the same word processing program as the original author, so everything worked, more or less.
|
|
|
|
<p>Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the globe, being authored on one computer, transmitted through a second computer, and received and displayed by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of “plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable characters. A missing decryption key means garbled text, gibberish, or worse.
|
|
|
|
<p>Now think about trying to store multiple pieces of text in the same place, like in the same database table that holds all the email you’ve ever received. You still need to store the character encoding alongside each piece of text so you can display it properly. Think that’s hard? Try searching your email database, which means converting between multiple encodings on the fly. Doesn’t that sound fun?
|
|
|
|
<p>Now think about the possibility of multilingual documents, where characters from several languages are next to each other in the same document. (Hint: programs that tried to do this typically used escape codes to switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means this character; poof, now you’re in Mac Greek mode, so 241 means some other character.) And of course you’ll want to search <em>those</em> documents, too.
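<p>To see the “same number, different character” problem in action, here is a small sketch that decodes the single byte 241 three different ways. The codec names <code>cp1252</code>, <code>koi8_r</code>, and <code>mac_greek</code> are the ones shipped with Python 3; the variable names are just for illustration.

```python
raw = bytes([241])  # one byte holding the number 241

# The same byte, read with three different "decryption keys".
as_cp1252 = raw.decode("cp1252")        # Western European reading
as_koi8_r = raw.decode("koi8_r")        # Russian reading
as_mac_greek = raw.decode("mac_greek")  # Mac Greek reading

for name, text in [("cp1252", as_cp1252),
                   ("koi8_r", as_koi8_r),
                   ("mac_greek", as_mac_greek)]:
    # ascii() shows the character as an escape, so this prints on any terminal.
    print(name, ascii(text))
```

<p>Each decoding yields a different character, which is exactly why the bytes alone are not enough.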
|
|
|
|
<p>Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such thing as “plain text.”
|
|
|
|
<h2 id=one-ring-to-rule-them-all>Unicode</h2>
|
|
|
|
<p><i>Enter Unicode.</i>
|
|
|
|
<p>Unicode is a system designed to represent <em>every</em> character from <em>every</em> language. Unicode assigns each letter, character, or ideograph a number called a <i>code point</i>, in the range 0–1114111 (<code>U+0000</code> through <code>U+10FFFF</code>). The most direct way to store a code point is as a 4-byte number. Each assigned number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. <code>U+0041</code> is always <code>'A'</code>, even if your language doesn't have an <code>'A'</code> in it.
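<p>A quick sketch of the one-number-per-character idea, using Python 3's built-in <code>ord()</code> and <code>chr()</code> functions. The character at <code>0x1F600</code> is just an arbitrary pick to show a number above 65535; the variable names are illustrative.

```python
# U+0041 is always 'A', however you spell it.
letter = "\u0041"          # escape sequence for code point U+0041
code_point = ord("A")      # character -> number
same_letter = chr(0x41)    # number -> character

# Some characters have numbers too big to fit in 2 bytes.
big_character = chr(0x1F600)
big_number = ord(big_character)

print(letter, hex(code_point), same_letter, big_number > 65535)
```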
|
|
|
|
<p>Right away, problems leap out at you. 4 bytes? For every single character<span title="interrobang!">‽</span> That seems awfully wasteful, especially for English and Spanish, which need fewer than 256 numbers to express every possible character. [FIXME incomplete paragraph]
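<p>To put rough numbers on “wasteful,” here is a sketch that measures how many bytes one character takes in three Unicode encodings that Python 3 ships with. The helper <code>encoded_sizes()</code> is a made-up name for this illustration, not something from the chapter's example code.

```python
def encoded_sizes(character):
    """Return the number of bytes `character` needs in each encoding."""
    encodings = ("utf-32-be", "utf-16-be", "utf-8")
    return {name: len(character.encode(name)) for name in encodings}

# An English letter, a Spanish letter, and a Chinese ideograph.
for character in ("A", "\u00f1", "\u4e2d"):
    print(ascii(character), encoded_sizes(character))
```

<p>The 4-byte encoding spends 4 bytes on everything, while the other two spend fewer bytes on the characters that English and Spanish actually use.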
|
|
|
|
<p>Of course, there is still the matter of all those legacy encoding systems. [FIXME incomplete paragraph]
|
|
|
|
<p>[FIXME stuff about UTF-32, UTF-16, and finally UTF-8]
|
|
|
|
<p>[FIXME FIXME FIXME, damn it!]
|
|
|
|
<div class=s title="ignore this, it's just notes for myself">
|
|
<p>UTF-8 uses the same characters as 7-bit <abbr>ASCII</abbr> for 0 through 127
|
|
|
|
|
|
|
|
|
|
<p>When dealing with Unicode data, you may at some point need to convert the data back into one of these other legacy encoding systems. For instance, you might need to integrate with some other computer system which expects its data in a specific 1-byte encoding scheme, or to print it to a non-Unicode-aware terminal or printer.
|
|
|
|
|
|
|
|
|
|
FIXME: update for Python 3
|
|
|
|
<p>Python has had Unicode support throughout the language since version 2.0. The <abbr>XML</abbr> package uses Unicode to store all parsed <abbr>XML</abbr> data, but you can use Unicode anywhere.
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>s = u'Dive in'</kbd> <span>①</span>
|
|
<samp class=p>>>> </samp><kbd>s</kbd>
|
|
u'Dive in'
|
|
<samp class=p>>>> </samp><kbd>print s</kbd> <span>②</span>
|
|
Dive in</pre>
|
|
<ol>
|
|
<li>To create a Unicode string instead of a regular <abbr>ASCII</abbr> string, add the letter “<code>u</code>” before the string. Note that this particular string doesn't have any non-<abbr>ASCII</abbr> characters. That's fine; Unicode is a superset of <abbr>ASCII</abbr> (a very large superset at that), so any regular <abbr>ASCII</abbr> string can also be stored as Unicode.
|
|
<li>When printing a string, Python will attempt to convert it to your default encoding, which is usually <abbr>ASCII</abbr>. (More on this in a minute.) Since this Unicode string is made up of characters that are also <abbr>ASCII</abbr> characters, printing it has the same result as printing a normal <abbr>ASCII</abbr> string; the conversion is seamless, and if you didn't know that <var>s</var> was a Unicode string, you'd never notice the difference.
|
|
</ol>
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>s = u'La Pe\xf1a'</kbd> <span>①</span>
|
|
<samp class=p>>>> </samp><kbd>print s</kbd> <span>②</span>
|
|
<samp class=traceback>Traceback (innermost last):
|
|
File "<interactive input>", line 1, in ?
|
|
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
|
|
<samp class=p>>>> </samp><kbd>print s.encode('latin-1')</kbd> <span>③</span>
|
|
La Peña</pre>
|
|
<ol>
|
|
<li>The real advantage of Unicode, of course, is its ability to store non-<abbr>ASCII</abbr> characters, like the Spanish “<code>ñ</code>” (<code>n</code> with a tilde over it). The Unicode character code for the tilde-n is <code>0xf1</code> in hexadecimal (241 in decimal), which you can type like this: <code>\xf1</code>.
|
|
<li>Remember I said that the <code>print</code> function attempts to convert a Unicode string to <abbr>ASCII</abbr> so it can print it? Well, that's not going to work here, because your Unicode string contains non-<abbr>ASCII</abbr> characters, so Python raises a <samp>UnicodeError</samp> error.
|
|
<li>Here's where the conversion-from-Unicode-to-other-encoding-schemes comes in. <var>s</var> is a Unicode string, but <code>print</code> can only print a regular string. To solve this problem, you call the <code>encode</code> method, available on every Unicode string, to convert the Unicode string to a regular string in the given encoding scheme, which you pass as a parameter. In this case, you're using <code>latin-1</code> (also known as <code>iso-8859-1</code>), which includes the tilde-n (whereas the default <abbr>ASCII</abbr> encoding scheme did not, since it only includes characters numbered 0 through 127).
|
|
</ol>
|
|
</div>
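<p>The struck-through notes above were written for Python 2. As a rough sketch of how the same “La Peña” experiment looks in Python 3, where every string is already Unicode and <code>encode()</code> returns a <code>bytes</code> object, you could try something like this. The variable names are illustrative.

```python
s = "La Pe\xf1a"   # no u prefix needed; \xf1 is n-with-a-tilde

# ASCII has no number for the tilde-n, so encoding to ASCII fails.
ascii_error = None
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

# latin-1 stores the tilde-n in one byte; UTF-8 uses two.
latin1_bytes = s.encode("latin-1")
utf8_bytes = s.encode("utf-8")

# Decoding with the same encoding gets the original string back.
round_trip = latin1_bytes.decode("latin-1")

print(latin1_bytes, utf8_bytes, round_trip == s)
```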
|
|
|
|
<h2 id=divingin>Diving in</h2>
|
|
|
|
<p>Let's take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
|
|
|
|
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
|
|
<pre><code>
<a>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'], <span>①</span></a>
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<a>    """Convert a file size to human-readable form. <span>②</span></a>

    Keyword arguments:
    size -- file size in bytes
    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                                if False, use multiples of 1000

    Returns: string

<a>    """ <span>③</span></a>
    if size < 0:
<a>        raise ValueError('number must be non-negative') <span>④</span></a>

    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
<a>            return "{0:.1f} {1}".format(size, suffix) <span>⑤</span></a>

    raise ValueError('number too large')</code></pre>
|
|
<ol>
|
|
<li><code>'KB'</code>, <code>'MB'</code>, <code>'GB'</code>… those are each strings. Python strings can be defined with either single quotes (<code>'</code>) or double quotes (<code>"</code>).<!--"-->
|
|
<li>Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start and end the string.
|
|
<li>These three-in-a-row quotes end the docstring.
|
|
<li>There's another string, being passed to the exception as a human-readable error message.
|
|
<li>There's a… whoa, what the heck is that?
|
|
</ol>
|
|
|
|
<h2 id=formatting-strings>Formatting strings</h2>
|
|
|
|
<p>Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>username = "mark"</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>password = "PapayaWhip"</kbd> <span>①</span></a>
|
|
<a><samp class=p>>>> </samp><kbd>"{0}'s password is {1}".format(username, password)</kbd> <span>②</span></a>
|
|
<samp>"mark's password is PapayaWhip"</samp></pre>
|
|
<ol>
|
|
<li>No, my password is not really <kbd>PapayaWhip</kbd>.
|
|
<li>There's a lot going on here. First, that's a method call on a string literal. <em>Strings are objects</em>, and objects have methods. Second, the whole expression evaluates to a string. Third, <code>{0}</code> and <code>{1}</code> are <i>replacement fields</i>, which are replaced by the arguments passed to the <code>format()</code> method.
|
|
</ol>
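<p>Replacement fields don't have to appear in order, and you can use the same one more than once. Here is a short sketch; the strings other than the first are made up for illustration.

```python
username = "mark"
password = "PapayaWhip"

# Fields in argument order, as in the example above.
in_order = "{0}'s password is {1}".format(username, password)

# Fields out of order: {1} still means the second argument.
swapped = "{1} belongs to {0}".format(username, password)

# The same field used three times.
repeated = "{0}, {0}, {0}!".format(username)

print(in_order)
print(swapped)
print(repeated)
```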
|
|
|
|
<h3 id=compound-field-names>Compound field names</h3>
|
|
|
|
<p>The previous example shows the simplest case, where the replacement fields are simply integers. Integer replacement fields are treated as positional indices into the argument list of the <code>format()</code> method. That means that <code>{0}</code> is replaced by the first argument (<var>username</var> in this case), <code>{1}</code> is replaced by the second argument (<var>password</var>), <i class=baa>&</i>c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But replacement fields are much more powerful than that.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>import humansize</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>si_suffixes = humansize.SUFFIXES[1000]</kbd> <span>①</span></a>
|
|
<samp class=p>>>> </samp><kbd>si_suffixes</kbd>
|
|
<samp>['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']</samp>
|
|
<a><samp class=p>>>> </samp><kbd>"1000{0[0]} = 1{0[1]}".format(si_suffixes)</kbd> <span>②</span></a>
|
|
<samp>'1000KB = 1MB'</samp>
|
|
</pre>
|
|
<ol>
|
|
<li>Rather than calling any function in the <code>humansize</code> module, you're just grabbing one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes.
|
|
<li>This looks complicated, but it's not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list that was passed as the first argument to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces (including <code>1000</code>, the equals sign, and the spaces) is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
|
|
</ol>
|
|
|
|
<p>What this example shows is that <em>format specifiers can access items and properties of data structures using (almost) Python syntax</em>. These are called <i>compound field names</i>. The following compound field names "just work":
|
|
|
|
<ul>
|
|
<li>Passing a list, and accessing an item of the list by index (as in the previous example)
|
|
<li>Passing a dictionary, and accessing a value of the dictionary by key
|
|
<li>Passing a module, and accessing its variables and functions by name
|
|
<li>Passing a class instance, and accessing its properties and methods by name
|
|
<li><em>Any combination of the above</em>
|
|
</ul>
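<p>Here is a sketch with one small example for each kind of compound field name in the list above. The <code>Account</code> class and the <var>sizes</var> dictionary are hypothetical, invented only for this illustration; <code>string.digits</code> is a real attribute of the standard <code>string</code> module.

```python
import string

class Account:
    """Hypothetical class used only to demonstrate attribute access."""
    def __init__(self, owner, quota):
        self.owner = owner
        self.quota = quota

suffixes = ['KB', 'MB']
sizes = {"small": 10, "large": 9000}
account = Account("mark", 1024)

# A list item by index.
by_index = "1000{0[0]} = 1{0[1]}".format(suffixes)

# A dictionary value by key (no quotes around the key inside the field).
by_key = "large means {0[large]} bytes".format(sizes)

# A module variable by name.
by_module = "the digits are {0.digits}".format(string)

# Instance properties by name.
by_instance = "{0.owner} may store {0.quota} bytes".format(account)

print(by_index, by_key, by_module, by_instance, sep="\n")
```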
|
|
|
|
<p>Just to blow your mind, here's an example that combines all of the above:
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>import humansize</kbd>
|
|
<samp class=p>>>> </samp><kbd>import sys</kbd>
|
|
<samp class=p>>>> </samp><kbd>"1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}".format(sys)</kbd>
|
|
<samp>'1MB = 1000KB'</samp></pre>
|
|
|
|
<p>Here's how it works:
|
|
|
|
<ul>
|
|
<li>The <code>sys</code> module holds information about the currently running Python instance. Since you just imported it, you can pass the <code>sys</code> module itself as an argument to the <code>format()</code> method. So the replacement field <code>{0}</code> refers to the <code>sys</code> module.
|
|
<li><code>sys.modules</code> is a dictionary of all the modules that have been imported in this Python instance. The keys are the module names as strings; the values are the module objects themselves. So the replacement field <code>{0.modules}</code> refers to the dictionary of imported modules.
|
|
<li><code>sys.modules["humansize"]</code> is the <code>humansize</code> module which you just imported. The replacement field <code>{0.modules[humansize]}</code> refers to the <code>humansize</code> module. Note the slight difference in syntax here. In real Python code, the keys of the <code>sys.modules</code> dictionary are strings; to refer to them, you need to put quotes around the module name (<i>e.g.</i> <code>"humansize"</code>). But within a replacement field, you skip the quotes around the dictionary key name (<i>e.g.</i> <code>humansize</code>).
|
|
<li><code>sys.modules["humansize"].SUFFIXES</code> is the dictionary defined at the top of the <code>humansize</code> module. The replacement field <code>{0.modules[humansize].SUFFIXES}</code> refers to that dictionary.
|
|
<li><code>sys.modules["humansize"].SUFFIXES[1000]</code> is a list of <abbr>SI</abbr> suffixes: <code>['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']</code>. So the replacement field <code>{0.modules[humansize].SUFFIXES[1000]}</code> refers to that list.
|
|
<li><code>sys.modules["humansize"].SUFFIXES[1000][0]</code> is the first item of the list of <abbr>SI</abbr> suffixes: <code>'KB'</code>. Therefore, the complete replacement field <code>{0.modules[humansize].SUFFIXES[1000][0]}</code> is replaced by the two-character string <code>KB</code>.
|
|
</ul>
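<p>The steps above can be followed one at a time in ordinary Python. So that this sketch runs even without <code>humansize.py</code> on your import path, it registers a stand-in module under the invented name <code>humansize_demo</code>; that name, and the shortened suffix lists, are assumptions made purely for this example.

```python
import sys
import types

# Hypothetical stand-in for the humansize module, registered by hand.
demo = types.ModuleType("humansize_demo")
demo.SUFFIXES = {1000: ['KB', 'MB'], 1024: ['KiB', 'MiB']}
sys.modules["humansize_demo"] = demo

# Each step of the compound field name, written as real Python.
all_modules = sys.modules                  # {0.modules}
the_module = all_modules["humansize_demo"] # {0.modules[humansize_demo]}
suffix_dict = the_module.SUFFIXES          # ....SUFFIXES
si_list = suffix_dict[1000]                # ....SUFFIXES[1000]
first_suffix = si_list[0]                  # ....SUFFIXES[1000][0]

# The same lookup done entirely inside a replacement field.
formatted = "1MB = 1000{0.modules[humansize_demo].SUFFIXES[1000][0]}".format(sys)

print(first_suffix, formatted)
```

<p>Notice that the real Python code needs quotes around the dictionary key, while the replacement field does not.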
|
|
|
|
<h3 id=format-specifiers>Format specifiers</h3>
|
|
|
|
<p>But wait! There's more! Let's take another look at that strange line of code from <code>humansize.py</code>:
|
|
|
|
<pre><code>if size < multiple:
    return "{0:.1f} {1}".format(size, suffix)</code></pre>
|
|
|
|
<p><code>{1}</code> is replaced with the second argument passed to the <code>format()</code> method, which is <var>suffix</var>. But what is <code>{0:.1f}</code>? It's two things: <code>{0}</code>, which you recognize, and <code>:.1f</code>, which you don't. The second half (including and after the colon) defines the <i>format specifier</i>, which further refines how the replaced variable should be formatted.
|
|
|
|
<blockquote class="note compare clang">
|
|
<p><span>☞</span>Format specifiers allow you to munge the replacement text in a variety of useful ways, like the <code>printf()</code> function in C. You can add zero- or space-padding, align strings, control decimal precision, and even convert numbers to hexadecimal.
|
|
</blockquote>
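<p>Here is a brief sketch of the kinds of munging mentioned in the note: padding, alignment, decimal precision, and hexadecimal. The values being formatted are arbitrary, and the variable names are only for illustration.

```python
# Zero-padding: total width 8, two digits after the decimal point.
zero_padded = "{0:08.2f}".format(3.14159)

# Space-padding: same width, padded with spaces instead.
space_padded = "{0:8.2f}".format(3.14159)

# Alignment within a 6-character field.
right_aligned = "{0:>6}".format("GB")
left_aligned = "{0:<6}".format("GB")
centered = "{0:^6}".format("GB")

# Decimal precision and hexadecimal conversion.
one_decimal = "{0:.1f}".format(1234.5678)
hexadecimal = "{0:x}".format(255)

for result in (zero_padded, space_padded, right_aligned,
               left_aligned, centered, one_decimal, hexadecimal):
    print(repr(result))
```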
|
|
|
|
<p>Within a replacement field, a colon (<code>:</code>) marks the start of the format specifier. The format specifier “<code>.1</code>” means “round to the nearest tenth” (<i>i.e.</i> display only one digit after the decimal point). The format specifier “<code>f</code>” means “fixed-point number” (as opposed to exponential notation or some other decimal representation). Thus, given a <var>size</var> of <code>698.24</code> and <var>suffix</var> of <code>'GB'</code>, the formatted string would be <code>'698.2 GB'</code>, because <code>698.24</code> gets rounded to one decimal place, then the suffix is appended after the number.

<pre class=screen>
<samp class=p>>>> </samp><kbd>"{0:.1f} {1}".format(698.24, 'GB')</kbd>
<samp>'698.2 GB'</samp></pre>
|
|
|
|
<p>For all the gory details on format specifiers, consult the <a href="http://docs.python.org/dev/3.0/library/string.html#format-specification-mini-language">Format Specification Mini-Language</a> in the official Python documentation.
|
|
|
|
<h2 id=common-string-methods>Other common string methods</h2>
|
|
|
|
<p>Besides formatting, strings can do a number of other useful tricks.
|
|
|
|
<pre class=screen>
|
|
<a><samp class=p>>>> </samp><kbd>s = """Finished files are the re-</kbd> <span>①</span></a>
|
|
<samp class=p>... </samp><kbd>sult of years of scientif-</kbd>
|
|
<samp class=p>... </samp><kbd>ic study combined with the</kbd>
|
|
<samp class=p>... </samp><kbd>experience of years."""</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>s.splitlines()</kbd> <span>②</span></a>
|
|
<samp>['Finished files are the re-',
|
|
'sult of years of scientif-',
|
|
'ic study combined with the',
|
|
'experience of years.']</samp>
|
|
<a><samp class=p>>>> </samp><kbd>print(s.lower())</kbd> <span>③</span></a>
|
|
<samp>finished files are the re-
|
|
sult of years of scientif-
|
|
ic study combined with the
|
|
experience of years.</samp>
|
|
<a><samp class=p>>>> </samp><kbd>s.lower().count("f")</kbd> <span>④</span></a>
|
|
<samp>6</samp></pre>
|
|
<ol>
|
|
<li>You can input multi-line strings in the Python interactive shell. Once you start a multi-line string with triple quotation marks, just hit <kbd>ENTER</kbd> and the interactive shell will prompt you to continue the string. Typing the closing triple quotation marks ends the string, and the next <kbd>ENTER</kbd> will execute the command (in this case, assigning the string to <var>s</var>).
|
|
<li>The <code>splitlines()</code> method takes one multi-line string and returns a list of strings, one for each line of the original. Note that the carriage returns at the end of each line are not included.
|
|
<li>The <code>lower()</code> method converts the entire string to lowercase. (Similarly, the <code>upper()</code> method converts a string to uppercase.)
|
|
<li>The <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence!
|
|
</ol>
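<p>A few variations on the same methods, as a sketch you can paste into a file instead of typing at the interactive shell. It rebuilds the same four-line string with explicit <code>\n</code> characters; the variable names are illustrative.

```python
s = ("Finished files are the re-\n"
     "sult of years of scientif-\n"
     "ic study combined with the\n"
     "experience of years.")

# splitlines() drops the line endings by default...
lines = s.splitlines()

# ...but keeps them if you ask.
lines_with_endings = s.splitlines(keepends=True)

# upper() is the mirror image of lower().
shouted = s.upper()

# count() is case-sensitive, which is why the example lowercases first.
lowercase_f_only = s.count("f")
any_f = s.lower().count("f")

print(len(lines), lowercase_f_only, any_f)
```

<p>Without the call to <code>lower()</code>, the capital “F” in “Finished” would not be counted.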
|
|
|
|
<!--
|
|
<p>What else can strings do? Here's a common idiom I use for getting bits of data out of semi-structured strings.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>import subprocess</kbd>
|
|
<samp class=p>>>> </samp><kbd>df = subprocess.getoutput('df -x tmpfs')</kbd>
|
|
<samp class=p>>>> </samp><kbd>print(df)</kbd>
|
|
<samp>Filesystem 1K-blocks Used Available Use% Mounted on
|
|
/dev/sda1 461215812 73256908 364529712 17% /
|
|
/dev/sdb1 721075720 620495832 63951288 91% /backup</samp>
|
|
<samp class=p>>>> </samp><kbd>
|
|
-->
|
|
|
|
<!--
|
|
['capitalize', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
|
|
-->
|
|
|
|
<!--
|
|
<p>[FIXME is it worth keeping this section on joining lists / splitting strings? All the examples are from an old code sample that isn't used at all anymore.]
|
|
|
|
<div class=s>
|
|
<p>You have a list of key-value pairs in the form <code><var>key</var>=<var>value</var></code>, and you want to join them into a single string. To join any list of strings into a single string, use the <code>join</code> method of a string object.
|
|
|
|
<p>Here is an example of joining a list from the <code>buildConnectionString</code> function:
|
|
|
|
<pre><code>return ";".join(["%s=%s" % (k, v) for k, v in params.items()])</code></pre>
|
|
|
|
<p>One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
|
|
is an object. You might have thought I meant that string <em>variables</em> are objects. But no, look closely at this example and you'll see that the string <code>";"</code> itself is an object, and you are calling its <code>join</code> method.
|
|
<p>The <code>join</code> method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
|
|
|
|
|
|
|
|
|
|
<code>join</code> works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception.
|
|
|
|
|
|
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
|
|
<samp class=p>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd>
|
|
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
|
<samp class=p>>>> </samp><kbd>";".join(["%s=%s" % (k, v) for k, v in params.items()])</kbd>
|
|
'server=mpilgrim;uid=sa;database=master;pwd=secret'</pre>
|
|
|
|
<p>This string is then returned from the <code>odbchelper</code> function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.
|
|
|
|
<p>You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called <code>split</code>.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</kbd>
|
|
<samp class=p>>>> </samp><kbd>s = ";".join(li)</kbd>
|
|
<samp class=p>>>> </samp><kbd>s</kbd>
|
|
'server=mpilgrim;uid=sa;database=master;pwd=secret'
|
|
<samp class=p>>>> </samp><kbd>s.split(";")</kbd> <span>①</span>
|
|
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
|
<samp class=p>>>> </samp><kbd>s.split(";", 1)</kbd> <span>②</span>
|
|
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']</pre>
|
|
<ol>
|
|
<li><code>split</code> reverses <code>join</code> by splitting a string into a multi-element list. Note that the delimiter (“<code>;</code>”) is stripped out completely; it does not appear in any of the elements of the returned list.
|
|
<li><code>split</code> takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
<code><var>anystring</var>.<code>split</code>(<var>delimiter</var>, 1)</code> is a useful technique when you want to search a string for a substring and then work with everything before the substring (which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
|
|
|
|
|
|
|
|
</div>
|
|
-->
|
|
|
|
<h2 id=string-module>The <code>string</code> module</h2>
|
|
|
|
<p>[FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.]
|
|
|
|
<div class=s>
|
|
<p>When I first learned Python, I expected <code>join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the <code>join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code>string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like <code>lower</code>, <code>upper</code>, and <code>split</code>. But many hard-core Python programmers objected to the new <code>join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old <code>string</code> module (which still has a lot of useful stuff in it). I use the new <code>join</code> method exclusively, but you will see code written either way, and if it really bothers you, you can use the old <code>string.join</code> function instead.
|
|
</div>
|
|
|
|
<h2 id=byte-arrays>Strings vs. bytes</h2>
|
|
|
|
<h2 id=py-encoding>Character encoding of Python source code</h2>
|
|
|
|
<p>Python 3 assumes that your source code — <i>i.e.</i> each <code>.py</code> file — is encoded in <abbr>UTF-8</abbr>.
|
|
|
|
<blockquote class="note compare python2">
|
|
<p><span>☞</span>In Python 2, the default encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href="http://www.python.org/dev/peps/pep-3120/">the default encoding is <abbr>UTF-8</abbr></a>.
|
|
</blockquote>
|
|
|
|
<p>If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a <code>.py</code> file to be windows-1252:
|
|
|
|
<pre><code># -*- coding: windows-1252 -*-</code></pre>
|
|
|
|
<p>Technically, the character encoding override can also be on the second line, if the first line is a <abbr>UNIX</abbr>-like hash-bang command.
|
|
|
|
<pre><code>#!/usr/bin/python3
|
|
# -*- coding: windows-1252 -*-</code></pre>
|
|
|
|
<p>For more information, consult <a href="http://www.python.org/dev/peps/pep-0263/"><abbr>PEP</abbr> 263: Defining Python Source Code Encodings</a>.
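<p>If you are curious which encoding Python will pick for a given file, the standard library's <code>tokenize.detect_encoding()</code> function reads the first two lines of source and reports what it finds. Here is a sketch; the helper name <code>declared_encoding()</code> and the sample source bytes are invented for this example.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding the tokenizer detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# No declaration at all: the Python 3 default applies.
no_declaration = declared_encoding(b"print('hi')\n")

# Declaration on the first line.
first_line = declared_encoding(
    b"# -*- coding: windows-1252 -*-\nprint('hi')\n")

# Declaration on the second line, after a hash-bang line.
second_line = declared_encoding(
    b"#!/usr/bin/python3\n# -*- coding: windows-1252 -*-\nprint('hi')\n")

print(no_declaration, first_line, second_line)
```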
|
|
|
|
<h2 id=furtherreading>Further reading</h2>
|
|
|
|
<p>On Unicode in Python:
|
|
|
|
<ul>
|
|
<li><a href="http://docs.python.org/dev/3.0/howto/unicode.html">Python Unicode HOWTO</a>
|
|
<li><a href="http://docs.python.org/dev/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit">What's New In Python 3: Text vs. Data Instead Of Unicode vs. 8-bit</a>
|
|
</ul>
|
|
|
|
<p>On Unicode in general:
|
|
|
|
<ul>
|
|
<li><a href="http://www.joelonsoftware.com/articles/Unicode.html">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a>
|
|
<li><a href="http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode">On the Goodness of Unicode</a>
|
|
<li><a href="http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings">On Character Strings</a>
|
|
<li><a href="http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF">Characters vs. Bytes</a>
|
|
</ul>
|
|
|
|
<p>On character encoding in other formats:
|
|
|
|
<ul>
|
|
<li><a href="http://feedparser.org/docs/character-encoding.html">Character encoding in XML</a>
|
|
<li><a href="http://blog.whatwg.org/the-road-to-html-5-character-encoding">Character encoding in HTML</a>
|
|
</ul>
|
|
|
|
<p>On strings and string formatting:
|
|
|
|
<ul>
|
|
<li><a href="http://docs.python.org/dev/3.0/library/string.html"><code>string</code> — Common string operations</a>
|
|
<li><a href="http://docs.python.org/dev/3.0/library/string.html#formatstrings">Format String Syntax</a>
|
|
<li><a href="http://docs.python.org/dev/3.0/library/string.html#format-specification-mini-language">Format Specification Mini-Language</a>
|
|
<li><a href="http://www.python.org/dev/peps/pep-3101/"><abbr>PEP</abbr> 3101: Advanced String Formatting</a>
|
|
</ul>
|
|
|
|
<p class=c>© 2001–9 <a href=about.html><span>ℳ</span>ark Pilgrim</a>
|
|
<script src=jquery.js></script>
|
|
<script src=dip3.js></script>
|