bor.borygmus

A programming weblog by Hao Lian. • A long walk through an angry forest. • A series of memory leaks brought on by senility.

2to3 and unittest have likely done untold wonders in keeping together the families of library authors porting from Python 2 to Python 3, taking care of the low-hanging fruit and leaving the most insidious and therefore fun bugs to root out. The biggest category of these would be Unicode, and out of that category, the biggest subcategory would be relying on some quirk of representing textual data in bytes when it should have been Unicode all along.

But then there’s this: What if you need to represent bytes? Take hashlib, which in Python 3 will flat-out refuse to consume a Unicode string. There’s a byte literal in Python 3, but is there a way to get there from 2to3 and Python 2?

A detour: For 2to3 library compatibility, your source code should (1) completely work in Python 2; and (2) completely work in Python 3 after I run 2to3 on your source directory. Some libraries have opted out—that is, cheated—and asked the user to patch the source with a special diff after installation. Cough, BeautifulSoup. But I say foo foo to that.

Back to the matter at hand: Representing bytes in Python 2 is easy.

import hashlib
hashlib.sha224('cuckoo').hexdigest()

2to3 will take this and produce

# post-2to3
# Remember, string literals are now Unicode.
import hashlib
hashlib.sha224('cuckoo').hexdigest()
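
To see why that innocent-looking output is a problem, here is a minimal sketch meant to be run under Python 3. The helper name try_hash is made up for illustration and is not part of hashlib; it just reports whether hashing succeeded.

```python
import hashlib

def try_hash(data):
    """Return the SHA-224 hex digest, or the name of the exception raised.

    try_hash is a hypothetical helper for this demonstration only.
    """
    try:
        return hashlib.sha224(data).hexdigest()
    except TypeError as exc:
        return type(exc).__name__

# In Python 3 a plain string literal is Unicode, which hashlib rejects.
text_result = try_hash('cuckoo')

# A bytes literal is what hashlib actually wants.
bytes_result = try_hash(b'cuckoo')

print(text_result)
print(bytes_result)
```

The first call should report a TypeError and the second should produce an ordinary hex digest.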

Fine, you say, I’ll just make it extra explicit. Surely 2to3 is smart enough to recognize bytes in this:

hashlib.sha224(str('cuckoo')).hexdigest()

2to3 is not:

# post-2to3
# Remember, string literals are now Unicode, and so is str().
hashlib.sha224(str('cuckoo')).hexdigest()

The problem is worth a reformulation: In 2to3-land, there is only one type of string literal. Both the lonely 'cuckoo' and the u'cuckoo' map to Unicode in Python 3. How do you map to a byte literal then using only these tools?

The best I’ve come up with is to dirty the code with this trick:

hashlib.sha224(u'cuckoo'.encode('ascii')).hexdigest()

After 2to3, it becomes:

# post-2to3
hashlib.sha224('cuckoo'.encode('ascii')).hexdigest()

Which is, for all intents and purposes, equivalent to the following, modulo a small performance hit:

# post-2to3
hashlib.sha224(b'cuckoo').hexdigest()

Downsides: it’s ugly, and it depends on your file encoding for anything beyond ASCII. (File encoding is the encoding you’ve written your source code in. Unless your job security is particularly lacking or your editor is extremely poor, you should choose UTF-8. UTF-16 maybe.) This trick scales up: encode('[your encoding here]') instead of encode('ascii') for these more complicated bytes.
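
A quick sketch of both halves of that claim, assuming Python 3 (that is, the code as it would look after 2to3). The word used for the non-ASCII case is an arbitrary example of mine; the point is only that the bytes you end up with depend on the codec you name.

```python
import hashlib

# The ASCII case: encoding the text gives the same bytes as a bytes literal,
# so the digests agree.
ascii_bytes = 'cuckoo'.encode('ascii')
same_digest = (hashlib.sha224(ascii_bytes).hexdigest()
               == hashlib.sha224(b'cuckoo').hexdigest())

# Beyond ASCII, the codec decides which bytes come out.
word = 'caf\u00e9'                      # "cafe" with an acute accent on the e
utf8_bytes = word.encode('utf-8')       # two bytes for the accented letter
latin1_bytes = word.encode('latin-1')   # one byte for the accented letter

print(same_digest, utf8_bytes, latin1_bytes)
```

Different bytes mean different hashes, so pick the codec deliberately.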

You may already notice a problem: what if the bytes you’re trying to represent aren’t textual? What if you just want the byte corresponding to 128 and don’t want to jump through the hoops of decoding it to figure out what to encode? Or what if it doesn’t decode to your file encoding because it’s invalid to do so: Picture an imaginary Unicode encoding that writes two bytes for every Unicode character such that the two bytes add up to a prime number. (Note to self: awesome.)

As a concrete example, take BeautifulSoup’s code here:

smart_quotes_re = "([\x80-\x9f])"
smart_quotes_compiled = re.compile(smart_quotes_re)

which looks like this after 2to3:

# post-2to3 (quotes are Unicode now)
smart_quotes_re = "([\x80-\x9f])"
smart_quotes_compiled = re.compile(smart_quotes_re)

BeautifulSoup uses a diff to worm its way out of this. (In fairness, this occurs many times in their code, and doing this 80 times in 80 lines of code gets ugly fast.) But can we find a solution that works with 2to3? Can we, against all odds and standing on this cliff in front of a glorious sunset, conquer the pervasive, sticky, evil that has overcome the triumphant splendor that is the Python kingdom, that which we have constructed with our bare goodly hands to stand the regimen of time and entropy?

That is, is there a general encoding in Python, which Python will call a “codec”, that can represent all bytes without ever complaining? Yes, Eugene, there is. Hidden away beneath a dusty, oft-overlooked documentation page are these three words: raw_unicode_escape.

You see, that Python even has a Unicode literal is a pretty outlandish decision. If you think about it, most other languages don’t: C, PHP, Perl, Ruby, Classical Latin. In these languages, either the designer or the community chooses a default encoding, usually UTF-8. You can convert from UTF-8 to UTF-16 and then back, but there’s never a blessed intermediate. (In some weird and wild languages you can create one, and then promptly not document it on the assumption that API documentation generated from comments plus freely available source code are all anyone ever needs.)

Python has one, and as the codecs module points out, it’s an encoding in its own right. And it’s a pretty tame encoding: each Unicode code point (character) is assigned a number by the Unseen Unicode Gods. These UUG numbers vary wildly, from 0 all the way up to 0x10FFFF, which is far more than fits in a byte; hence the problem. When you create an array in memory, you need to know how big each element is in order to achieve any efficiency in low-level languages. In a high-level language, you can do what Python does: make an array data structure that doesn’t care how big each code point is and call it a Unicode string and give it a Unicode literal. (Aside: these statements have never been backed by a look into the Python source code and should in all respects be treated as a fun metaphor for what really happens.)

Back to the task at hand: how do you get the hexadecimally represented byte 0x80 when you only have Unicode literals to work with? Simple:

smart_quotes_re = u"([\x80-\x9f])".encode('raw_unicode_escape')
smart_quotes_compiled = re.compile(smart_quotes_re)

After 2to3:

# post-2to3.
smart_quotes_re = "([\x80-\x9f])".encode('raw_unicode_escape')
smart_quotes_compiled = re.compile(smart_quotes_re)

# Or, equivalently:
smart_quotes_re = b"([\x80-\x9f])"
smart_quotes_compiled = re.compile(smart_quotes_re)
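
As a sanity check that the resulting pattern really matches raw bytes, here is a small Python 3 sketch. The sample input is invented for the demonstration: it wraps a word in the bytes 0x93 and 0x94, which are the Windows-1252 curly double quotes.

```python
import re

# Build the bytes pattern from a text literal, as in the post-2to3 code.
smart_quotes_re = "([\x80-\x9f])".encode('raw_unicode_escape')
smart_quotes_compiled = re.compile(smart_quotes_re)

# Made-up sample input: Windows-1252 curly quotes around a word.
sample = b'\x93quoted\x94'
found = smart_quotes_compiled.findall(sample)

print(found)
```

A bytes pattern can only search bytes, so the input has to be bytes as well.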

Note that the unicode_escape codec won’t work: it turns the single character U+0080 into the four ASCII bytes of its escape sequence (a backslash, then x, 8, and 0), because unicode_escape thinks we want backslashes.
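
The difference between the two codecs is easy to see side by side. A minimal sketch, assuming Python 3:

```python
ch = '\x80'  # a single character, U+0080

# raw_unicode_escape passes code points below 256 straight through as bytes.
raw = ch.encode('raw_unicode_escape')

# unicode_escape spells the character out as a backslash escape instead.
esc = ch.encode('unicode_escape')

print(len(raw), raw)
print(len(esc), esc)
```

One byte versus four: only the first is any use for byte-level work.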

And we can quickly verify this phenomenon in the interpreter:

> python2.6
>>> u'\x80'.encode('raw_unicode_escape') == '\x80'
True
>>> ^D
> python3.1
>>> '\x80'.encode('raw_unicode_escape') == b'\x80'
True
>>> ^D
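
The same check can be stretched over every possible byte value. This is a sketch for Python 3 only, and it also shows the one caveat I am aware of: code points above 0xFF do not become raw bytes, so the trick is only good for characters in the range 0x00 to 0xFF.

```python
# Every code point below 256 should map to the byte with the same value.
all_chars = ''.join(chr(i) for i in range(256))
all_bytes = all_chars.encode('raw_unicode_escape')
covers_every_byte = (all_bytes == bytes(range(256)))

# Caveat: anything above 0xFF comes out as a backslash-u escape sequence.
euro = '\u20ac'.encode('raw_unicode_escape')

print(covers_every_byte)
print(euro)
```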

No diffs. Hooray!

(July 4, 2009)
