I put off converting my Python code to Python3 for the usual reasons, including lack of 3rd-party library support and perceived issues with Unicode. But not all of my work is applications; a significant portion is crypto libraries. I should at least convert those to Python3 so I do not become part of the problem.
When hashing data for digital signatures, the data must always have a consistent representation, be of consistent length, and have zero endian issues. Change anything, even a single bit, and the hashes will not agree. While hashing Unicode is technically possible, the community is not sufficiently well versed on how to [consistently] do this correctly. (If in doubt, serious expertise should be consulted.) Likewise, software tools are not standardized and can produce varying results, yet interoperability is a major requirement.
In acknowledgement of this state of affairs, today's conventional wisdom says we should be hashing text in the lowest common denominator, ASCII text or perhaps Latin-1 single-byte encodings. KISS. Since hashes are seldom used in isolation (cryptographically), all my other crypto routines need to have consistent data passing protocols, the simpler the better. Unicode does not [easily] meet this requirement.
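To make the representation problem concrete, here is a minimal sketch showing that one piece of text hashes differently depending on how it is turned into bytes. SHA-256 and the sample string are my own choices for illustration; any hash and any non-ASCII text would show the same effect.

```python
import hashlib

text = "caf\u00e9"  # four characters; the last one is outside ASCII

# The same text becomes different byte sequences under different encodings.
as_latin1 = text.encode("latin-1")    # 4 bytes, one per character
as_utf8 = text.encode("utf-8")        # 5 bytes, the last character takes two
as_utf16 = text.encode("utf-16-le")   # 8 bytes, and byte order now matters

# Hash each byte sequence; the digests have nothing in common.
digests = {
    name: hashlib.sha256(data).hexdigest()
    for name, data in [("latin-1", as_latin1),
                       ("utf-8", as_utf8),
                       ("utf-16-le", as_utf16)]
}
for name in sorted(digests):
    print(name, digests[name])
```

Unless both sides of a signature agree on the exact bytes, verification fails even though the text "looks" identical.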
After a couple of half-hearted false starts I decided it best to start over, beginning with some serious homework. I chose for my first "for-real" conversion a relatively simple Blowfish crypto library, blowfish.py, and its test procedures. Here are a few lessons learned from that exercise.
Unicode may be a boring topic but do read these first.
http://diveintopython3.org/strings.html
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
http://diveintopython3.org/porting...with-2to3.html
And as usual, there's more via Google.
Converting blowfish.py took a while. Some of that was the learning curve, but in truth, getting it to work didn't take that much time. Defining simple, clean, efficient idioms, patterns, and guidelines did. Time spent here should make subsequent conversions much easier.
Perhaps not new to the community, but here is a short summary in terms that work for me.
- If textual data (perhaps reproduced from printed materials) appears in source code, is likely to be displayed or printed, or is otherwise associated with human consumption, be sure it is str (Unicode).
- If data is associated with processing, is to be passed as an argument or as a return value, or is to be communicated to other programs or stored in files for later processing, be sure it is bytes (bytestring).
- If test values are defined in hex, leave them as str for ease of importing into the code and for general readability, but convert them to bytes with .encode('latin-1') and unhexlify() before processing.
- If bytes need to be converted to str, use .decode().
- If bytes need to be converted to hex, use hexlify() and convert the result to str with .decode().
- If doing crypto and 8-bit byte bytestrings are important, consider the 'latin-1' encoding (AKA iso-8859-1). The high-order Latin characters may not always print the same across all operating systems, but latin-1 will always provide a single 8-bit byte representation for all values 0..255. ('ascii' and 'utf-8' both use a single byte for 0..127, but 'ascii' cannot encode 128..255 at all and 'utf-8' converts those values into a two-byte representation. Not good. Ignore 'utf-16' and 'utf-32'.)
- Be on the lookout for Unicode as the result of some default action somewhere. Databases are one common source. I worked with one DB that, unbeknownst to me, accepted byte strings (seen as ASCII characters), converted and stored them as Unicode, and returned UTF-16 when selected. The before and after hash values were very different. Consider using raw hex dumps/views when things don't make sense, because ASCII text and Unicode will often print the same. The DB I was using had parameters I could set, but use extreme care if your DB is already populated with data.
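The conversion idioms in the list above can be sketched roughly as follows under Python3. The helper names hex_to_bytes and bytes_to_hex and the sample value are made up for illustration; they are not taken from blowfish.py.

```python
from binascii import hexlify, unhexlify

def hex_to_bytes(hex_str):
    """str of hex digits -> bytes (hypothetical helper name)."""
    return unhexlify(hex_str.encode("latin-1"))

def bytes_to_hex(data):
    """bytes -> str of lowercase hex digits (hypothetical helper name)."""
    return hexlify(data).decode("latin-1")

# A made-up value kept as str for readability, converted before processing.
key = hex_to_bytes("0123456789abcdef")
round_trip = bytes_to_hex(key)

# latin-1 maps every value 0..255 to exactly one byte, and back again.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
back_again = as_text.encode("latin-1")

# utf-8 needs two bytes for a character in the 128..255 range.
utf8_len = len("\u00e9".encode("utf-8"))
latin1_len = len("\u00e9".encode("latin-1"))

# ascii refuses such a character outright.
try:
    "\u00e9".encode("ascii")
    ascii_accepts = True
except UnicodeEncodeError:
    ascii_accepts = False
```

The hex form of a value, as produced by bytes_to_hex(), is also a handy raw view when two strings print the same but hash differently.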
Additional code changes will be necessary, some supported by 2to3 and some very manual. Thus far the most awkward one was zip(). More on this later. After you get things working under Python3, go back and try to get backward compatibility with Python 2.6 and 2.7 by adding
from __future__ import print_function
Mod and morph as appropriate. There will be a few exceptions but do try to get the same code to work under Python2 and Python3. (Don't forget to re-test when finished.)
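As one illustration of code that behaves the same under both major versions, here is a small sketch. It is an assumption on my part that this is the kind of zip() difference involved: under Python2 zip() returns a list, under Python3 a one-shot iterator. The xor_bytes helper is hypothetical and not part of blowfish.py. In a real module the __future__ import shown above would be the first statement; it is left out here because the sketch does not print anything.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings (hypothetical helper)."""
    # Iterating bytes gives 1-character strings on Python2 but ints on
    # Python3; wrapping in bytearray gives ints on both.
    pairs = zip(bytearray(a), bytearray(b))
    return bytes(bytearray(x ^ y for x, y in pairs))

result = xor_bytes(b"\x0f\xf0", b"\xff\xff")

# If the pairs are needed more than once, materialize them explicitly.
# list(zip(...)) is a plain list on both Python2 and Python3.
materialized = list(zip([1, 2], [3, 4]))
first_pass = [x + y for x, y in materialized]
second_pass = [x * y for x, y in materialized]
```

Wrapping in list() costs a copy, so it is only worth doing where the result is indexed or traversed more than once.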
Later, I ran blowfish.py with the time command. I was somewhat surprised by the results. (Subsequent re-runs provided very similar times.)
| Interpreter | Time   | Notes                          |
|-------------|--------|--------------------------------|
| python 2.5  | 4.276s | Python2 code, 32-bit w/o psyco |
| python 2.5  | 1.800s | Python2 code, 32-bit w/psyco   |
| python 2.6  | 2.655s | 64-bit                         |
| python 2.7  | 2.705s | 64-bit                         |
| pypy (2.7)  | 2.192s | 64-bit                         |
| python 3.2  | 1.783s | 64-bit                         |
I expect the results to be skewed even further when I encrypt larger data.