Python unicode and the problem with the terminal

Posted 24 May 2010 — by maxhaussler
Category linux, python

There are tons of tutorials on how to use Unicode in Python, but they often fail to mention the problem of terminals, which applies to most programming languages that I know. If you find this boring now, I fully understand, but once you’re obliged to work with Unicode (as we are, since we’re mining scientific articles here), these things become important: they can cost a lot of time to fix, and they concern every place where data is input or output. Reading Unicode data from the keyboard or from files into variables is no problem.

For output, basically, it’s simple:

  • To convert a normal string (with UTF-8 encoded characters in it) to a real unicode string (a different data type), use string.decode("utf8"). This is counter-intuitive to me: why is the method not called “toUnicode” or “encode”?
  • To convert from unicode to a normal string, use string.encode("utf8").
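
A minimal sketch of this round trip, written for Python 3, where the "normal string" of this post corresponds to `bytes` and the "unicode string" to `str` (in Python 2 the same two method calls apply to `str` and `unicode`). The sample word and the variable names are only illustrations.

```python
# UTF-8 bytes of the word "Ősz": what this post calls a "normal string".
raw = b"\xc5\x90sz"

# decode: bytes -> unicode text (a sequence of code points)
text = raw.decode("utf8")

# encode: unicode text -> bytes
round_trip = text.encode("utf8")

print(len(raw))           # number of bytes; Ő takes two bytes in UTF-8
print(len(text))          # number of characters
print(round_trip == raw)  # encoding the decoded text restores the bytes
```

One way to remember the direction: bytes are the encoded form, so you decode them to get text and encode text to get bytes.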

If you have a normal string with non-ASCII characters in it, printing it to a terminal or writing it to a file can go wrong, because the default encoding in Python is “ASCII”. As soon as Python has to convert the string implicitly, you get an error such as “UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...”: Python assumes that the string is ASCII, but it contains bytes above 127, and Python does not know how to interpret these strange characters. There are two solutions:

  • For the terminal, tell Python that you are actually using a Unicode-capable terminal, as described here. Open files with the right “codec” as described here.
  • Alternatively: run the .decode("utf8") method on all strings and they will display like this: u'\u0150sz', which is not great, but at least there are no error messages anymore.
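
The first solution can be sketched as follows, assuming UTF-8 is the encoding you want. `codecs.getwriter` and `io.open` are standard-library calls that exist in both Python 2 and Python 3; the in-memory buffer and the temporary file are only stand-ins for a real terminal and a real data file, and the file name is hypothetical.

```python
import codecs
import io
import os
import tempfile

text = u"\u0150sz"  # the word "Ősz" as a unicode string

# Stand-in for the terminal's byte stream.
fake_terminal = io.BytesIO()
writer = codecs.getwriter("utf8")(fake_terminal)
writer.write(text)  # encoded to UTF-8 on the way out
terminal_bytes = fake_terminal.getvalue()

# Files: name the codec when opening, then read and write unicode strings.
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file
with io.open(path, "w", encoding="utf8") as handle:
    handle.write(text)
with io.open(path, "r", encoding="utf8") as handle:
    read_back = handle.read()

print(terminal_bytes == b"\xc5\x90sz", read_back == text)
```

In Python 2, one common way to apply the same idea to the real terminal is `sys.stdout = codecs.getwriter("utf8")(sys.stdout)`. This may or may not be the exact recipe the links above describe.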

If you tell Python that your terminal accepts Unicode, make sure that this is true: the OS X Terminal does by default, gnome-terminal does (look in Terminal – Encoding), and xterm does not but can be set to display it.
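
A rough way to see what Python itself believes about the terminal is to look at `sys.stdout.encoding` and at the locale. This is only a sketch: it reports Python's assumption, not whether the terminal really renders the characters, and the helper `can_encode` is a hypothetical name introduced here.

```python
import locale
import sys

# Encoding Python thinks stdout uses; may be None when output is piped.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding suggested by the locale settings (LANG, LC_ALL, ...).
locale_encoding = locale.getpreferredencoding()


def can_encode(text, encoding):
    """Return True if text can be encoded with the named codec."""
    if not encoding:
        return False
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # UnicodeEncodeError: the codec cannot represent some character.
        # LookupError: the codec name is unknown.
        return False


print("stdout:", stdout_encoding)
print("locale:", locale_encoding)
print("can show the sample word:", can_encode(u"\u0150sz", stdout_encoding))
```

If the last line prints False, telling Python that the terminal is Unicode-capable will not help until the terminal or locale settings are changed as well.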

I know that you’re very eager to try this out. But if you want to test this now in your terminal, remember that if you copy and paste the word Ősz into your Python interpreter or a source code file, Python does not know what you’re thinking, even though it is obvious to you that this is Unicode. You cannot write

print "Ősz"

because, again, this is supposed to be a 0-127 ASCII string, but the first letter has a code above 127. So you either have to write u'Ősz' or use the decode function again:

print 'Ősz'.decode("utf8")
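
The difference between the failing and the working variant can be reproduced without a terminal. The sketch below is written for Python 3, where the bytes have to be spelled out explicitly; it assumes the pasted word arrives as UTF-8, which is typical but not guaranteed.

```python
raw = b"\xc5\x90sz"  # UTF-8 bytes of the sample word

# Treating the bytes as ASCII fails: the first byte is above 127.
error_name = None
bad_position = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
    bad_position = exc.start  # index of the first offending byte

# Decoding with the right codec gives the same value as the u'...' literal.
same = raw.decode("utf8") == u"\u0150sz"

print(error_name, bad_position, same)
```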