A memo on Python encodings, since they trip me up so often

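I'm at Python Onsen, working my way through the copy of Real World Haskell I was given a while ago. This has nothing to do with Haskell, but I just had the errors around Unicode and utf-8 explained to me by the Saboten person (さぼてんの人), @tokibito, and @whosaysni, so here is a memo.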

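In Python (Python 2, as used here), a string (str) is a sequence of bytes, whereas for a Unicode string the programmer does not have to care which internal representation is used. That was the key point. The concrete byte sequence has to be specified at two moments: when a Unicode string is created (*1), and when a Unicode string leaves Python (output to a terminal, or redirection to a file).
Concretely, a script can work when run in a terminal and still raise an error when its output is redirected to a file. A terminal tells Python which byte sequence to use when writing a Unicode string, but a file does not, so the default encoding is used (in most cases ascii) and the encoding step fails.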
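
The difference between a terminal and a redirected file can be inspected directly. The following is a minimal sketch, not part of the original script; `describe_stream` is a hypothetical helper name introduced only for illustration.

```python
import sys


def describe_stream(stream):
    """Return (is_terminal, encoding) for a file-like object.

    Hypothetical helper for illustration only.
    """
    # isatty() is True only when the stream is attached to a terminal.
    is_terminal = hasattr(stream, "isatty") and stream.isatty()
    # Streams that do not know their encoding have no such attribute or report None.
    encoding = getattr(stream, "encoding", None)
    return is_terminal, encoding


# Report on stderr so the message stays visible even when stdout is redirected.
is_terminal, encoding = describe_stream(sys.stdout)
sys.stderr.write("stdout is a terminal: %s, encoding: %s\n" % (is_terminal, encoding))
```

Under Python 2, running this with stdout piped or redirected reports an encoding of None, which is exactly the case where print falls back to ascii. Python 3 chooses an encoding for redirected output on its own, so it reports a value there as well.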

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# A Unicode string is an abstract sequence of characters;
# the programmer does not need to know its internal representation.
# If you omit the 2nd line of this file (the coding declaration),
# Python cannot know what the bytes of the literal (鈴木) mean.

unicode_string = u"鈴木"

# When printing, Python encodes the string into bytes for the terminal,
# but only if the terminal reports its encoding.
# If the output is not a terminal, or the terminal does not report one,
# Python 2 falls back to the default codec (usually ascii) and fails here.
print(unicode_string)


# In Python 2, a byte array and a string (str) are the same thing.
# The same text yields different bytes depending on the encoding.
utf8_bytearray = unicode_string.encode('utf-8')
sjis_bytearray = unicode_string.encode('sjis')

print(utf8_bytearray) # utf-8 terminal can print this byte array
print(sjis_bytearray) # utf-8 terminal cannot print this

print("---------")

# These are byte arrays. Their lengths are not the same
print("len(utf8_bytearray): %d" % len(utf8_bytearray))
print("len(sjis_bytearray): %d" % len(sjis_bytearray))

print("---------")

unicode_string_from_sjis = sjis_bytearray.decode('sjis')
unicode_string_from_utf8 = utf8_bytearray.decode('utf8')
# These are Unicode strings. Their lengths are the same
print("len(unicode_string_from_sjis): %d" % len(unicode_string_from_sjis))
print("len(unicode_string_from_utf8): %d" % len(unicode_string_from_utf8))

suztomo@SuzAir.local ~/srm
~/srm $ echo $LANG                                           git[branch:master]
ja_JP.UTF-8
suztomo@SuzAir.local ~/srm
~/srm $ ./encode.py                                          git[branch:master]
鈴木
鈴木
---------
len(utf8_bytearray): 6
len(sjis_bytearray): 4
---------
len(unicode_string_from_sjis): 2
len(unicode_string_from_utf8): 2
suztomo@SuzAir.local ~/srm
~/srm $ ./encode.py|hoge.txt                                 git[branch:master]
zsh: command not found: hoge.txt
Traceback (most recent call last):
  File "./encode.py", line 15, in <module>
    print(unicode_string)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

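When the output is redirected to a file, the error occurs on the line "print(unicode_string)". (The session above uses `|` rather than `>`, but a pipe behaves the same way here: stdout is not a terminal.) Python no longer knows which byte sequence to use when sending the Unicode string to the outside, so it tries the default ascii encoding and fails.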
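
One way around this is to encode explicitly before writing, so that Python never has to guess. This is a minimal sketch under stated assumptions: `write_encoded` is a hypothetical helper, and the utf-8 fallback is my assumption, not something the original script prescribes.

```python
# -*- coding: utf-8 -*-
import io


def write_encoded(binary_stream, text, stream_encoding=None, fallback="utf-8"):
    """Encode a Unicode string and write the bytes to a binary stream.

    Hypothetical helper: uses stream_encoding when it is known,
    otherwise the fallback (utf-8 here, an assumption).
    Returns the number of bytes written.
    """
    data = text.encode(stream_encoding or fallback)
    binary_stream.write(data)
    return len(data)


# An in-memory byte stream stands in for a redirected stdout.
buf = io.BytesIO()
written = write_encoded(buf, u"鈴木")
```

In a real script the binary stream would be sys.stdout on Python 2, or sys.stdout.buffer on Python 3, and stream_encoding would come from sys.stdout.encoding.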

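When specifying utf-8 for every output string is too tedious, the approach from the article "Python でUTF-8, shift_jis, euc_jpなど日本語を使う方法" (how to use Japanese encodings such as UTF-8, shift_jis, and euc_jp in Python) can be used.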

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import codecs

# Decode bytes read from stdin as euc_jp,
# and encode Unicode written to stdout as shift_jis.
sys.stdin  = codecs.getreader('euc_jp')(sys.stdin)
sys.stdout = codecs.getwriter('shift_jis')(sys.stdout)

for line in sys.stdin:
    print line,

*1: In this example, thanks to the declaration on the 2nd line, Python recognizes the bytes of "鈴木" in the file as characters encoded in utf-8.