Working With Non-Unicode Data in Python

Being a researcher in Japan means I often have to work with Japanese data. While generally data is data is data, I have come across some peculiarities that seem to stem from the fact that those data are about, and produced in, Japan.

Firstly, there is the way they are delivered. I’m not so much talking about deliveries on “hard media” such as CD-ROMs and DVDs being snail-mailed, even though this seems to be the major way of obtaining data to this day. Luckily I’m embedded in an ecosystem of research institutions and university laboratories that engage in joint research projects and therefore share the necessary datasets online through portal websites. I’d especially like to mention the JoRAS portal of the Center for Spatial Information Science (CSIS) at the University of Tokyo (東京大学) here, since their stock is quite extensive and they are always open to collaboration inquiries.

Secondly, there is the fact that, not very surprisingly, Japanese datasets often contain Japanese data. By this I’m not referring to the fact that these data deal with information about Japan, but to the fact that they make use of Japanese script. This introduces some technical difficulties, which I would like to elucidate in this article.
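
To give a first taste of the kind of difficulty meant here, this is a minimal sketch in Python 3. It assumes Shift_JIS as the legacy encoding, purely for illustration; the datasets discussed in this article may well use a different one. The variable names are hypothetical.

```python
# Assumption: Shift_JIS stands in for whatever non-Unicode encoding a
# given dataset actually uses; it is only an illustrative choice.
text = "東京大学"

# The same four characters become different byte sequences depending on
# the encoding: two bytes per kanji in Shift_JIS, three in UTF-8.
legacy_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")
print(len(legacy_bytes), len(utf8_bytes))

# Reading the legacy bytes as if they were UTF-8 fails outright.
try:
    legacy_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print("decoded as UTF-8:", decoded_ok)

# Naming the right codec recovers the original text.
roundtrip = legacy_bytes.decode("shift_jis")
print(roundtrip == text)
```

The point is simply that bytes carry no label saying which encoding produced them, so the reader of a file has to know or guess it.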

