For my research I’m recently doing a lot of graph and network analysis. So far the tools in ESRI ArcGIS have been sufficient for what I was trying to achieve. I used their Network Analyst Extension and also the Urban Network Analysis toolbox by the City Form Lab at the MIT/SUTD for a more scientific application: the calculation of centrality measures. While these warrant some more in-depth articles in themselves, here I’d like to put a more technical focus on a really annoying problem when using Gephi on a more recent MacOS X system.
Today I was faced with the task of having to load a massive amount of shape files into my PostGIS database. The data in question is the Advanced Digital Road Map Database (ADF) (拡張版全国デジタル道路地図データベース) by Sumitomo Electric System Solutions Co., Ltd. (住友電工システムソリューション株式会社). It contains very detailed information (spatial and attributive) about the road network of all Japan and is thereby quite heavy.
Therefore, it was split into a plethora of files using the following naming schema: mmmmmm_ttt.shp, where mmmmmm represents a six-digit mesh code and ttt represents a 2- to 3-digit thematic code. The mesh code is a result of the data being split spatially into small, rectangular chunks. It follows a simple logic, whereby bigger mesh units (represented by the first four digits) are further subdivided into smaller units (represented by the last two digits). It took only a small amount of time to figure out this naming schema and filter the files that would be necessary for my analysis.
Basically I wanted to merge the shape files into PostGIS tables divided by their topic (i.e. road nodes, road links, additional attribute information, etc.). So I had to find a way to batch import the shape files into PostGIS and merge them at the same time. Yet, since the node IDs were only unique within each mesh unit (i.e. shape file), I also had to find a way to incorporate the mesh codes themselves into the data, so I could later on create my own ID schema for the nodes, based on the mesh code and the original node ID (e.g. mmmmmmnnnnn, where mmmmmm represents a six-digit mesh code and nnnnn represents the original 5-digit node ID).
Being a researcher in Japan means I often have to work with Japanese data. While generally data is data is data, there are some peculiarities I came across that seem to be related to the fact that those data are about and produced in Japan.
Firstly there is the way they are delivered. I’m no so much talking about deliveries on “hard media” such as CD-ROMs and DVDs being snail-mailed, even though this seems to be the major way of obtaining data until this day. Luckily I’m embedded in an ecosystem of research institutions and university laboratories that engage in joint research projects and thereby share the necessary datasets online using portal websites. I’d especially like to mention the JoRAS portal of the Center for Spatial Information Science (CSIS) at the University of Tokyo (東京大学) here, since their stock is quite extensive and they are always open for collaboration inquiries.
Secondly there is the fact that, not very surprising, Japanese datasets often contains Japanese data. By this I’m not referring to the fact that this data is dealing with information about Japan, but to the fact that it is making use of Japanese script. This introduces some technical difficulties, which I would like to elucidate in this article.