Today I was faced with the task of loading a massive amount of shape files into my PostGIS database. The data in question is the Advanced Digital Road Map Database (ADF) (拡張版全国デジタル道路地図データベース) by Sumitomo Electric System Solutions Co., Ltd. (住友電工システムソリューション株式会社). It contains very detailed information (spatial and attributive) about the road network of all Japan and is accordingly quite large.
Therefore, it was split into a plethora of files using the following naming schema: mmmmmm_ttt.shp, where mmmmmm represents a six-digit mesh code and ttt represents a 2- to 3-digit thematic code. The mesh code is a result of the data being split spatially into small, rectangular chunks. It follows a simple logic, whereby bigger mesh units (represented by the first four digits) are further subdivided into smaller units (represented by the last two digits). It took only a small amount of time to figure out this naming schema and filter the files that would be necessary for my analysis.
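To make the naming schema concrete, here is a minimal Python sketch of how such a filename can be split into its mesh code and topic code. The function name and sample filenames are my own illustration, not part of the original dataset or workflow:

```python
import re

# mmmmmm_ttt.shp: six-digit mesh code, 2- to 3-digit thematic (topic) code
FILENAME_RE = re.compile(r"^(?P<mesh>\d{6})_(?P<topic>\d{2,3})\.shp$")

def parse_adf_filename(name):
    """Return (mesh_code, topic_code) for a matching filename, else None."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return m.group("mesh"), m.group("topic")

# e.g. parse_adf_filename("554160_314.shp") -> ("554160", "314")
```

A helper like this also makes it easy to filter out files that don't follow the schema before processing.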
Basically I wanted to merge the shape files into PostGIS tables divided by their topic (i.e. road nodes, road links, additional attribute information, etc.). So I had to find a way to batch import the shape files into PostGIS and merge them at the same time. Yet, since the node IDs were only unique within each mesh unit (i.e. shape file), I also had to find a way to incorporate the mesh codes themselves into the data, so I could later on create my own ID schema for the nodes, based on the mesh code and the original node ID (e.g. mmmmmmnnnnn, where mmmmmm represents a six-digit mesh code and nnnnn represents the original 5-digit node ID).
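The composite ID schema described above can be sketched in a couple of lines of Python. This is a hypothetical helper, assuming (as described) six-digit mesh codes and mesh-local node IDs of up to five digits:

```python
def make_node_id(mesh_code, node_id):
    """Build a globally unique node ID mmmmmmnnnnn from a six-digit mesh
    code and the original (mesh-local) node ID, zero-padded to five digits."""
    return "%06d%05d" % (int(mesh_code), int(node_id))

# e.g. make_node_id("554160", 42) -> "55416000042"
```

Because the mesh code occupies a fixed-width prefix, the resulting IDs stay unique across the whole merged table.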
Preparing the Data
My first endeavor, hence, was to find a way to automate this process of adding a new field to all the shape files and entering the respective mesh code. I found an excellent tool on the ESRI ArcScripts repository that did exactly what I wanted. It’s a neat little Python script aptly named “Filename Insert” written by Rocky Rudolph, to whom all the credit is due. I slightly modified his excellent script to better fit my needs: while the original script added the whole filename (e.g. “554160_314.shp”) to a field called “FILENAME”, I filtered out only the 6-digit mesh code (e.g. “554160”) and wrote it to a new field called “filename”, following my habit of using all-lowercase field names in my database. Also, I added a printout of the filename currently being processed – but that’s just because I love to see something going on in my console windows during extensive data crunching processes.
```python
#-----------------------------------------------------------------------------------------------------------
# File Name Insert
# Script written by Rocky Rudolph - April, 2006 - Channel Islands National Park, California
# Purpose: for creating a field called "FILENAME" and attaching the filename of the shapefile to each entry
#   in the attribute table. Use with as many shapefiles within a specified directory.
#   Useful when picking apart shapefile entries and combining into a separate file to
#   maintain a breadcrumb trail of the original shapefile name.
#
# Run file within a directory containing all the shapefiles needing modification
# Make sure the filename field doesn't already exist in any of the shapefiles
#-----------------------------------------------------------------------------------------------------------

# import relevant modules, create geoprocessing dispatch object
import win32com.client, sys, string, os

gp = win32com.client.Dispatch("esriGeoprocessing.gpDispatch.1")

# Remember to change this to wherever your shapefiles are stored
gp.workspace = "W:\\Data\\adfimport"

try:
    fcs = gp.ListFeatureClasses("*", "all")
    fcs.reset()
    fc = fcs.Next()
    while fc:
        print "Processing: " + fc

        # Create the new field
        gp.AddField_management(fc, "filename", "text", "", "", "50")

        # Apply the six-digit mesh code (the first six characters of the filename) to all entries
        gp.CalculateField_management(fc, "filename", '"' + fc[:6] + '"')

        fc = fcs.Next()
except:
    print gp.GetMessages()
```
I had to run this script in my Windows VM (remember I’m on Mac OS X), since it makes use of some ESRI ArcGIS libraries and hence needs a licensed copy of ArcGIS. I’m sure there is an easy way to perform this task by automating QGIS in a similar way to ArcGIS, but since my main goal was fast turnaround, not a perfect workflow, I deemed this solution good enough.
The next task was to import and merge the shape files into tables in my PostGIS database. As mentioned above, I wanted them to be stored in individual tables per topic, but merged across the mesh units. After some searching on the web I found a useful script in the OpenGeo Suite User Manual that makes use of shp2pgsql and psql in bash (“command line” in Windows speak) using two loops. That information, together with the information from the BostonGIS PostGIS 2.0 pgsql2shp shp2pgsql Command Line Cheatsheet, allowed me to put together this neat little 6-liner to do exactly what I wanted:
```bash
#!/bin/bash

shp2pgsql -s 4301 -g geom -W SJIS -p 523800_31.shp adf.adf31 | psql -p 5432 -h localhost -U postgres -d maindb

for f in *.shp
do
    shp2pgsql -s 4301 -g geom -W SJIS -a $f adf.adf31 | psql -p 5432 -h localhost -U postgres -d maindb
done
```
First of all, please note that by piping I was able to accomplish the two tasks of calling shp2pgsql and psql in one line, thereby eliminating the second loop entirely and hence further improving runtime. Also, please note that this simple script assumes that all the input files use the same coordinate reference system (CRS), in my case EPSG:4301 (Tokyo Datum), provided via the -s parameter. Next, I defined the geometry column in my PostGIS database to be “geom” using the -g parameter, since I’m using this name across all my PostGIS databases. Finally, the -W parameter allowed me to tell shp2pgsql which character encoding the original shape file (more precisely its DBF attribute database) was using, in order to be able to convert it correctly to UTF-8, the encoding my PostgreSQL database uses. It actually took me some time to figure out which encoding was the correct one, since the original dataset unfortunately is quite poorly documented. A list in the official PostgreSQL documentation helped me greatly in narrowing down and finally identifying SJIS as the correct one in this case.
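The re-encoding that -W SJIS triggers can be illustrated with a few lines of Python. The sample string is my own illustration, not taken from the dataset:

```python
# "東京" (Tokyo) encoded in Shift-JIS, as the bytes might appear in a DBF file
raw = b"\x93\x8c\x8b\x9e"

# Decode with the correct source encoding, then re-encode as UTF-8 --
# conceptually what shp2pgsql -W SJIS does for attribute values.
text = raw.decode("shift_jis")
utf8 = text.encode("utf-8")

# Decoding the same bytes with the wrong source encoding would raise an
# error or silently produce mojibake, which is why identifying the correct
# encoding up front matters.
```

This is why a wrong -W value doesn’t just look ugly: the mangled attribute text ends up permanently stored in the database.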
Beyond that, the nice thing about this script is the use of the -p and -a parameters in the two main lines of code. The -p parameter tells shp2pgsql to prepare a new table with the schema found in the imported shape file without actually writing any of its contents to the database. This allowed me to prepare the database tables for the bulk import of the actual data. The -a parameter inside the loop then tells shp2pgsql to append the data from each shape file to the given database table, thereby merging all the data into one table.
Since I wanted to split the data by topic, but didn’t want to spend any more time on tweaking the script to detect and handle the topics (remember: the last two to three digits of each filename, the text after the “_”), I just manually split the files by topic into individual folders on my disk and processed each topic individually, after modifying the filename schemas and database table names in the scripts accordingly. There were only six topics, so the amount of work was manageable. In order to be able to run the import overnight I also quickly wrote a script that would run all the individual import scripts in sequence:
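For the record, the manual sorting could also be scripted. A small Python sketch (a hypothetical helper, not part of my actual workflow) that moves each file set into a folder named after its topic code might look like this:

```python
import os
import re
import shutil

def split_by_topic(src_dir):
    """Move every mmmmmm_ttt.* file into a subdirectory named after its
    topic code ttt, so each topic can be imported separately."""
    for name in os.listdir(src_dir):
        base, _ext = os.path.splitext(name)
        m = re.match(r"^\d{6}_(\d{2,3})$", base)
        if m is None:
            continue  # skip files that don't follow the naming schema
        topic_dir = os.path.join(src_dir, m.group(1))
        os.makedirs(topic_dir, exist_ok=True)
        shutil.move(os.path.join(src_dir, name), os.path.join(topic_dir, name))
```

Since the matching is done on the base name, the sidecar files belonging to each shape file (.dbf, .shx, .prj, etc.) are moved along with the .shp itself.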
```bash
cd 31
sh _adfimport.sh
cd ..
cd 314
sh _adfimport.sh
cd ..
cd 317
sh _adfimport.sh
cd ..
cd 32
sh _adfimport.sh
cd ..
cd 331
sh _adfimport.sh
cd ..
cd 39
sh _adfimport.sh
cd ..
```
And there I had it: six tables containing all the data for my area of interest, merged together spatially but separated by topic in my PostGIS database. There will of course be some more data processing steps following this, but I wanted to take the opportunity to concisely write down the process of batch importing and merging a large number of shape files into a PostGIS database – both for my own records and in the hope that it will help people facing this or a similar task in the future.