I have always been a great fan and avid user of databases. They’re just so versatile, efficient, easy to use, … I found this to be true for all kinds of data, small and large, high-dimensional and low-dimensional, spatial, temporal, you name it. It was only very recently that my data seemed to have outgrown my PostgreSQL database. Not so much in size, but rather in performance.
For two ongoing research projects I’m working with quite large (no, I’m not going to use the “b” word!) data sets. They contain recorded, or rather synthesized, movement information of a large number of people for a variety of cities, which I’m analyzing in multiple ways. The how and what for will be the topic of one or more future articles here. To give you an idea of the dimensions we’re talking about, let me briefly introduce the data sets:
Dhaka: 42,411 people
Hanoi: 58,018 people
Jakarta: 297,043 people
Every data set contains the point position of each individual in one-minute intervals over a 24-hour sampling period (plus data for trips that started before but ended after midnight). In addition, it provides a number of attributes such as the purpose of each trip and the current mode of transportation, as well as socio-demographic information like sex, age, and occupation. The data have all been collected, synthesized, and made available by the great folks at the Center for Spatial Information Science (CSIS) at the University of Tokyo (and are used by me as part of joint research projects 405 and 536).
As can easily be seen, some of these data sets are quite massive, particularly Jakarta. So far these data sets had been living in my PostgreSQL database, but rapidly approaching deadlines recently forced me to do something about the increasingly long processing times I experienced with these data. I had originally run most of my analyses inside the PostgreSQL database itself, since this promised to be the most hassle-free and efficient way to do things. Yet, over time I had adopted R as my main analytic tool and noticed that I was using the database exclusively as a storage repository. So I thought, why not move the data into R as well?
As anybody who has worked with R will know, all data necessary for your computations needs to be held completely in memory. This can be both a curse and a blessing! Personally, I started by focusing on the blessing part, namely the performance gain I hoped to achieve by processing data in memory instead of in a disk-based database. The curse part, i.e. how to deal with data sets too big for my (modest 32 GB of) main memory, would have to wait until I had established whether this path would really bring a significant improvement in performance.
In the following I will compare the two different paths of getting to the same result, i.e. being able to analyze the data from within R: using PostgreSQL and using just R. Both require the data, which is delivered as CSV files (one per minute of the sample period, numbered from 0000 to 2359 and a bit further as needed), to be extracted from ZIP archives. Being a Mac OS user I did that with a simple one-liner on the bash:
unzip '/path/to/zip/files/dataset_*.zip' -d .
PostgreSQL
After a quick check of the integrity of the new CSV files I then needed to generate the import script for the database. This can also be done easily on the bash:
(for FILE in /path/to/csv/files/dataset_*.csv; do echo "COPY data.point (pid, tno, sno, pdate, longitude, latitude, sex, age, padd, occup, purpose, magfac, magfac2, tmode) FROM '$FILE' DELIMITER ',' CSV;"; done) > import_data.sql
The database table data.point referenced in these import statements obviously needs to be created prior to running the SQL file of import statements generated by the for loop in the code above (together with some database-internal housekeeping like sequences and authorizations):
CREATE SCHEMA data AUTHORIZATION postgres;

CREATE SEQUENCE data.point_id_seq
  INCREMENT 1
  MINVALUE 1
  MAXVALUE 9223372036854775807
  START 1
  CACHE 1;
ALTER TABLE data.point_id_seq OWNER TO postgres;

CREATE TABLE data.point (
  id bigint NOT NULL DEFAULT nextval('data.point_id_seq'::regclass),
  pid integer,
  tno smallint,
  sno smallint,
  pdate timestamp without time zone,
  longitude double precision,
  latitude double precision,
  sex smallint,
  age smallint,
  padd character(8),
  occup smallint,
  purpose smallint,
  magfac smallint,
  magfac2 smallint,
  tmode smallint,
  geom geometry(Point, 4326),
  puid bigint,
  dist double precision  -- filled by the distance calculation further below
) WITH (
  OIDS=FALSE
);
ALTER TABLE data.point OWNER TO postgres;
This then allows us to run the series of import statements generated earlier. I measured the runtime necessary for the three data sets as follows:
Dhaka: 6 minutes
Hanoi: 7 minutes
Jakarta: 40 minutes
In order to be able to work with these point locations I also needed to create a new unique point identifier, which I named puid for “point unique id”, as well as their spatial location as a PostGIS geometry, calculated from the latitude and longitude provided in the original data:
UPDATE data.point
SET puid = id,
    geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
Here are the run times for this query:
Dhaka: 6 minutes
Hanoi: 9 minutes
Jakarta: 50 minutes
Finally (at least for the purpose of this comparison) I wanted to calculate the straight-line distance between the point positions in their sequential order per person (pid):
UPDATE data.point t1
SET dist = (
    SELECT ST_Distance_Sphere(t1.geom, t2.geom) distance
    FROM data.point t2
    WHERE t1.pid = t2.pid
      AND t1.tno = t2.tno
      AND t1.sno = t2.sno
      AND t2.pdate = (t1.pdate + interval '1 minute')
);
This query results in the following execution plan (here for the Dhaka data set):
Update on point t1  (cost=0.00..65207662366.13 rows=121848040 width=78)
  ->  Seq Scan on point t1  (cost=0.00..65207662366.13 rows=121848040 width=78)
        SubPlan 1
          ->  Index Scan using idx_pid_data_point on point t2  (cost=0.00..535.13 rows=1 width=32)
                Index Cond: (t1.pid = pid)
                Filter: ((t1.tno = tno) AND (t1.sno = sno) AND (pdate = (t1.pdate + '00:01:00'::interval)))
As you can see, I had earlier also created an index (idx_pid_data_point in the plan above) over the table of point locations using the personal identifier pid. Even so, the execution times were horribly long:
Dhaka: 66 hours
Hanoi: 119 hours
Jakarta: 53 hours
Apart from the inexplicable fact that calculating the segment distances for the Jakarta data set took less than half as long as for the Hanoi data set, even though it is almost five times as large, it is obvious that these calculation times are very, very long. There must be a better way to do this!
R
Cue R. For the data to be read into R I decided it was easiest to merge the single CSV files into one large CSV file containing the complete data set. Again, I did that with a simple one-liner on the bash:
cat /path/to/csv/files/dataset_*.csv > /path/to/csv/files/dataset.csv
On my system this took a little while:
Dhaka: 1 minute
Hanoi: 1.5 minutes
Jakarta: 7 minutes
In order to be able to work with such large tables in R I resorted to the data.table package. As a nice benefit this also provides us with the fread function as the (to the best of my knowledge) fastest option to read data from disk into memory:
library(data.table)

# load data set from CSV
data <- fread("/path/to/csv/files/dataset.csv",
              sep = ",",
              header = FALSE,
              na.strings = c(""),
              stringsAsFactors = FALSE,
              verbose = TRUE)
This generated the following output (here for the Jakarta data set):
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 31.594 GB
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 'autostart') ... found ok
Found 14 columns
First row with 14 fields occurs on line 1 (either column names or first row of data)
'header' changed by user from 'auto' to FALSE
Count of eol after first data row: 430452562
Subtracted 1 for last eol and any trailing empty lines, leaving 430452561 data rows
Type codes: 11143311111111 (first 5 rows)
Type codes: 11143311111111 (+middle 5 rows)
Type codes: 11143311111111 (+last 5 rows)
Type codes: 11143311111111 (after applying colClasses and integer64)
Type codes: 11143311111111 (after applying drop or select (if supplied))
Allocating 14 column slots (14 - 0 NULL)
Read 430452561 rows and 14 (of 14) columns from 31.594 GB file in 00:06:38
   0.001s (  0%) Memory map (rerun may be quicker)
   0.001s (  0%) sep and header detection
 119.799s ( 30%) Count rows (wc -l)
   0.002s (  0%) Column type detection (first, middle and last 5 rows)
   8.044s (  2%) Allocation of 430452561x14 result (xMB) in RAM
 269.749s ( 68%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.259s (  0%) Changing na.strings to NA
 397.854s        Total
And here are the execution times:
Dhaka: 42 seconds
Hanoi: 59 seconds
Jakarta: 6 minutes
The avid observer will have picked up that the Jakarta data set amounts to 31.594 GB in total, which unfortunately is more than my current machine offers in RAM. The result: swapping. This not only slowed everything down, but also led to a total crash twice during my trials. I will shed some light on how I solved this problem in a future article. For now, I will focus only on the execution performance of PostgreSQL (disk-based) versus R (memory-based).
In order to be able to process the data in the same way as described earlier for the PostgreSQL database, I then also gave the data meaningful column names and a unique point id puid, and set a key (data.table’s counterpart to an index) over the data set:
# provide meaningful column names
setnames(data, c("pid", "tno", "sno", "pdate", "lon", "lat", "sex", "age",
                 "padd", "occup", "purpose", "magfac", "magfac2", "tmode"))

# add puid as unique point identifier
data[, puid := seq_len(nrow(data))]

# order data by pid and pdate
keyColumns <- c("pid", "pdate")
setkeyv(data, keyColumns)
The corresponding execution times:
Dhaka: 16 seconds
Hanoi: 22 seconds
Jakarta: 48 minutes
Again, notice the disproportionately longer calculation time for the Jakarta data set, which can be attributed to the swapping necessary because the data set is too large to fit completely in memory.
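If you want to check in advance how tight things will get, R can report how much space a loaded table actually occupies. A minimal sketch, assuming the data.table data from above has already been loaded:

library(data.table)  # for tables()

# report the in-memory size of the loaded table
print(object.size(data), units = "auto")

# data.table's own overview of all tables currently held in memory
tables()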
This leaves us with the last and most interesting step (and the one that I was originally planning to write about in this article): the calculation of the segment distances between the point positions of each single individual. Please remember that in PostgreSQL this took from 53 to 119 hours! I have to admit that it took me quite a while to leave my trusted for loop behind and fully embrace the concept of vectorization that makes R so powerful.
for Loop
This is the R code I used to calculate the distances; the distance function distm used below comes from the geosphere package:
# distance calculations use the geosphere package
library(geosphere)

# extract distinct pids
pids <- data[, length(puid), by = pid]

# create empty vector for point distances
distances <- numeric(0)

system.time(for(p in pids$pid) {
  # extract one person
  person <- data[J(p)]
  # order sequentially (by puid)
  setkey(person, puid)
  # extract coordinates
  coords <- person[, list(lon, lat)]
  # calculate distance matrix
  dist <- distm(coords)
  # remove first row of the distance matrix (i.e. shift it up by one row)
  dist <- dist[-1, ]
  # extract sequential point distances from matrix diagonal
  dist <- c(diag(dist), 0)
  # append point distances to overall point distance vector
  distances <- c(distances, dist)
})

# assign point distances to data set
data[, dist := distances]
First I needed to extract all unique pids. The for loop then goes through this vector of unique pids one by one and calculates a distance matrix between all point locations of the respective person. This results in a 1,440-by-1,440 matrix (remember: one position per minute, i.e. 24 hours * 60 minutes = 1,440 rows). As can easily be seen this is also highly inefficient, since we’re not interested in the distances between all point locations, but only in the distances between consecutive point locations. To extract these I shift the distance matrix one row up by dropping its first row. The distances we’re looking for now sit on the diagonal of the shifted matrix and can easily be extracted and saved as a vector. The segment distances of all persons are collected in the vector distances, which is ultimately added to the original data.table after the calculation has finished.
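To make the shift-and-diagonal trick more tangible, here is a minimal, self-contained sketch with four made-up point positions (the coordinates are purely illustrative and not taken from any of the data sets):

library(geosphere)

# four consecutive point positions as lon/lat pairs (made-up values)
pts <- matrix(c(106.80, -6.20,
                106.81, -6.20,
                106.81, -6.21,
                106.82, -6.21),
              ncol = 2, byrow = TRUE)

# full 4-by-4 distance matrix: d[i, j] is the distance between point i and point j
d <- distm(pts)

# dropping the first row shifts the matrix up by one row; the diagonal of the
# shifted matrix now holds the consecutive distances d[2, 1], d[3, 2] and d[4, 3]
diag(d[-1, ])

For a full 24-hour track the same three steps yield the 1,439 consecutive distances, to which a trailing 0 is appended for the last point.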
In the case of the Dhaka data set this takes ca. 10 hours in total.
Vectorization
Vectorization means applying a function to all elements of a vector at once instead of looping through the single elements one by one. In the case of our segment distance calculation problem this looks as follows:
pointDistances <- function (df, lon, lat) {
  # extract point coordinates
  pts <- df[, list(lon, lat)]
  # calculate distances between shifted point positions
  dist <- distHaversine(p1 = pts[-nrow(df), ], p2 = pts[-1, ])
  # append 0 distance to last point and return
  return(c(dist, as.numeric(0)))
}

system.time(distances <- pointDistances(data, lon, lat))

# assign point distances to data set
system.time(data[, dist := distances])
The distance calculation itself lives in the function pointDistances. It works on both data.frames and data.tables by extracting the columns containing the longitude and latitude information and calculating the distance between each row and the following row. Obviously the data source needs to be sorted in the correct order beforehand, as was the case for the for loop approach. The output is again saved in a vector, which is ultimately added to the original data.
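One caveat worth noting: applied to the full table in one go, pointDistances also produces a (meaningless) distance between the last point of one person and the first point of the next person, whereas the for loop assigned a 0 there. If that matters for the analysis, a grouped variant restores the per-person behaviour. A sketch, assuming data is keyed by pid and pdate as above:

# apply the distance function per person; the trailing 0 is now
# appended for every pid, just like in the for loop version
system.time(
  data[, dist := pointDistances(.SD, lon, lat), by = pid]
)

The timing reported below is for the whole-table call shown above.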
For the Dhaka data set the execution took only 68 seconds in total! Please remember, using a traditional for loop it took 10 hours. So the vectorized approach took only 0.18% of the time of the for loop approach! Or, in other words, the for loop took 552 times as long as the vectorized code. Granted, it took me some time to understand, accept, and implement the concept of vectorized calculations in R, but seeing this result it was well worth my time.