The Power of Vectorization in R

I have always been a great fan and avid user of databases. They’re just so versatile, efficient, easy to use, … I found this to be true for all kinds of data, small and large, high-dimensional and low-dimensional, spatial, temporal, you name it. It was only very recently that my data seemed to have outgrown my PostgreSQL database. Not so much in size, but rather in performance.

For two ongoing research projects I’m working with quite large (no, I’m not going to use the “b” word!) data sets. They contain recorded, or rather synthesized, movement information of a large number of people for a variety of cities, which I’m analyzing in multiple ways. How and what for shall be the topic of one or more future articles here. To give you an idea of the dimensions we’re talking about, let me briefly introduce them to you:

Dhaka: 42,411 people
Hanoi: 58,018 people
Jakarta: 297,043 people

Each data set contains the point position of every individual at one-minute intervals over a 24-hour sampling period (plus data for trips that started before but ended after midnight). In addition, it provides a number of attributes such as the purpose of each trip and the current mode of transportation, as well as socio-demographic information like sex, age, and occupation. They have all been collected, synthesized, and made available by the great folks at the Center for Spatial Information Science (CSIS) at the University of Tokyo (and are used by me as part of joint research projects 405 and 536).

As can be easily seen, some of these data sets are quite massive, particularly Jakarta. So far these data sets had been living in my PostgreSQL database, but recently, quickly approaching deadlines forced me to do something about the increasingly long processing times I experienced with these data. I had originally done most of my analyses inside the PostgreSQL database itself, since this promised to be the most hassle-free and efficient way to do things. Yet, over time I had adopted R as my main analytic tool and noticed that I was using the database exclusively as a storage repository. So I thought, why not move the data into R as well?

As anybody who has worked with R will know, all data necessary for your computations needs to be held completely in memory. This can be both a curse and a blessing! Personally, I started by focusing on the blessing part, namely the performance gain I hoped to achieve by processing data in memory instead of in a disk-based database. The curse part, that is, how to deal with data sets too big for my main memory (a modest 32 GB), would have to wait until I had established whether this path would really bring significant improvements in performance.

In the following I will compare the two different paths to the same result, i.e. being able to analyze the data from within R: using PostgreSQL and using just R. Both require the data, which is delivered as CSV files (one per minute of the sample period, numbered from 0000 to 2359 and a bit further as needed), to be extracted from ZIP archives. Being a Mac OS user, I did that with a simple one-liner in bash:

unzip '/path/to/zip/files/dataset_*.zip' -d .

PostgreSQL

After a quick check of the integrity of the new CSV files I then needed to generate the import script for the database. This can also be done easily in bash:

(for FILE in /path/to/csv/files/dataset_*.csv; do echo "COPY data.point (pid, tno, sno, pdate, longitude, latitude, sex, age, padd, occup, purpose, magfac, magfac2, tmode) FROM '$FILE' DELIMITER ',' CSV;"; done) > import_data.sql

The database table data.point referenced in these statements obviously needs to be created prior to running the generated SQL file of import statements (together with some database-internal housekeeping like sequences and authorizations):

CREATE SCHEMA data 
  AUTHORIZATION postgres;
CREATE SEQUENCE data.point_id_seq 
  INCREMENT 1 
  MINVALUE 1 
  MAXVALUE 9223372036854775807 
  START 1 
  CACHE 1;
ALTER TABLE data.point_id_seq 
  OWNER TO postgres;
CREATE TABLE data.point 
(
  id bigint NOT NULL DEFAULT nextval('data.point_id_seq'::regclass), 
  pid integer, 
  tno smallint, 
  sno smallint, 
  pdate timestamp without time zone, 
  longitude double precision, 
  latitude double precision, 
  sex smallint, 
  age smallint, 
  padd character(8), 
  occup smallint, 
  purpose smallint, 
  magfac smallint, 
  magfac2 smallint, 
  tmode smallint, 
  geom geometry(Point, 4326), 
  dist double precision, 
  puid bigint
)
WITH (
  OIDS=FALSE
);
ALTER TABLE data.point 
  OWNER TO postgres;

This then allows us to run the series of import statements generated earlier. I measured the runtime necessary for the three data sets as follows:

Dhaka: 6 minutes
Hanoi: 7 minutes
Jakarta: 40 minutes

In order to be able to work with these point locations I also needed to populate a unique point identifier, which I named puid for “point unique id”, as well as the spatial location of each point as a PostGIS geometry, calculated from the longitude and latitude provided in the original data:

UPDATE data.point 
SET puid = id, 
  geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

Here are the run times for this query:

Dhaka: 6 minutes
Hanoi: 9 minutes
Jakarta: 50 minutes

Finally (at least for the purpose of this comparison) I wanted to calculate the straight-line distance between the point positions in their sequential order per person (pid):

UPDATE data.point t1
SET dist = 
(
  SELECT ST_Distance_Sphere(t1.geom, t2.geom) distance
  FROM data.point t2 
  WHERE t1.pid = t2.pid AND t1.tno = t2.tno AND t1.sno = t2.sno AND t2.pdate = (t1.pdate + interval '1 minute') 
);

This query results in the following execution plan (here for the Dhaka data set):

Update on point t1  (cost=0.00..65207662366.13 rows=121848040 width=78)
  ->  Seq Scan on point t1  (cost=0.00..65207662366.13 rows=121848040 width=78)
        SubPlan 1
          ->  Index Scan using idx_pid_data_point on point t2  (cost=0.00..535.13 rows=1 width=32)
                Index Cond: (t1.pid = pid)
                Filter: ((t1.tno = tno) AND (t1.sno = sno) AND (pdate = (t1.pdate + '00:01:00'::interval)))

As you can see, I had earlier also created an index on the table of point locations over the personal identifier pid (idx_pid_data_point in the plan above). Even so, the execution times were horribly long:

Dhaka: 66 hours
Hanoi: 119 hours
Jakarta: 53 hours

Apart from the inexplicable fact that calculating the segment distances for the Jakarta data set took less than half as long as for the Hanoi data set, even though it is almost 5 times as large, it is obvious that these calculation times are very, very long. There must be a better way to do this!

R

Cue R. To read the data into R, I decided it was easiest to merge the single CSV files into one large CSV file containing the complete data set. Again, I did that with a simple one-liner in bash:

cat /path/to/csv/files/dataset_*.csv > /path/to/csv/files/dataset.csv

On my system this took a little while:

Dhaka: 1 minute
Hanoi: 1.5 minutes
Jakarta: 7 minutes

In order to be able to work with such large tables in R I resorted to the data.table package. As a nice benefit, this also provides us with the fread function as the (to the best of my knowledge) fastest option to read data from disk into memory:

library(data.table)
# load data set from CSV
data <- fread("/path/to/csv/files/dataset.csv", 
              sep = ",", 
              header = FALSE, 
              na.strings = c(""), 
              stringsAsFactors = FALSE, 
              verbose = TRUE)

This generated the following output (here for the Jakarta data set):

Input contains no \n. Taking this to be a filename to open
File opened, filesize is 31.594 GB
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 'autostart') ... found ok
Found 14 columns
First row with 14 fields occurs on line 1 (either column names or first row of data)
'header' changed by user from 'auto' to FALSE
Count of eol after first data row: 430452562
Subtracted 1 for last eol and any trailing empty lines, leaving 430452561 data rows
Type codes: 11143311111111 (first 5 rows)
Type codes: 11143311111111 (+middle 5 rows)
Type codes: 11143311111111 (+last 5 rows)
Type codes: 11143311111111 (after applying colClasses and integer64)
Type codes: 11143311111111 (after applying drop or select (if supplied)
Allocating 14 column slots (14 - 0 NULL)
Read 430452561 rows and 14 (of 14) columns from 31.594 GB file in 00:06:38
   0.001s (  0%) Memory map (rerun may be quicker)
   0.001s (  0%) sep and header detection
 119.799s ( 30%) Count rows (wc -l)
   0.002s (  0%) Column type detection (first, middle and last 5 rows)
   8.044s (  2%) Allocation of 430452561x14 result (xMB) in RAM
 269.749s ( 68%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.259s (  0%) Changing na.strings to NA
 397.854s        Total

And here are the execution times:

Dhaka: 42 seconds
Hanoi: 59 seconds
Jakarta: 6 minutes

The avid observer will have picked up that the Jakarta data set amounts to 31.594 GB in total, which unfortunately is more than my current machine offers in RAM. The result: swapping, which not only slowed everything down but also led to a total crash twice during my trials. I will shed some light on how I solved this problem in a future article. For now, I will focus only on the execution performance of PostgreSQL (disk-based) versus R (memory-based).

In order to be able to process the data in the same way as described earlier in the context of the PostgreSQL database, I then gave the data some meaningful column names, added a unique point id puid, and set a key (by pid and pdate) over the data set:

# provide meaningful column names
setnames(data, c("pid", "tno", "sno", "pdate", "lon", "lat", "sex", "age", 
                 "padd", "occup", "purpose", "magfac", "magfac2", "tmode"))

# add puid as unique point identifier
data[, puid := seq_len(nrow(data))]

# order data by pid and pdate
keyColumns <- c("pid", "pdate")
setkeyv(data, keyColumns)
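
Setting the key physically sorts the table by these columns and enables fast binary-search subsetting on pid, which the for loop further below relies on. A minimal usage sketch (the pid value here is just a hypothetical placeholder):

# fast keyed subset: all rows for one person
# (4711 is a hypothetical pid; substitute any id present in the data)
one_person <- data[J(4711)]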

The execution times for these preparation steps:

Dhaka: 16 seconds
Hanoi: 22 seconds
Jakarta: 48 minutes

Again, notice the disproportionately longer calculation time for the Jakarta data set, which can be attributed to the swapping necessary because the data set is too large to fit completely in memory.

This leaves us with the last and most interesting step (and the one that I was originally planning to write about in this article): the calculation of the segment distances between the point positions of each single individual. Please remember that in PostgreSQL this took from 53 to 119 hours! I have to admit that it took me quite a while to leave my trusted for loop behind and fully embrace the concept of vectorization that makes R so powerful.

for Loop

This is the R code I used to calculate the distances (the distm function used here, as well as distHaversine further below, come from the geosphere package, which needs to be loaded with library(geosphere)):

# extract distinct pids
pids <- data[, length(puid), by = pid]

# create empty vector for point distances
distances <- numeric(0)

system.time(for(p in pids$pid) {
  # extract one person
  person <- data[J(p)]
  # order sequentially (by puid)
  setkey(person, puid)
  # extract coordinates
  coords <- person[, list(lon, lat)]
  # calculate distance matrix
  dist <- distm(coords)
  # remove first row of distance matrix (shift it one row up)
  dist <- dist[-1, ]
  # extract sequential point distances from matrix diagonal
  dist <- c(diag(dist), 0)
  # append point distances to overall point distance vector
  distances <- c(distances, dist)
})

# assign point distances to data set
data[, dist := distances]

First I needed to extract all unique pids (line 2). The code in lines 7 to 22 then represents the actual for loop. It goes through the aforementioned vector of unique pids one by one and calculates a distance matrix (line 15) between all point locations of the respective person. This results in a 1,440-by-1,440 matrix (remember: one position per minute, i.e. 24 hours * 60 minutes = 1,440 rows). As can be easily seen, this is also highly inefficient, since we’re not interested in the distance matrix between all the point locations, but only in the distances between consecutive point locations. To extract these I shift the distance matrix one row up (line 17). Now the distances we’re looking for are in the diagonal of the matrix and can be easily extracted and saved as a vector (line 19). All segment distances are then collected in a vector distances (line 21), which can ultimately be added to the original data.table after the calculation has finished (line 25).
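
To make the shift-and-diagonal trick more concrete, here is a tiny toy example with four made-up coordinate pairs (purely illustrative, not taken from the actual data):

library(geosphere)

# four made-up points as (lon, lat) pairs
pts <- cbind(lon = c(90.40, 90.41, 90.42, 90.43),
             lat = c(23.70, 23.71, 23.72, 23.73))

# full 4-by-4 distance matrix between all points
d <- distm(pts)

# drop the first row: the diagonal of the shifted matrix now holds
# the distances between consecutive points (1-2, 2-3, 3-4)
diag(d[-1, ])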

In the case of the Dhaka data set this takes ca. 10 hours in total.

Vectorization

Vectorization is the idea of applying a function to all elements of a vector at once instead of looping through the single elements one by one.
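
As a quick toy illustration of the principle (entirely unrelated to the movement data), compare an explicit loop with its vectorized equivalent:

# toy example: double ten million numbers
x <- runif(1e7)

# explicit loop, element by element
system.time({
  y1 <- numeric(length(x))
  for (i in seq_along(x)) y1[i] <- x[i] * 2
})

# vectorized equivalent: one call, no explicit loop
system.time(y2 <- x * 2)

# both give the same result
identical(y1, y2)

In the case of our segment distance calculation problem, this looks as follows: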

pointDistances <- function(df, lon, lat) {
  # extract point coordinates
  pts <- df[, list(lon, lat)]
  # calculate distances between shifted point positions
  dist <- distHaversine(p1 = pts[-nrow(df), ], 
                        p2 = pts[-1, ])
  # append 0 distance to last point and return
  return(c(dist, as.numeric(0)))
}

system.time(distances <- pointDistances(data, lon, lat))

# assign point distances to data set
system.time(data[, dist := distances])

The distance calculation itself lives in the function pointDistances, which I defined in lines 1 to 9. It works both on data.frames and data.tables, by extracting the columns containing the longitude and latitude information and calculating the distance between each row and the following row. Obviously the data source needs to be sorted in the correct order beforehand, as was the case for the for loop approach. The output is again saved in a vector (line 11), which is ultimately added to the original data (line 14).
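
One caveat: applied to the whole table at once, this also computes a distance between each person’s last point and the next person’s first point, whereas the for loop assigned 0 to each person’s last point. If those boundary values matter for the analysis, they can be reset afterwards, for example with a short data.table expression like this sketch:

# set dist to 0 on each person's last row
# (.I[.N] returns the global row index of the last row per pid)
data[data[, .I[.N], by = pid]$V1, dist := 0]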

For the Dhaka data set the execution took only 68 seconds in total! Please remember, using a traditional for loop it took 10 hours. So the vectorized approach took only about 0.18% of the time the for loop approach needed! Or, in other words, the for loop took 552 times as long as the vectorized code. Granted, it took me some time to understand, accept, and implement the concept of vectorized calculations in R, but seeing this result, it was well worth it.
