I have always been a great fan and avid user of databases. They’re just so versatile, efficient, easy to use, … I found this to be true for all kinds of data, small and large, high-dimensional and low-dimensional, spatial, temporal, you name it. It was only very recently that my data seemed to have outgrown my PostgreSQL database. Not so much in size, but rather in performance.
The Power of Subqueries in PostgreSQL
I’m pretty sure everything I’m writing here is true for most other RDBMSs, too, but since I’m currently using PostgreSQL, that’s where I tested it and where the hard figures below come from.
The problem to solve is actually a common one and rather easy to solve conceptually: take attributes from table2 and store them in the corresponding rows of table1, using a common id to join them. The straightforward (and almost direct) translation into SQL is therefore:
UPDATE schema.table1 a SET attribute = b.attribute FROM schema.table2 b WHERE a.id = b.id;
There’s nothing wrong with that statement, and it’s going to do exactly what’s intended. It’s just not very clever, and hence not very performant. This obviously only matters if your tables are on the bigger end. In my case, table1 (the one to update) has 576,806 rows, while table2 (the one providing the attribute data) has a whopping 848,664,485 rows. I should also mention that table2 contains multiple rows for each corresponding row in table1: in this concrete case, table2 contains data about point locations (latitude, longitude, timestamp) of people whose attributes (age, gender, etc.) are stored in table1. And there is this one attribute which is wrongly and inefficiently stored with each point location, even though it depends only on the person and should hence live in table1.