I’m pretty sure everything I’m writing here is true for most other RDBMSs too, but since I’m currently using PostgreSQL I had the chance to test it and show some hard figures.
The problem to solve is actually a common one and rather easy to solve conceptually: take an attribute stored in table2 and write it into the corresponding rows of table1, using a common id to join them. The straightforward (and almost direct) translation into SQL is therefore:
UPDATE schema.table1 a SET attribute = b.attribute FROM schema.table2 b WHERE a.id = b.id;
There’s nothing wrong with that statement and it’s going to do exactly what’s intended. Only, it’s not very clever, and hence not very performant. This obviously only matters if your tables are on the bigger end. In my case table1 (the one to update) has 576,806 rows, while table2 (the one providing the attribute data) has a whopping 848,664,485 rows. I should also mention that table2 contains multiple rows for each corresponding row in table1. In this concrete case table2 contains data about point locations (latitude, longitude, timestamp) of people whose attributes (age, gender, etc.) are stored in table1. And there is this one attribute which is wrongly and inefficiently stored with every point location, even though it only depends on the person and should hence be stored in table1.
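To make the setup a bit more concrete, here is a minimal sketch of what the two tables might look like. Only id, attribute and the index names that appear in the query plans below are taken from the post; all other column names and the types are hypothetical and purely for illustration:

CREATE TABLE schema.table1 (            -- one row per person: 576,806 rows
    id        bigint PRIMARY KEY,       -- indexed as pkey_id_schema_table1 in the plans below
    age       integer,                  -- hypothetical person attributes
    gender    text,
    attribute smallint                  -- the column the UPDATE is supposed to fill
);

CREATE TABLE schema.table2 (            -- one row per point location: 848,664,485 rows
    id          bigint NOT NULL,        -- the common id pointing back to table1
    latitude    double precision,
    longitude   double precision,
    recorded_at timestamptz,
    attribute   smallint                -- redundantly stored here although it only depends on the person
);

CREATE INDEX idx_id_schema_table2 ON schema.table2 (id);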
Now, let’s have a look at the query plan PostgreSQL suggests, to see how inefficient it really is:
maindb=# EXPLAIN UPDATE schema.table1 a SET attribute = b.attribute FROM schema.table2 b WHERE a.id = b.id;
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Update on table1 a  (cost=16652.14..55456793.21 rows=848661632 width=58)
   ->  Hash Join  (cost=16652.14..55456793.21 rows=848661632 width=58)
         Hash Cond: (b.id = a.id)
         ->  Seq Scan on table2 b  (cost=0.00..35284427.32 rows=848661632 width=14)
         ->  Hash  (cost=9442.06..9442.06 rows=576806 width=48)
               ->  Seq Scan on table1 a  (cost=0.00..9442.06 rows=576806 width=48)
(6 rows)
Yep, that looks busy and not like it’s going to finish fast… I actually did a timed test run with only the first 99 rows of table1 (a.id < 100). Notice that the query plan changed slightly to account for this limitation:
maindb=# EXPLAIN ANALYZE UPDATE schema.table1 a SET attribute = b.attribute FROM schema.table2 b WHERE a.id = b.id AND a.id < 100;
                                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on table1 a  (cost=12087.12..46612007.77 rows=282887701 width=58) (actual time=1038204.152..1038204.152 rows=0 loops=1)
   ->  Hash Join  (cost=12087.12..46612007.77 rows=282887701 width=58) (actual time=403911.498..1037959.349 rows=143228 loops=1)
         Hash Cond: (b.id = a.id)
         ->  Seq Scan on table2 b  (cost=0.00..35284427.32 rows=848661632 width=14) (actual time=403865.851..974695.516 rows=848664485 loops=1)
         ->  Hash  (cost=9683.76..9683.76 rows=192269 width=48) (actual time=45.503..45.503 rows=96 loops=1)
               Buckets: 32768  Batches: 1  Memory Usage: 6kB
               ->  Bitmap Heap Scan on table1 a  (cost=3606.40..9683.76 rows=192269 width=48) (actual time=45.458..45.470 rows=96 loops=1)
                     Recheck Cond: (id < 100)
                     ->  Bitmap Index Scan on pkey_id_schema_table1  (cost=0.00..3558.33 rows=192269 width=0) (actual time=45.446..45.446 rows=96 loops=1)
                           Index Cond: (id < 100)
 Total runtime: 1038204.200 ms
(11 rows)
More than 17 minutes is a long time to update 99 rows. Too long! Luckily, salvation is around the corner. It comes in a very slick and efficient form: a subquery. Often overlooked, it can improve query (and update) statements like ours significantly, without making the code the least bit more complicated:
UPDATE schema.table1 a
SET attribute = (
    SELECT attribute
    FROM schema.table2 b
    WHERE a.id = b.id
    LIMIT 1
);
The LIMIT 1 is owed to the aforementioned fact that table2 contains multiple rows for each row in table1. It’s necessary because the SET clause expects a single value, so we have to make sure the subquery returns exactly one record. And this is how it performs in the real-world example:
maindb=# EXPLAIN ANALYZE UPDATE schema.table1 a SET attribute = (SELECT attribute FROM schema.table2 b WHERE a.id = b.id LIMIT 1) WHERE a.id < 100;
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on table1 a  (cost=3606.80..796109.68 rows=192321 width=50) (actual time=206.382..206.382 rows=0 loops=1)
   ->  Bitmap Heap Scan on table1 a  (cost=3606.80..796109.68 rows=192321 width=50) (actual time=81.675..156.959 rows=96 loops=1)
         Recheck Cond: (id < 100)
         ->  Bitmap Index Scan on pkey_id_schema_table1  (cost=0.00..3558.72 rows=192321 width=0) (actual time=0.025..0.025 rows=192 loops=1)
               Index Cond: (id < 100)
         SubPlan 1
           ->  Limit  (cost=0.00..4.09 rows=1 width=2) (actual time=1.632..1.632 rows=1 loops=96)
                 ->  Index Scan using idx_id_schema_table2 on table2 b  (cost=0.00..6428.10 rows=1572 width=2) (actual time=1.631..1.631 rows=1 loops=96)
                       Index Cond: (a.id = id)
 Total runtime: 206.414 ms
(10 rows)
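A quick aside on the LIMIT 1 before looking at the numbers: if the subquery returned more than one row, PostgreSQL would abort the whole UPDATE with “ERROR: more than one row returned by a subquery used as an expression”. In our case it doesn’t matter which of a person’s rows gets picked, because the attribute is supposed to be identical in all of them; if it could differ, you’d want an ORDER BY to make the pick deterministic. A hypothetical variant, assuming the recorded_at column from the sketch above:

UPDATE schema.table1 a
SET attribute = (
    SELECT attribute
    FROM schema.table2 b
    WHERE a.id = b.id
    ORDER BY b.recorded_at DESC   -- deterministically pick the most recent point location
    LIMIT 1
);

Be aware that with the ORDER BY the planner can no longer stop at the first matching index entry, so this only stays fast with a suitable index, e.g. on (id, recorded_at). With that out of the way, back to the numbers.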
Wow, we’re down to 206 milliseconds now. That’s just 0.01988% of the previous execution time, or a speed-up by a factor of roughly 5030! That can definitely be labeled “performance tweaking”. So next time you have a query involving two large tables, think about employing a subquery. This works with both SELECTs and (as shown) UPDATEs, but should also work with DELETEs (see the sketch below) – please let me know in the comments if you have any experiences here.
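For completeness, here is what the same pattern might look like for a DELETE. This is a hypothetical, unbenchmarked sketch that removes all rows from table1 that have no matching point locations in table2 at all, again letting a correlated subquery (and the index on table2.id) do the per-row work:

-- Hypothetical sketch: delete persons without any point locations in table2.
DELETE FROM schema.table1 a
WHERE NOT EXISTS (
    SELECT 1
    FROM schema.table2 b
    WHERE b.id = a.id
);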