
Merge vs Correlated Join


This post examines the performance of MERGE versus a correlated UPDATE statement under different projection designs and load methods. Both statements are used to update values in a target table based on values from a source table.

Background

The MERGE operation has the flexibility to perform INSERT, UPDATE, and DELETE in a single operation, but it has more explicit conditions for an optimized plan. The UPDATE operation typically updates values in a single table, but it can be joined with another table to pull in that table's values. Both operations use WOS (memory) by default and overflow to ROS (disk) when the WOS is full, and both accept the /*+direct*/ hint to write data directly to ROS, bypassing WOS.
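For illustration, the hint is placed immediately after the statement keyword. A minimal sketch, using hypothetical tables t and s:

  -- Default behavior: writes land in WOS first and spill to ROS as needed.
  UPDATE t SET amount = 0 WHERE id = 1;

  -- With the hint, writes bypass WOS and go straight to ROS.
  UPDATE /*+direct*/ t SET amount = 0 WHERE id = 1;
  MERGE /*+direct*/ INTO t USING s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET amount = s.amount;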

MERGE Limitations
  • An optimized query plan for a MERGE statement assumes that the data you are trying to merge conforms with any unique or primary key constraints you declare.
  • You cannot run a MERGE operation on identity/auto-increment columns or on columns that have primary key or foreign key referential integrity constraints, as defined in CREATE TABLE Column-Constraint syntax.
  • You cannot use MERGE with unstructured tables.
  • If you run the MERGE statement more than once using the same target and source tables, you could introduce duplicate values into the join columns, such as when using constants in the UPDATE/INSERT clauses.
UPDATE Limitations
  • The table you specify in the UPDATE list cannot also appear in the FROM list (no self joins).
  • If the joins specified in the WHERE predicate produce more than one copy of the row in the table to be updated, the new value of the row in the table is chosen arbitrarily.
  • You cannot UPDATE columns that have primary key or foreign key referential integrity constraints.

Sample Data

The data used in this test comes from Vertica’s VMart example database. The test copies the online_sales_fact and product_dimension tables to a test schema. These tables were chosen because they have an appropriate relationship and an ample number of rows to join (5,000,000 and 60,000 rows, respectively).

For each test, the tables are dropped, copied directly into ROS containers, and statistics are updated through scripted queries; a sketch follows below. The UPDATE and MERGE statements are run directly from vsql. For tests 2 and 3, projections are created before data is loaded.
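The exact scripts aren't reproduced here, but a minimal setup sketch, assuming a test schema named test and the standard VMart schema layout, might look like:

  DROP SCHEMA IF EXISTS test CASCADE;
  CREATE SCHEMA test;

  -- Copy the VMart tables directly into ROS containers.
  CREATE TABLE test.online_sales_fact AS /*+direct*/
    SELECT * FROM online_sales.online_sales_fact;
  CREATE TABLE test.product_dimension AS /*+direct*/
    SELECT * FROM public.product_dimension;

  -- Update statistics on the copies.
  SELECT ANALYZE_STATISTICS('test.online_sales_fact');
  SELECT ANALYZE_STATISTICS('test.product_dimension');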

Test Cases

For illustration, the test will set online_sales_fact.ship_dollar_amount to product_dimension.product_version using UPDATE and MERGE statements:

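The original statements aren't reproduced here; reconstructed from the description, and assuming the tables join on product_key and product_version, they would look roughly like:

  -- Correlated UPDATE: join the target fact table to the dimension table.
  UPDATE test.online_sales_fact
    SET ship_dollar_amount = p.product_version
    FROM test.product_dimension p
    WHERE online_sales_fact.product_key = p.product_key
      AND online_sales_fact.product_version = p.product_version;

  -- MERGE: the same effect via a WHEN MATCHED clause.
  MERGE INTO test.online_sales_fact f
    USING test.product_dimension p
    ON f.product_key = p.product_key
   AND f.product_version = p.product_version
    WHEN MATCHED THEN UPDATE SET ship_dollar_amount = p.product_version;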

Output is directed to /dev/null, timing is enabled, and statements are not committed (each run is rolled back). Only the "All rows formatted" time reported by vsql is recorded. The test environment was Vertica v7.0.1-0 on a 3-node cluster with Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2 cores per node) and 4 GB of memory per node.
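A single timed run from vsql then looks roughly like this sketch (using the UPDATE reconstructed above):

  \timing
  \o /dev/null
  UPDATE test.online_sales_fact
    SET ship_dollar_amount = p.product_version
    FROM test.product_dimension p
    WHERE online_sales_fact.product_key = p.product_key
      AND online_sales_fact.product_version = p.product_version;
  ROLLBACK;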

Verifying Results
The tests are verified by matching the sum of ship_dollar_amount in the target table (after the statement runs) against the sum of the referenced product_version values:
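A check along these lines, again assuming the product_key/product_version join, would confirm the result:

  -- Expected: sum of product_version over the joined rows.
  SELECT SUM(p.product_version)
    FROM test.online_sales_fact f
    JOIN test.product_dimension p
      ON f.product_key = p.product_key
     AND f.product_version = p.product_version;

  -- Actual: sum of ship_dollar_amount after the (uncommitted) update.
  SELECT SUM(ship_dollar_amount) FROM test.online_sales_fact;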


Results

Test 1 – Default Super-projections
All times in milliseconds; WOS columns use the default load path, ROS columns use the /*+direct*/ hint.

Run      WOS UPDATE   WOS MERGE    ROS UPDATE   ROS MERGE
1        7605.287     22180.347    13765.504    29052.498
2        7597.496     22784.569    14192.256    29212.594
3        7653.688     22568.336    13698.258    29121.884
4        7722.940     22757.446    14081.150    29170.858
5        7641.240     23354.066    13758.392    29341.553
Average  7644.130     22728.953    13899.112    29179.877

Explain Plans: UPDATE · MERGE
Projections: online_sales_fact · product_dimension

Test 2 – DBD Run

This test contained projections created from a comprehensive run of the Database Designer.

Run      WOS UPDATE   WOS MERGE    ROS UPDATE   ROS MERGE
1        4987.765     19101.613    9164.283     22742.732
2        5005.283     18749.325    8946.121     23342.961
3        4985.061     18705.969    9636.029     22628.058
4        4939.954     18801.324    9015.907     22426.726
5        5005.469     18874.673    8934.523     23267.508
Average  4984.706     18846.581    9139.373     22881.597

Explain Plans: UPDATE · MERGE
Projections: online_sales_fact · product_dimension

Test 3 – Replicated Projection

This test adds a replicated projection to the fact table.
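A replicated (unsegmented) projection can be created roughly as follows; the actual projection name and column list used in the test weren't shown, so this is a sketch:

  CREATE PROJECTION test.online_sales_fact_rep AS
    SELECT * FROM test.online_sales_fact
    UNSEGMENTED ALL NODES;

  -- Populate the new projection.
  SELECT START_REFRESH();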

Run      WOS UPDATE   WOS MERGE    ROS UPDATE   ROS MERGE
1        7680.744     40557.086    11369.372    45296.583
2        7542.122     40535.988    11161.230    46058.861
3        7436.961     41259.577    11635.987    45660.745
4        7613.111     40997.063    11160.288    46172.455
5        7593.479     40352.096    11186.697    45408.047
Average  7573.283     40740.362    11302.715    45719.338

Explain Plans: UPDATE · MERGE
Projections: online_sales_fact · product_dimension

Cleanup
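
The cleanup script isn't reproduced here; dropping the test schema is enough to remove the copied tables and their projections:

  DROP SCHEMA test CASCADE;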

 

Conclusion

Throughout the tests, the EXPLAIN plan for the MERGE indicated that no statistics were present; it would appear that statistics aren't used for the MERGE operation. In every test, MERGE took roughly two to five times as long as the corresponding UPDATE. In our case, it doesn't appear the MERGE statement met the conditions required for an optimized MERGE as detailed in the documentation, so Vertica prepared a non-optimized query plan. Further, there are design considerations for optimizing projections for MERGE that weren't followed, and following them may have produced a better plan; even so, MERGE would likely still have been more expensive. MERGE is simply not fit for this type of basic operation and should be left for cases that genuinely require INSERT, UPDATE, and DELETE in a single statement.

Leaving MERGE aside, the optimized projections created by a comprehensive Database Designer run in the second test decreased the UPDATE time by roughly 35% (2,659 ms). The use of the /*+direct*/ hint was purely exploratory; it nearly doubled UPDATE times and added roughly 20-30% to MERGE times. In the third test, the added query-specific projection roughly doubled MERGE times from the second test and increased UPDATE times as well, with noticeably more complex plans. In a traditional star schema, the dimension tables may be replicated on all nodes and the fact table segmented. The optimal design for this experiment was a single segmented projection on the fact table and a single replicated projection on the dimension table (the target of the UPDATE was the fact table). It's unlikely that an UPDATE would be performed on a fact table in a production environment, but it was used here to show timings on a large record set.


About the author

Norbert Krupa

Norbert is the founder of vertica.tips and a Solutions Engineer at Talend. He is an HP Accredited Solutions Expert for Vertica Big Data Solutions, and has written the Vertica Diagnostic Queries, which aim to cover monitoring, diagnostics, and performance tuning. The views, opinions, and thoughts expressed here do not represent those of the author's employer.
