
Optimizing for Merge Join


In an earlier post, join operations were introduced, followed by a closer look at hash joins. The other join operator, merge join, is worth considering when a hash join spills to disk, since a spill wastes both time and resources. One approach is to optimize the query for a merge join. In this post, a query was optimized and tested for merge join using sorted subqueries and a specific projection.

Background

The purpose of optimizing for a merge join is to avoid situations where a join spills to disk. These events can be identified by looking at execution events for JOIN_SPILLED. A merge join never spills to disk because the sorted inputs are streamed through memory rather than built into an in-memory hash table, so the memory allocated to the query is always sufficient. A merge join is always more memory-efficient than a hash join, but not necessarily faster; on a small data set, a hash join may finish sooner, although it will still use more memory.
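As a starting point, spilled joins can be queried from the QUERY_EVENTS system table. A minimal sketch, assuming the standard v_monitor schema:

-- Find queries whose joins spilled to disk
SELECT event_timestamp,
       session_id,
       event_description
FROM v_monitor.query_events
WHERE event_type = 'JOIN_SPILLED'
ORDER BY event_timestamp DESC;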

To process a merge join, the participating projections must be sorted on the join key, or the inputs must be sorted on the join key in subqueries. This allows matched values to be streamed through memory rather than built out as a table. In addition, the join must be an INNER JOIN.

The data sets involved in the join must be sorted for a merge join to be possible. This can be accomplished through a sorted projection or a sorted subquery. Note that in a projection, placing the join key first in the ORDER BY clause may hurt the sorting and encoding choices for the other columns in the projection. With a subquery, the data is sorted at run time and then passed to the join.
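As a sketch, a fact-table projection sorted on the join key could be declared as follows; the projection name is hypothetical and the column list is abbreviated, using the VMart tables tested later in this post:

-- Hypothetical fact projection sorted on the join key
CREATE PROJECTION online_sales.online_sales_fact_mj
AS SELECT online_page_key,
          sale_date_key,
          sales_quantity,
          sales_dollar_amount
   FROM online_sales.online_sales_fact
   ORDER BY online_page_key
   SEGMENTED BY HASH(online_page_key) ALL NODES;

A new projection must be refreshed (for example, with START_REFRESH) before the optimizer can use it.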

With both data sets sorted, the matched values are streamed through memory. As the sides are compared and a match is found, a tuple is created. After moving through the sorted data and exhausting all possibilities on one side of the join, no further matching is required.

A merge join can be further optimized when the query contains an equality predicate (a direct comparison). If the predicate column precedes the join key in the projection's sort order, the predicate is applied before the join. This optimization, predicate pushdown, is also possible in a hash join; the goal is to apply the equality predicate on the involved tables so that less data is passed into the join.
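For example, an equality predicate on the dimension table could be pushed down ahead of the join (the filter value is illustrative):

-- The page_type predicate is applied before the join
SELECT f.sales_dollar_amount, d.page_description
FROM online_sales.online_sales_fact f
JOIN online_sales.online_page_dimension d
     ON f.online_page_key = d.online_page_key
WHERE d.page_type = 'monthly';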

For the join to be processed locally, one of the involved projections must be replicated on all nodes, or the projections must be identically segmented. With a replicated projection, a full copy of the data is available on each node, so every record in the associated segmented projection can find its match locally.
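A replicated dimension projection could be declared like this (again a sketch, with a hypothetical projection name):

-- Hypothetical dimension projection replicated on all nodes
CREATE PROJECTION online_sales.online_page_dimension_mj
AS SELECT online_page_key,
          page_description,
          page_type
   FROM online_sales.online_page_dimension
   ORDER BY online_page_key
   UNSEGMENTED ALL NODES;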

Test

This test uses the VMart example database and joins the online_sales_fact table to online_page_dimension on online_page_key. The superprojections were replaced through a comprehensive design from Database Designer. The projections deployed by Database Designer are not sorted on the join key.
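A representative form of the test query (the column list is illustrative):

SELECT f.sales_quantity, f.sales_dollar_amount, d.page_description
FROM online_sales.online_sales_fact f
JOIN online_sales.online_page_dimension d
     ON f.online_page_key = d.online_page_key;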

With the base projections, the optimizer chose a hash join.
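A representative EXPLAIN fragment for that plan would look roughly like this (the operator layout is from memory; costs and row counts are elided):

Access Path:
+-JOIN HASH [Cost: ..., Rows: ...] (PATH ID: 1) Inner (BROADCAST)
|  Join Cond: (f.online_page_key = d.online_page_key)
| +-- Outer -> STORAGE ACCESS for f [Cost: ..., Rows: ...] (PATH ID: 2)
| +-- Inner -> STORAGE ACCESS for d [Cost: ..., Rows: ...] (PATH ID: 3)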

The test platform is a 3-node VM cluster. Statistics were updated after projection creation. Output is directed to /dev/null, timing is enabled, and only the "All rows formatted" time is recorded.
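In vsql, that setup corresponds to something like the following:

-- Refresh statistics for the two tables involved
SELECT ANALYZE_STATISTICS('online_sales.online_sales_fact');
SELECT ANALYZE_STATISTICS('online_sales.online_page_dimension');
-- Discard result output and report elapsed times
\o /dev/null
\timing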

Quick & Dirty

The quick & dirty approach is to sort the fact and dimension tables on the join key in subqueries:
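A sketch of that pattern, reusing the illustrative column list from above:

SELECT f.sales_quantity, f.sales_dollar_amount, d.page_description
FROM (SELECT * FROM online_sales.online_sales_fact
      ORDER BY online_page_key) f
JOIN (SELECT * FROM online_sales.online_page_dimension
      ORDER BY online_page_key) d
     ON f.online_page_key = d.online_page_key;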

There are additional steps to complete the sort, but Vertica is then able to perform a merge join.
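An illustrative plan fragment shows the sort steps feeding the merge join (again, costs and row counts are elided):

Access Path:
+-JOIN MERGEJOIN(inputs presorted) [Cost: ..., Rows: ...] (PATH ID: 1)
|  Join Cond: (f.online_page_key = d.online_page_key)
| +-- Outer -> SORT [Cost: ..., Rows: ...] (PATH ID: 2)
| |      Order: f.online_page_key ASC
| +-- Inner -> SORT [Cost: ..., Rows: ...] (PATH ID: 3)
| |      Order: d.online_page_key ASC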

Specific Projection

Typically, when optimizing for a merge join, projections should have the join key as the first sorted column. Vertica will also perform a merge join if the join key is second in the sort order, following the column used in the equality predicate [1].

For the specific projection test, Database Designer was used with the base example query. The proposed projections were both sorted on the join key (online_page_key); the fact table projection was segmented, while the dimension table projection was unsegmented.

With the specific projections in place, the plan still used a hash join, which the optimizer perceived as the more efficient option.

To force a merge join, the optimizer was told to ignore the base projections in favor of the new ones.
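One way to do this is the PROJS hint, which restricts the optimizer to the named projections; treat the hint's availability as an assumption, since it depends on the Vertica version, and the projection names below are the hypothetical ones from earlier:

-- Restrict the optimizer to the merge-join-friendly projections
SELECT /*+ PROJS('online_sales.online_sales_fact_mj',
                 'online_sales.online_page_dimension_mj') */
       f.sales_quantity, f.sales_dollar_amount, d.page_description
FROM online_sales.online_sales_fact f
JOIN online_sales.online_page_dimension d
     ON f.online_page_key = d.online_page_key;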

This finally produces a plan using a merge join.
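The key line, again illustrative, is the merge join operator with no preceding SORT steps, since the projections are already sorted on the join key:

Access Path:
+-JOIN MERGEJOIN(inputs presorted) [Cost: ..., Rows: ...] (PATH ID: 1)
|  Join Cond: (f.online_page_key = d.online_page_key)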

Results

All times are in milliseconds (the vsql "All rows formatted" time).

Test   Base Projections   Base Projections + Sorted Subquery   Specific Projection
1      112680.006         114817.768                            107569.089
2      116154.236         119735.560                            108465.071
3      111775.487         121070.775                            108120.184
4      108143.217         115558.792                            108331.512
5      112770.930         114234.563                            114197.235
Avg    112304.775         117083.492                            109336.618

In these tests, the specific projection improved performance by less than 3%. The improvement would be more substantial if the entire data set weren't being materialized, since most of the elapsed time here is spent formatting and returning rows rather than joining. The base projections with sorted subqueries had the worst performance due to the additional steps required to sort the data sets.

As mentioned in the background, Vertica can typically process hash joins quickly and efficiently. Optimizing for a merge join matters mainly in situations where a join spills to disk. If a query spills to disk often enough, a specific projection might be beneficial. In the meantime, passing the ENABLE_JOIN_SPILL hint to the execution engine may temporarily remedy the situation.
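That option can be set for the current session with ADD_VERTICA_OPTIONS (a sketch based on commonly cited usage):

-- Allow joins in this session to spill to disk instead of failing
SELECT ADD_VERTICA_OPTIONS('EE', 'ENABLE_JOIN_SPILL');
-- Run the affected query, then clear the option
SELECT CLEAR_VERTICA_OPTIONS('EE', 'ENABLE_JOIN_SPILL');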

Documentation

[1] Optimizing for Merge Joins

About the author

Norbert Krupa

Norbert is the founder of vertica.tips and a Solutions Engineer at Talend. He is an HP Accredited Solutions Expert for Vertica Big Data Solutions. He has written the Vertica Diagnostic Queries, which aim to cover monitoring, diagnostics, and performance tuning. The views, opinions, and thoughts expressed here do not represent those of the author's employer.

