Optimization, Performance Tuning

Hash Join Operator


My earlier post, Overview of Join Operations, discussed join operations at a high level. This post will expand on the hash join.

The hash join is the most common join type used in join operations. It does not require any optimization in order to be used. Generally, the hash join is efficient and can be as fast as a merge join if it has the necessary resources. However, if it does not have the necessary resources, it can spill to disk and cause unnecessary overhead.
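As a quick sketch (the table and column names below are hypothetical), running EXPLAIN on a typical fact-to-dimension join shows whether the optimizer chose a hash join; look for a hash join step (e.g., JOIN HASH) in the plan output:

  EXPLAIN
  SELECT f.sale_id, d.customer_name
  FROM fact_sales f
  JOIN dim_customer d
    ON f.customer_key = d.customer_key;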

Mechanics

The smaller table in the join is used as the inner table unless primary and foreign key constraints are defined. When these keys are defined, the table with the primary key is used as the inner table and the table with the foreign key is used as the outer table, which saves the optimizer the step of determining which table is smaller.
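For example, a minimal sketch of how these constraints might be declared (the table and column names are hypothetical):

  ALTER TABLE dim_customer
    ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key);

  ALTER TABLE fact_sales
    ADD CONSTRAINT fk_fact_customer FOREIGN KEY (customer_key)
    REFERENCES dim_customer (customer_key);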

In the illustration below, both the fact and dimension tables contain unsorted data. The inner (dimension) table is first used to build a hash table in memory. Next, the outer (fact) table is scanned and probed against the hash table in memory. Each time a match is found, a tuple is created. After the outer table has been scanned, the result is processed locally or sent back to the initiator node.

[Animation: hash join build and probe]

Join Spill

A join spill occurs when the hash table fills available memory or hits the memory limit of the resource pool. When this happens, the resources spent on the first attempt are wasted, and the query is retried with the join hint /*+add_vertica_options(EE,ENABLE_JOIN_SPILL)*/. The hash table is then built on disk, requiring additional I/O operations.
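The hint is typically placed immediately after the SELECT keyword, for example (hypothetical table names):

  SELECT /*+add_vertica_options(EE,ENABLE_JOIN_SPILL)*/
         f.sale_id, d.customer_name
  FROM fact_sales f
  JOIN dim_customer d
    ON f.customer_key = d.customer_key;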

Join spills can be found in the Vertica log or by profiling the query; the spill appears as JOIN_SPILLED under Execution Events. In this situation, the join should typically be optimized for a merge join. If the spill happens in a query that is seldom run, a merge join optimization may not be worthwhile and the hint can simply be passed in the query.
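As a sketch of how to surface this after profiling, assuming the standard v_monitor.QUERY_EVENTS system table is available, recent spill events can be listed with a query such as:

  SELECT event_timestamp, event_type, event_description, suggested_action
  FROM v_monitor.query_events
  WHERE event_type = 'JOIN_SPILLED'
  ORDER BY event_timestamp DESC;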

Lastly, if data is being loaded into a pre-join projection using a hash join and the join spills to disk, the load will fail. Ideally, such projections should be optimized for a merge join; otherwise, the join hint needs to be used.

About the author

Norbert Krupa

Norbert is the founder of vertica.tips and a Solutions Engineer at Talend. He is an HP Accredited Solutions Expert for Vertica Big Data Solutions. He has written the Vertica Diagnostic Queries which aim to cover monitoring, diagnostics and performance tuning. The views, opinions, and thoughts expressed here do not represent those of the user's employer.

