My earlier post on the Overview of Join Operations discussed the join operations on a high level. This post will expand on the hash join.
The hash join is the most common join type used in join operations. It does not require optimization for it to be used. Generally, the hash join is efficient and can be as fast as a merge join if it has the necessary resources. However, if it does not have the necessary resource, it could spill to disk and cause unnecessary overhead.
The smaller table in the join is used as the inner table unless the primary and foreign keys are defined. When these keys are defined, the table with the primary key is used as the inner table and the table with the foreign key is used as the outer table. This saves the optimizer the step of determining which table is smaller.
In the illustration below, both the fact and dimension tables contain unsorted data. The inner (dimension) table is first built out in memory. Next, the outer (fact) table is scanned and matched against the table in memory. Each time a match is found, a tuple is created. After the outer table has been scanned, the result is processed locally or sent back to the initiator node.
A join spill will occur if the hash table fills up the memory or hits the memory limit of the resource pool. When this occurs, the resources required to have attempted this join will have been wasted and the query is attempted again with the join hint
/*+add_vertica_options(EE,ENABLE_JOIN_SPILL)*/. The hash table is then built out on disk requiring more IO operations.
Join spills can be found by looking in the Vertica log or after profiling the query. The join spill will appear as
JOIN_SPILLED under Execution Events. Typically, optimization for a merge join should take place in this situation. When a join spill happens to a query that is seldom run, a merge join optimization may not be necessary and the hint can just be passed in the query.
Lastly, if data is being loaded into a pre-join projection using a hash join and it spills to disk, the load will fail. Ideally, the projections should be optimized for a merge join, otherwise the join hint needs to be used.