There are two operator types, merge and hash; each which can be performed on the local node (a local join) or across two or more nodes (a network join). This post will discuss the two operator types and the differences between a local and network join. The mechanics of each operator may be discussed in future posts.
Merge vs. Hash Joins
The most common type of join, the hash join, may be just as efficient as a merge join if it can build out the hash table in memory. Otherwise, optimization for a merge join should be performed.
The issue with a hash join is that if it spills to disk, you will have wasted valuable resources to process the query. Initially, Vertica will attempt to build out the inner table in memory. If the memory fills up, by either hitting the memory limit of the resource pool or filling up all the memory, the query will fail.
At this point, the query will be reprocessed with a hint indicating that the hash table for the join needs to be built out on disk. When a spill to disk occurs, the hash table is being built out on disk incurring more IO cost than necessary. Most importantly, the processing of the query has been slowed down.
In a merge join, both projections participating in the join are sorted or the subqueries for the join are sorted. This allows matched values to be streamed through memory rather than built out as a table. This type of join is optimal for dealing with large data sets as you do not want a spill to disk to occur.
The merge join will always be more efficient than a hash join, but not necessarily faster. If the hash join is working with a small set of data, it’s possible that it can process it faster. However, the hash join will always use more memory.
Local vs. Network Joins
When optimizing join’s, you want to minimize network activity in order to process the query. If the join being performed is with two segmented projections, a network join will take place. If the projection of the dimension table is replication, the query can be processed without involving other nodes and a local join will take place. If there are two identically segmented projections and segmentation is the same on each node, the join can happen locally.
The following illustration shows a NETWORK JOIN (not identically segmented) because the matching data is not on the same node.
The following illustration shows a LOCAL JOIN (assuming dimension table is replicated) because the matching data resides on each node.