This post will give an overview of Vertica’s pre-join projections. Pre-join projections are not frequently used due to the manual set up work and limitations around its usage. However, they can deliver increased performance at query execution by moving the join step into the load.
Vertica’s projections are materialized views that provide physical storage for data. A pre-join projection is more query specific and takes the optimization step out of the actual query execution. It performs additional work at load to give increased performance when queries using joins are executed. The result of the join between two or more tables is sorted in the pre-join projection to provide additional performance to the query.
Pre-join projections have the following limitations:
- A maximum of one pre-join projection can be used per query; any other pre-join projections are treated as regular projections
- The query must contain a superset of the join predicates used in the pre-join projection
- The anchor and fact tables must have primary and foreign keys explicitly defined
- Due to the primary and foreign key constraints, the query can only be an
- Data must be loaded into the anchor table before it can be loaded into the fact table
- The anchor data cannot be partitioned in order to prevent the dropping of the fact data; which would lead to a huge delete operation
The definition of primary and foreign keys must explicitly contain a
NOT NULL in the table’s structure. The uniqueness is enforced with pre-join projections during load. Additionally, changes to the table structure could cause data expansion and impact query performance. A poorly written pre-join could produce incorrect results.
A pre-join is not recommended for tables that experience high deletes and updates. Since the pre-join projection uses an
INNER JOIN, the delete must be cascaded between the anchor and fact tables to delete associated data.
The most important factor is being able to complete the load before another attempts to start. When loading to WOS (not directly to disk), if the moveout does not complete in the allotted time, it will be queued and could cause the moveout to time out. However, the moveout interval can be increased and the WOS container size can be shrunk.
Aside from these considerations, the advantage is that there is no need for the optimizer to join the tables at query execution. This saves on IO as there is only a single set of files to read from, and no network or local processing of the join taking place. Also, if the columns used in a
GROUP BY exist in the pre-join, a
GROUP BY PIPE can be performed. This solution would be ideal for a frequent query that has infrequent loads.
The documentation below provides an example of how a pre-join projection schema is used.
If a pre-join is not a viable solution, the anchor and fact tables could instead be optimized for a merge join. Lastly, the denormalization could take place outside of Vertica (ie. in an ETL tool). If you create a pre-joined object before it comes into Vertica, you aren’t not constrained by the Vertica pre-join limitations.