Best Practices

Enforcing Uniqueness of Data on Load

0

Vertica does not enforce uniqueness of primary key or unique constraints on load due to the potential overhead associated with the operation. There are ways to enforce uniqueness on load such as using MERGE, ANALYZE_CONSTRAINTS() or staging the data in a temporary table. This post will discuss these methods and potential restrictions around them.

Background

Some people may question the need for primary key or unique constraints in Vertica, but there are situations in which the use of primary keys is beneficial. Let’s first see how Vertica handles duplicate values:

In a traditional DBMS, an error for a primary key violation would be raised. However, Vertica allows the duplicate value to be inserted without error. When dealing with large amounts of data on load, taking the time to check uniqueness can become expensive. Instead, the uniqueness is checked at query time, for example, when performing a join.

As I discussed in the Hash Join Operator post, the optimizer actually benefits from explicitly defined primary and foreign keys as it will use the table with the primary key as the inner table and the table with the foreign key as the outer table. This saves the optimizer the step of determining which table is smaller.

Vertica also automatically adds a UNIQUE and NOT NULL constraint to primary key columns. Lastly, columns that are primary keys help when optimizing for a merge and become excellent candidates for hash segmentation in projections.

Below we’ll look at some methods to help enforce uniqueness at load.

Merge

A MERGE operation works by performing an update and insert on rows in the target table from the rows in a source table based on given parameters:

Initially, our target table contains the following values:

We will then merge the source table into the target table:

After the merge, our target table now contains the merged value:

It’s important to remember that MERGE can be extremely expensive1 as it combines the INSERT, UPDATE, and DELETE operations in a single operation.

The documentation1 also discusses limitations with running MERGE on identity/auto-increment columns, columns with primary/foreign key constraints, named sequences or unstructured tables. Optimal projections are also recommended2 to allow for a faster merge join.

1 MERGE
2 Optimizing Projections for MERGE Operations

Analyze Constraints

The ANALYZE_CONSTRAINTS()3 function identifies constraint violations within the defined search path. Before committing a load, this function can be used to identify and handle duplicate values:

Before we COMMIT the second time, we can run SELECT ANALYZE_CONSTRAINTS('public.source'); to identify constraint violations. This will give us:

This function can be used with the COPY command by specifying NO COMMIT to enable review of constraint violations or errors generated during the load. The function can also be used to examine existing data.

3 ANALYZE_CONSTRAINTS()

Staging Data

Staging the data in a temporary table4 allows for uniqueness to be checked before loading into its target table. For example, if additional transformations are being performed, you can use a temporary table to perform the transformations and check for possible constraint violations.

4 Creating Temporary Tables

Outside Transformation

Lastly, the uniqueness of data can be enforced outside of Vertica in an ETL tool. Performing this analysis outside Vertica allows resources to be allocated to other tasks.

About the author / 

Norbert Krupa

Norbert is the founder of vertica.tips and a Solutions Engineer at Talend. He is an HP Accredited Solutions Expert for Vertica Big Data Solutions. He has written the Vertica Diagnostic Queries which aim to cover monitoring, diagnostics and performance tuning. The views, opinions, and thoughts expressed here do not represent those of the user’s employer.

Leave a Reply

Upcoming Events

  • No upcoming events
AEC v1.0.4

Subscribe to Blog via Email

Enter your email address to subscribe and receive notifications of new posts by email.

Read more use cases here.

Notice

This site is not affiliated, endorsed or associated with HPE Vertica. This site makes no claims on ownership of trademark rights. The author contributions on this site are licensed under CC BY-SA 3.0 with attribution required.
%d bloggers like this: