Best Practices, Database Design, Optimization, Performance Tuning

Partitioning: Benefits, Restrictions and Performance

4

This post will focus on the benefits, restrictions and performance of partitioning projections. I’ll also explain my methodology for identifying tables for partitioning. A comprehensive guide on working with table partitions is available in the documentation.

Benefits

The main goal of partitioning is to break up data in a way that improves performance. Partitions in Vertica can improve parallelism during query execution and take advantage of partition elimination. In partition elimination, ROS containers that don’t meet the query predicate on the partitioned column can be omitted at query run time. Additionally, partitions can be quickly dropped. Vertica also comes with auto partitioning, where data is automatically partitioned when it is written to disk.

Vertica also provides several functions to easily manage partitions such as merging, deletion and moving. Deleting partitions is useful when data needs to be removed and the delete is being performed on the partitioned column. Instead of performing a delete and purge, the affected partitions can be dropped and the corresponding ROS containers are eliminated. Moving partitions is useful when archiving historical data and Vertica provides a function to easily move partitions from a source table to a target table.

Restrictions

  • Partitions will apply to all projections for that anchored tabled
  • Vertica supports up to 1024 partitions per table; however, they recommend between 10 and 20 partitions for optimal performance
  • Partitioning should be used on large fact tables instead of dimension tables, especially with pre-join projections
  • The partitioned column cannot contain NULL values for any row, and the column should be explicitly defined with a NOT NULL
  • Once a partition expression is created, it cannot be modified
  • When dropping partitions, the tables need to be up to date and only reside on WOS

Performance

Background

The goal of this test was to evaluate query time against a table with and without partitioning. To further evaluate the performance, the partition was tested with segmented and unsegmented projections, as well as with and without statistics. The test was performed on a two-column table with about 38 million rows distributed across 12 months:

Partition Expression
Test Query

Row output was directed to /dev/null.

Tests

All times in ms

Without Partitioning With Partitioning
First Fetch
(1000 rows)
All rows
formatted
First Fetch
(1000 rows)
All rows
formatted
Segmented All Nodes – No Statistics
26.775 538.796 27.349 528.812
24.951 539.756 19.386 520.012
40.236 544.187 20.267 519.574
24.915 528.968 20.003 515.215
26.396 535.522 20.293 529.198
28.655 537.446 28.373 530.660 (▼ 1.26%)
Segmented All Nodes – With Statistics
31.881 535.274 19.765 518.351
27.492 525.719 20.21 523.06
26.533 527.872 23.526 520.863
29.747 537.857 20.004 516.769
26.214 526.578 19.797 519.663
28.3734 530.66 20.6604 519.7412 (▼ 2.06%)
Unsegmented All Nodes – No Statistics
9.399 523.302 8.948 509.865
8.946 513.976 8.237 507.609
10.03 516.149 8.353 506.028
14.637 527.285 6.719 502.864
10.811 518.259 7.514 505.582
10.7646 519.7942 7.9542 506.3896 (▼ 2.58%)
Unsegmented All Nodes – With Statistics
10.562 516.557 6.882 504.047
8.887 518.018 6.926 501.686
8.638 517.432 8.397 504.165
8.496 517.689 6.817 503.543
9.263 515.791 7.435 505.599
9.1692 517.0974 7.2914 503.808 (▼ 2.57%)
Optimizer Plan Summary
Without Partitioning With Partitioning
Cost Rows Execute on: Cost Rows Execute on:
Segmented All Nodes – No Statistics
19K 38M All Nodes 19K 38M All Nodes
Segmented All Nodes – With Statistics
7K 498K All Nodes 86K 515K All Nodes
Unsegmented All Nodes – No Statistics
75K 38M Query Initiator 75K 38M Query Initiator
Unsegmented All Nodes – With Statistics
28K 1M Query Initiator 27K 1M Query Initiator
Results

While the improvement in this test was minor, the improvement would be better realized on a larger, production data set. The unsegmented (replicated) projections on each node deliver the fastest result; however, partitioning should typically be used on larger fact tables as mentioned earlier.

The estimated costs and rows after updating statistics were fairly similar between the partitioned and non-partitioned tables. The actual rows returned in the test query was 736k.

You may be wondering how to actually check that partitions were eliminated during query execution. The system table v_monitor.query_events will contain a PARTITIONS_ELIMINATED event for the query. The event_details column contains information on how many partitions were used for the query:

Depending on your environment and how the data is segmented or not segmented, you may see more than one event for partition elimination.

Methodology

I begin identifying potential tables to partition by looking at the number of query requests against tables. Large tables with high query usage become good candidates for partitioning because they could potentially benefit from partition elimination. I also look at existing tables with partitions1 and see if the partition expression is still optimal.

The next step is to determine the best column on which to partition. The most common type of data to partition is date/time, although Vertica allows partition expressions referencing more than one column. As mentioned earlier, columns in the partition expression must not contain NULL values. You may want to look at query history to see what columns are frequently included in the predicate.

Lastly, you can use ALTER TABLE with REORGANIZE to add a partition expression. If you are modifying it, you will need to recreate the table with an updated partition expression.

1 Tables with existing partition expressions

About the author / 

Norbert Krupa

Norbert is the founder of vertica.tips and a Solutions Engineer at Talend. He is an HP Accredited Solutions Expert for Vertica Big Data Solutions. He has written the Vertica Diagnostic Queries which aim to cover monitoring, diagnostics and performance tuning. The views, opinions, and thoughts expressed here do not represent those of the user's employer.

4 Comments

  1. kesten December 12, 2014 at 8:30 AM -  Reply

    I’d be curious to see this test run when a join is required. If partition elimination could reduce the size of the join to avoid spilling to disk i would expect a much bigger gain.

    • Norbert Krupa December 12, 2014 at 9:36 AM -  Reply

      Do you mean a pre-join or a straight join? Vertica should optimize with a join, especially with a predicate.

  2. Nithin Mohan February 9, 2015 at 12:50 PM -  Reply

    Hi Norbert,
    Am trying to populate the Date Dimension table in Vertica. Is there any good article that gives us scripts to populate the Dim_Date table and have the right partition / projection for this table ?

    • Norbert Krupa February 9, 2015 at 7:52 PM -  Reply

      Nithin, if you need help creating a calendar table, read this post. Depending on how many years you need, you should try to target between 10 and 20 partitions. Any specific projections will depend on the use of the table.

Leave a Reply

Upcoming Events

  • No upcoming events
AEC v1.0.4

Subscribe to Blog via Email

Enter your email address to subscribe and receive notifications of new posts by email.

Read more use cases here.

Notice

This site is not affiliated, endorsed or associated with HPE Vertica. This site makes no claims on ownership of trademark rights. The author contributions on this site are licensed under CC BY-SA 3.0 with attribution required.
%d bloggers like this: