Administration

Partitions With a Unix Timestamp

0

Some data may not be stored using the traditional date/time type. This post will explore basics around partitioning on a Unix timestamp, calculating a partition clause, and comparing the performance against a table with date/time partitions.

Sample Data

To create sample data, a numbers table is first created as explained in an earlier post. For a larger data set, the table will be cross joined against itself 8 times for 100M rows.

The goal of this example is to partition on month to obtain an optimal 12 partitions. Vertica officially recommends between 10 and 20 partitions per table, although it supports up to 1024 partitions. Below, TIMESTAMPADD is used with RANDOMINT to generate the test sample data:

After generating data, we have a good distribution of ~8.3M rows per month:

For the tests below, statistics will be updated after each set of partitions is created.

Date/Time

To achieve partitioning on month using TIMESTAMP, the partition is set to the year + month:

The query below asks for rows that have a date between June 10 and August 15:

Results
TestTime (ms)
124379.897
224736.195
324376.110
424476.926
524813.724
Average24556.570

Unix Time

As I detailed in my StackOverflow answer, it’s better to use a TIMESTAMP data type as it reduces the complexity of conversions when querying data and creating partitions. For this test, EXTRACT is used to create a Unix timestamp representation of the above date/time table:

Since the Unix timestamp is defined as the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC), the column containing the timestamp must be divided by the appropriate number of seconds for partitioning:

Partition ByDivide BySeconds
Minute6060
Hour60 * 603600
Day60 * 60 * 2486400
Week60 * 60 * 24 * 7604800
Month60 * 60 * 24 * 7 * 42419200
Year60 * 60 * 24 * 36531536000

The division of the column containing the timestamp with the corresponding value from the table yields the time unit since the Unix epoch. For example, partitioning by week requires division by 604,800 seconds. This returns the week number since the Unix epoch, which provides uniqueness and adequate partitioning, but not the actual representation of the week of the year.

Partitioning by month with a Unix timestamp is more difficult as each month has varying days. If the partition clause were 60 * 60 * 24 * 7 * (365.25 / 12) or 2629800, there could be duplicate months (since Epoch) causing skewed partitions:

Therefore, it’s better to use 60 * 60 * 24 * 7 * 4 or 2419200 to ensure uniqueness and avoid skewed partitions. The same idea follows for the year for using 365 common calendar days over the average Gregorian year of 365.25 days.

As above, the partition will be made on the month. To better understand how the partitions will be represented, the table can be queried using the partition clause:

The partitions are represented as the (adjusted) month since the Unix epoch:

Along with FLOOR, the partition clause can explicitly use 60 * 60 * 24 * 7 * 4 or simply 2419200:

In order to translate the above query for a Unix timestamp, EXTRACT(EPOCH... is used:

Results
TestTime (ms)
113872.347
213830.133
313786.550
413983.911
513750.251
Average13844.638

Conclusion

Partition Elimination

The obvious advantage of partitioning is partition elimination as seen in v_monitor.query_events:

The concept of partition elimination is covered in the documentation. In this execution step, partitions which aren’t matched based on the predicate can be potentially eliminated.

Best Type

Regarding the practice of using a Unix timestamp over a date/time TIMESTAMP, there is no practical advantage to storing data with a Unix timestamp. Unless the business is comfortable with using Unix timestamps for analysis, TIMESTAMP is the best option. It comes with a range of practical date/time functions as well higher precision and range.

In the tests above, the queries returned the same number of rows (16.6M). Fetching those rows using a Unix timestamp was 43.6% faster than date/time timestamp. However, there is a higher cost if that Unix time is converted TO_TIMESTAMP:

StorageRetrievalTime (ms)
TIMESTAMPTIMESTAMP24556.570
INTINT13844.638
INTTO_TIMESTAMP27914.617

At that point, there is little reason to use a Unix timestamp.

About the author / 

Norbert Krupa

Norbert is the founder of vertica.tips and a Solutions Engineer at Talend. He is an HP Accredited Solutions Expert for Vertica Big Data Solutions. He has written the Vertica Diagnostic Queries which aim to cover monitoring, diagnostics and performance tuning. The views, opinions, and thoughts expressed here do not represent those of the user's employer.

Leave a Reply

Upcoming Events

  • No upcoming events
AEC v1.0.4

Subscribe to Blog via Email

Enter your email address to subscribe and receive notifications of new posts by email.

Read more use cases here.

Notice

This site is not affiliated, endorsed or associated with HPE Vertica. This site makes no claims on ownership of trademark rights. The author contributions on this site are licensed under CC BY-SA 3.0 with attribution required.
%d bloggers like this: