I recently responded to a request to review the first published book on Vertica that I’ve seen. The book, HP Vertica Essentials (Amazon & Packt Publishing) by Rishabh Agrawal, intends to cover deployment, administration and management of Vertica. This review breaks down the chapters to see how well the book covers those topics.
I was quite surprised at how light the book is in terms of content. Out of 106 pages, 30 are front and back matter (cover, copyright, credits, reviewers, table of contents, index). This leaves some 76 pages to cover the following topics:
- Chapter 1 – Installing Vertica
- Chapter 2 – Cluster Management
- Chapter 3 – Monitoring Vertica
- Chapter 4 – Backup and Restore
- Chapter 5 – Performance Improvement
- Chapter 6 – Bulk Loading
The target audience for this book is “Vertica users and DBAs who want to perform basic administration and fine tuning.” The book says prior knowledge of Vertica is not mandatory, although in my opinion a user without it will most likely be lost in this book.
Chapter 1 – Installing Vertica
The author begins by outlining how Vertica differs from other MPP databases and mentions that data is stored in a columnar fashion, but misses encoding on top of compression, as well as other critical features of Vertica such as its high availability. I feel the author should also have listed the features that come with each edition (Community vs. Enterprise) of Vertica when mentioning the Management Console. It would have been helpful to mention that the logical design that’s typically performed at the database level can be taken to the schema level.
The book attempts to cover the pre-installation requirements in four short paragraphs. There are many other critical pre-installation steps, such as OS-level configuration and hardware planning, that should have been touched on. The author suggests keeping 20-30 percent of disk space free on each node; the official recommendation, however, is 40 percent. There is also no mention that Vertica can be run on AWS or that it can be run locally using a VM image from the Marketplace.
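To put the gap between those two figures in concrete terms, here is a minimal Python sketch of the arithmetic. The 1,000 GB disk size and the function name are my own illustrative assumptions, not something taken from the book or the documentation.

```python
def usable_gb(raw_gb: int, free_percent: int) -> float:
    """Disk space left for data on one node if free_percent must stay empty."""
    if not 0 <= free_percent < 100:
        raise ValueError("free_percent must be between 0 and 99")
    return raw_gb * (100 - free_percent) / 100


RAW_GB = 1000  # hypothetical per-node disk size, for illustration only

# Compare the book's suggested range with the official recommendation.
for label, percent in [("book (low)", 20), ("book (high)", 30), ("official", 40)]:
    print(f"{label:>12}: keep {percent}% free, {usable_gb(RAW_GB, percent):.0f} GB usable")
```

On a node of that size, the difference between the book's most optimistic figure and the official one is a fifth of the disk, which matters when sizing hardware.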
The rest of the chapter steps through the software installation process and says it aims to cover a two-node cluster installation. I can’t really come up with any good reason to demonstrate a two-node installation, as the most common installation has three nodes. Including the output from the installation script seems completely unnecessary.
Chapter 2 – Cluster Management
Most of the material in this chapter seems to imply that projections are strictly segmented, when they can obviously also be replicated (especially in the case of smaller dimension tables in a star schema). This isn’t hinted at until Chapter 5. However, the author does a decent job of explaining how skew plays a factor in segmentation.
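To make the skew point concrete, here is a minimal Python sketch that distributes rows across nodes by hashing a segmentation key. It is only an illustration: the MD5-based hash is a stand-in (not Vertica's actual hash function), and the three-node cluster, key names and row counts are hypothetical.

```python
import hashlib
from collections import Counter

NODES = 3  # hypothetical three-node cluster


def node_for(key: str, nodes: int = NODES) -> int:
    """Pick a node for a key using a stable hash (stand-in, not Vertica's hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def rows_per_node(keys, nodes: int = NODES) -> Counter:
    """Count how many rows land on each node when segmenting by the given keys."""
    counts = Counter({n: 0 for n in range(nodes)})
    counts.update(node_for(k, nodes) for k in keys)
    return counts


def skew_ratio(counts: Counter) -> float:
    """Rows on the busiest node divided by the average per node (1.0 = perfectly even)."""
    return max(counts.values()) / (sum(counts.values()) / len(counts))


# High-cardinality key: every row has a distinct value.
order_ids = [f"order-{i}" for i in range(30_000)]
# Low-cardinality key: same row count, but 90% of rows share one value.
countries = ["US"] * 27_000 + ["DE"] * 2_000 + ["JP"] * 1_000

even = rows_per_node(order_ids)
skewed = rows_per_node(countries)

print("segmented by order id:", dict(even), "skew", round(skew_ratio(even), 2))
print("segmented by country: ", dict(skewed), "skew", round(skew_ratio(skewed), 2))
```

Segmenting on the distinct order id spreads rows almost evenly, while segmenting on the country code puts at least 90% of the rows on a single node. A replicated projection sidesteps the question entirely by keeping a full copy on every node, which is why it suits small dimension tables.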
There appears to be confusion between adding hosts to a cluster and adding nodes to a database. The distinction should be more explicit, and each process should be called out individually, as is done for removing a node from the database and removing a host from the cluster.
There is also no mention that a K-safety higher than 2 isn’t really recommended or that a minimum K-safety of 1 is required for production clusters.
The rest of the chapter does a fairly decent job of describing node/host management. However, the section on spread isn’t really applicable after version 6.1 since spread is integrated into the OS.
Chapter 3 – Monitoring Vertica
This chapter fell short on a critical part of Vertica: workload management. At the very least, there should have been a mention of resource pools, query requests, background processes, monitoring for potential problems, and the data collector and its role in aggregating system data.
Chapter 4 – Backup and Restore
The author provides a good basic overview of the backup and restore process. Listing the miscellaneous settings and parameters for vbr.py seems unnecessary; a reference to the documentation would have sufficed. When mentioning copycluster, the author could have added more detail about how a dormant node can be used in a production environment for failover. I feel that this chapter could have highlighted more of Vertica’s high availability strategy, as some customers don’t even use backup.
Chapter 5 – Performance Improvement
The author does a poor job of comparing Vertica’s columnar architecture to a traditional row-store. While it is true that Vertica reads only the columns involved in the query, the same can be true of a traditional row-store under certain conditions with proper indexes. I feel a better approach would be to show an example of how the physical data is stored in each architecture.
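As a rough idea of the kind of example I have in mind, here is a minimal Python sketch of the two layouts. It is a toy model under my own assumptions (a made-up four-column table, in-memory lists instead of disk blocks, no indexes, no encoding or compression), not a description of how Vertica or any particular row-store actually lays out data on disk.

```python
# Hypothetical table used only for illustration.
COLUMN_NAMES = ("id", "name", "country", "amount")
ROWS = [
    (1, "alice", "US", 120.50),
    (2, "bob", "DE", 75.00),
    (3, "carol", "US", 310.25),
]

# Row store: all values of one row sit next to each other.
row_layout = [value for row in ROWS for value in row]

# Column store: all values of one column sit next to each other.
column_layout = {
    name: [row[i] for row in ROWS] for i, name in enumerate(COLUMN_NAMES)
}


def sum_from_row_store(flat, width, col_index):
    """Sum one column by scanning whole rows; returns (total, values touched)."""
    total, touched = 0.0, 0
    for start in range(0, len(flat), width):
        record = flat[start:start + width]  # the entire row is read
        touched += len(record)
        total += record[col_index]
    return total, touched


def sum_from_column_store(columns, name):
    """Sum one column by reading only that column; returns (total, values touched)."""
    values = columns[name]
    return sum(values), len(values)


row_total, row_touched = sum_from_row_store(
    row_layout, len(COLUMN_NAMES), COLUMN_NAMES.index("amount")
)
col_total, col_touched = sum_from_column_store(column_layout, "amount")

print("row store:   ", row_total, "values touched:", row_touched)
print("column store:", col_total, "values touched:", col_touched)
```

Both layouts return the same total, but the row layout touches every value in the table while the column layout touches only the `amount` values. A covering index on a row-store would narrow that gap, which is exactly the nuance the book's comparison leaves out.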
The first section also incorrectly states that a superprojection gets created when the table is created (it is actually created at the first data load). The remainder of the section does a reasonable job of introducing the concept. With regard to high availability and recovery, I feel it’s important to mention how Vertica uses checkpoint epochs with projections to recover data.
The material on the Database Designer seemed to completely miss the performance design priority (Balanced/Query/Load).
The remainder of the chapter briefly discusses the concept of ROS/WOS and Tuple Mover operations.
Chapter 6 – Bulk Loading
The COPY command is covered with extreme brevity. There should have been some details about monitoring loads.
I feel that the book falls short on the discussed topics. Critical concepts such as the architecture, resource management, and monitoring/troubleshooting are not adequately covered. I couldn’t find anything that isn’t more thoroughly covered in the official documentation. There is too much space used on script outputs and screenshots.
It also seemed that the book tried to be version agnostic; however, many features such as the installation script, Database Designer and Management Console have been dramatically improved and overhauled. The author should have explicitly mentioned that this book focuses on version 6.1. The book comes late to the game, as version 7.0 was released late last year.
Covering the essentials of the platform properly would probably require at least three books (with 500+ pages). A proper anthology on the platform would probably look like:
- Performance Tuning
I rate the book 2 out of 5 stars.