Elasticsearch is a popular full text search engine built on top of Lucene that can enhance a business analyst decision making process by giving them the ability to quickly “search” for random content from documents or text that might not normally all be stored in a database. This kind of task allows individuals to search for nuggets of information that needs to be blended with data in a relational database such as Vertica to provide a more complete picture of the business situation.
Combining text content from a product like Elasticsearch introduces a new kind of workflow that is more of a bottom up approach, as an end user I can search my content for fine grained information and then identify if there are any trends. A typical Business Intelligence (BI) perspective is often the reverse, start with a summary and then drill down to detail. Here is a simple UI example showing the results of Elasticsearch integrated with data from Vertica.
In the screenshot above the end user has content from the game chat blog in the center and along the left a brief summary of the chats showing a breakout of both sentiment of those chats (T= positive, F=negative) and a distribution by the sender (in our example game players). The charts in the middle are showing summarized data from Vertica about the number of games played broken out by various business dimensions such as device, region and game played. Thus the question…Where does search fit in the world of analytics? Does it have to be one or the other or can they both be integrated to provide a seamless solution? This blog will demonstrate how to integrate a search engine (Elasticsearch) with Vertica allowing end users to leverage the benefits of both products.
End Users Perspective
From an end users perspective, often the best method to maintain productivity is to have an integrated solution when performing a particular analytical task. Traditionally most of the time a person spends doing research on a particular problem can often be solved using a BI based analytical tool fed with data from Vertica. More and more use cases are now starting to blend the traditional relational world of data with more semi structured or unstructured information. Some BI and ETL tools are starting to incorporate the ability to blend data from a wide variety of nontraditional sources such as:
Other BI products provide methods such as custom java beans, or other frameworks such as data federation layer that involves additional installation, setup, configuration and complexity.
Although these options might work ok for low volumes it is not an optimized solution since each user or BI tool will have to implement their own integration and these options typically will not scale well with big data. There are other potential issues of having a BI tool directly connect to these sources such as:
- Authentication/Authorization/Control – undirected access can get out of control.
- Corporate firewalls proxy settings – are often issues with end users.
- Performance – end user will be most likely joining to other data sources on the client or BI server that does not have the compute power of a cluster of servers.
Some BI products also provide a “search” capability but it is limited to the content in the BI Server.
A More Optimized Approach
A more optimized solution to connecting to sources like Twitter, Hadoop or Elasticsearchis to have it be done via a backend process that can be leveraged across all environments such as Vertica’s Hadoop connectors or the Vertica Social Media Connector for Flume on Github, or written via well published frameworks such as the Vertica UDX architecture. In this blog we will show how easy it is to connect to text data stored in Elasticsearch via two different methods, Vertica UDX and Vertica Flex Tables.
Here are the Vertica tables used in this example.
Use Case 1: Well Known Search Terms (with Vertica Flex Zone)
The first use case centers around the situation of a gaming company that might have lots of text that has been captured on game chat rooms. This data needs to be blended with data in Vertica so analytics such as finding the sentiment of our most valued customers based on searching data in the text of chat room messages. This text data could also be put into Vertica as well and use the Vertica search text index capability but some customers also like to have the option of keeping it in their text based repository for different query languages like what Elasticsearch provides.
The first option for integrating text based data from Elasticsearch is to use the Out of the Box Vertica capability called Flex Zone. Flex zone is great for handling use cases that are considered schema on read and having the data in json format, which for the Elasticsearch use case is perfect. In our scenario we have some “well known” search terms that we want to perform analytics on using our favorite BI tool and merge those terms with data in Vertica. The following is a simple script that loops thru a file that contains our well known search terms from Elasticsearch.
echo "truncate table es_search_results;"
NAMES="$(< names.txt)" #names from names.txt file
for NAME in $NAMES; do
echo "drop table flexternal_"$NAME"_view cascade;"
echo "CREATE FLEX EXTERNAL TABLE flexternal_"$NAME"() AS copy SOURCE curl(url='http://markwesbox:9200/sessionmsg/_search?size=1000000&q=msg:"$NAME"') PARSER FJSONParser(start_point='hits', start_point_occurrence=2);"
echo "select compute_flextable_keys_and_build_view('flexternal_"$NAME"');"
echo "insert into es_search_results select _id,_index,_score::float,\"_source.is_positive\"::boolean,substring(\"_source.msg\",1,1500) ,\"_source.roomid\"::integer,\"_source.roominstanceid\"::integer,\"_source.senderid\"::integer,\"_source.sessionid\",_type, '$NAME' as search_term from flexternal_"$NAME"_view;"
echo "commit;" With flex tables
(Note: rest call above uses the ‘start_point_occurence’ which is available with Vertica 7.1 SP2)
As you can see, the flex table is created for each search term and then we insert all the results into a common table called “es_search_results”. All we need to do know is point our BI tool to this table and we now can join our Elasticsearch data to our player table for analysis. The gender is from our data mart master player dimension and the new search_term, nbrofhits and avgscore are all now in the Vertica table sourced from Elasticsearch with the script above.
The benefits of this approach are:
- Better utilization of system resources. This data is loaded in a batch fashion during off hours and takes up less disk than loading all chats.
- Better performance for BI query perspective – Data is already loaded into Vertica.
- Smarter data in Vertica – Only load the text data that is most relevant for analytics.
Use Case 2: Unknown Search Terms (with Vertica UDX)
The next use case is for situations where the end user does not know ahead of time what search term to look for in while using their BI tool. To solve this use case a very simple Vertica UDX Scalar function was written to access data in Elasticsearch. The meat of the UDX is listed below:
Now we can invoke the Vertica UDX to make a call to Elasticsearch to see if any of these players have also chatted the term of ‘happy’. All that needs to be done is to provide prompt values for the Vertica UDX when we drag and drop the Dynamic Query Subject onto our existing query from above. Most BI tools provide the ability to prompt for values when invoking a Vertica UDX which is what is shown below:
From the UDX above the prompts from the user be provided to the UDX:
- field_toSearch = “senderid” this is how we can link a chat to a user in vertica.
- field_value = the senderid of the 724 female players from the above query.
- search_term = in this case we put in the word ‘happy’
- search_type = ‘term’ in the UDX we pass the type of search to Elasticsearch.
And as you can see we do have a couple of users who have chatted the word ‘film’ and chatted the word ‘happy’.
The benefits of this approach are complete adhoc access to any text data in Elasticsearch.
Elasticsearch is actually part of a stack called ELK (Elasticsearch, Logstash and Kibana). Kibana provides a dash boarding capability that reports on data from Elasticsearch, but it is limited to data that is only in Elasticsearch. As demonstrated in this blog it is possible to combine the best of both BI and search worlds using Vertica as the hub of integration. Having the database be the glue for integration is more efficient because:
- Leveraged integration work – can be leveraged to all BI and ETL teams.
- Best practices for implementation – of these kinds of sources can be done by the people who have the skills to ensure they are optimized, secure and conform to corporate standards.
- Performance – leverage the MPP database cluster for parallelization.
Stay tuned for a follow up blog that will explain in more detail the code examples used in this post.