All organizations have some sort of ETL or ELT process. At the Big Data Conference, Jack Gudenkauf of Playtika introduced a new type of process, named PSTL (Parallelized Streaming Transformation Loads), which parallelizes the load and transform steps throughout the entire data pipeline.
Playtika is a leader in social casino gaming, with games such as World Series of Poker, Bingo Blitz, and Slotomania. Jack Gudenkauf, the VP of Big Data at Playtika, led the development and implementation of an architecture and solution that provides scalable, reliable management of the entire data pipeline across all of these external studios.
Jack’s inspiration for this solution came from working as Manager of the Twitter Analytics Data Warehouse team, and from his time at Microsoft, where he invented a precursor to the PSTL system called “Shuttle”, a multi-tier distributed data platform for importing disparate MSN.com data. For a few years, Jack had a vision of a unified data pipeline with parallelism throughout: from streaming data in, through transformations covering poly-structured data. His solution had to provide high availability and strong durability, increase productivity, enable analytics, support SQL, perform well, and scale at every point in the pipeline.
Originally at Playtika, the sources of truth were siloed databases for each studio, such as Bingo Blitz or Slotomania. Users and sessions were identified locally and then fed into the global data warehouse, Vertica.
In the original architecture, the game applications would write to Flume, then to a Java ETL parser and loader, which loaded the data into Vertica using COPY. A global source of truth was needed in the global data warehouse, backed by a robust solution that could be parallelized.
In the new architecture, the game applications write as producers to a local Kafka queue. This small change allows real-time messages to be read from the “queues” in parallel. Next, Spark is used for ETL instead of a monolithic Java application. Partitions read from Kafka are loaded directly into RDDs (Resilient Distributed Datasets), allowing transformations to execute in parallel across the cluster. Lastly, instead of a single COPY command loading into Vertica, multiple COPY streams run in parallel.
The differentiator in this architecture is that multiple applications can run in parallel, with transformations running independently of the extracts. Additionally, the loaders can write independently of the transformations. With a solution such as PSTL, every step of the pipeline can be parallelized. This enables analytics over semi-structured data, machine learning without consuming resources on Vertica, and data validation throughout the entire pipeline.
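The decoupling of stages can be sketched in miniature (a hypothetical Python stand-in, not Playtika's Spark code): each stage runs on its own thread, connected only by queues, so a slow loader never blocks the extractor.

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream between stages

def extract(raw_messages, out_q):
    # Extract stage: read messages (a list stands in for Kafka here).
    for msg in raw_messages:
        out_q.put(msg)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # Transform stage: runs independently of extract, pulling items as they arrive.
    while (msg := in_q.get()) is not SENTINEL:
        out_q.put(msg.upper())  # stand-in for parsing/structuring
    out_q.put(SENTINEL)

def load(in_q, sink):
    # Load stage: writes independently of the transform.
    while (row := in_q.get()) is not SENTINEL:
        sink.append(row)

sink = []
q1, q2 = queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=extract, args=(["a", "b", "c"], q1)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2, sink)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
```

All three stages overlap in time; back-pressure comes only from the queues between them.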
In Playtika’s solution, a Kafka topic represents an application combined with a session or user and some delineation of the data, which happens to match the Vertica fact tables. Topics can therefore be partitioned, allowing multiple streams to read partitions of data. The RDDs in Spark are partitioned elements in memory across the Hadoop cluster and can be operated on in parallel. Users and sessions are imported and mapped to give consistency across all data stores; a UserId globally identifies a user throughout the system.
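The idea of a globally consistent UserId can be illustrated with a hypothetical registry sketch (the class and names below are illustrative, not Playtika's implementation): each studio's local identifier maps idempotently to one stable global id, so every data store agrees on who a user is.

```python
import itertools

class UserRegistry:
    """Hypothetical sketch: map each studio's local id to a global UserId."""

    def __init__(self):
        self._ids = {}                 # (studio, local_id) -> global UserId
        self._next = itertools.count(1)

    def global_user_id(self, studio, local_id):
        # Idempotent lookup: the same (studio, local_id) always yields
        # the same global UserId, on every call and in every data store.
        key = (studio, local_id)
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]

reg = UserRegistry()
uid = reg.global_user_id("slotomania", "abc123")
assert uid == reg.global_user_id("slotomania", "abc123")   # stable mapping
assert uid != reg.global_user_id("bingo_blitz", "abc123")  # distinct users
```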
The Vertica projection hash function on UserId is leveraged before the data is loaded, to determine which specific Vertica node a record will be loaded to. Using Spark, the session, user, and other RDD record column(s) are hashed to match the projection column(s) of the destination Vertica table. This avoids data movement between Vertica nodes on load.
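A minimal sketch of this pre-partitioning step (Python's built-in `hash()` stands in for Vertica's projection hash function here; the real system must use the exact hash Vertica uses, or records would land on the wrong node):

```python
from collections import defaultdict

# Hypothetical node list; real deployments read this from the cluster.
NODES = ["node0001", "node0002", "node0003", "node0004"]

def target_node(user_id: int) -> str:
    # Route each record to the node that owns its projection segment.
    # NOTE: hash() is a stand-in for Vertica's projection hash function.
    return NODES[hash(user_id) % len(NODES)]

records = [{"user_id": u, "event": "spin"} for u in range(10)]
by_node = defaultdict(list)
for rec in records:
    by_node[target_node(rec["user_id"])].append(rec)

# Every record for a given user lands on exactly one node, so the load
# requires no cross-node shuffling inside Vertica.
```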
Data is loaded in parallel from in-memory RDD partitions directly to specific Vertica nodes based on the hash(). This is accomplished using a Vertica user-defined tcpServer COPY source. The source can be Spark RDD partitions, or any streaming source such as netcat, streaming data in parallel over a socket to each Vertica node listening on a given port. Using a single Vertica COPY command per table, all Vertica nodes listen on a given port, and each node may also have a level of parallelism set (e.g., 4), allowing direct, concurrent, parallel writes to each Vertica node:
COPY schema.tableName WITH SOURCE SPARK(port='12345', nodes='node0001:4,node0002:4,node0003:4,node0004:4,node0005:4,node0006:4,node0007:4,node0008:4') DIRECT;
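The load path itself is just rows streamed over TCP sockets. The sketch below is a hypothetical stand-in (the listener plays the role of the Vertica tcpServer UDSource, which we do not have the source for): a "partition" connects to the port its target node is listening on and streams its rows.

```python
import socket
import threading

received = []

def listener(server_sock):
    # One listening port per node; in the real system, the Vertica
    # user-defined COPY source accepts this connection.
    conn, _ = server_sock.accept()
    with conn:
        data = b""
        while chunk := conn.recv(4096):
            data += chunk
    received.extend(data.decode().splitlines())

server = socket.socket()
server.bind(("127.0.0.1", 0))        # ephemeral port, for the demo only
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=listener, args=(server,))
t.start()

# An RDD partition streams its rows directly to the node that owns them.
rows = ["42|spin|2015-10-19", "42|win|2015-10-19"]
with socket.create_connection(("127.0.0.1", port)) as sock:
    sock.sendall(("\n".join(rows) + "\n").encode())

t.join()
server.close()
```

In the real pipeline, many such connections run concurrently, one (or more, per the parallelism level) per node.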
After loading into the Vertica data warehouse, the RDDs write both the raw, unparsed JSON and the transformed, structured Spark DataFrames (RDDs with schema), which now match the Vertica table definitions, to Parquet/ORC files on HDFS. The Kafka partition offsets for each processed streaming batch are stored in MySQL, which enables idempotent replay of the data in case of bugs in the PSTL.
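A sketch of that offset bookkeeping (sqlite3 stands in for MySQL here, and the table and column names are illustrative, not Playtika's schema): recording a batch is a no-op if it was already recorded, which is what makes replay idempotent.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE batch_offsets (
    topic TEXT, kafka_partition INTEGER,
    from_offset INTEGER, until_offset INTEGER,
    PRIMARY KEY (topic, kafka_partition, from_offset))""")

def record_batch(topic, partition, from_offset, until_offset):
    # INSERT OR IGNORE makes re-recording the same batch a no-op,
    # so a replayed batch is never double-counted.
    db.execute("INSERT OR IGNORE INTO batch_offsets VALUES (?, ?, ?, ?)",
               (topic, partition, from_offset, until_offset))

def already_processed(topic, partition, from_offset):
    row = db.execute(
        "SELECT 1 FROM batch_offsets "
        "WHERE topic=? AND kafka_partition=? AND from_offset=?",
        (topic, partition, from_offset)).fetchone()
    return row is not None

record_batch("slotomania.spins", 0, 0, 500)
assert already_processed("slotomania.spins", 0, 0)      # replay: skip
assert not already_processed("slotomania.spins", 1, 0)  # new work: proceed
```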
The architecture is a true parallel scalable model.
Playtika was able to load 451 GB in 7 minutes 35 seconds using an 8-node cluster, roughly 3.6 TB/hour, or about 0.45 TB/hour per node. As a reference from last year’s post, Facebook needed a 270-node cluster with 45 dedicated loader nodes to ingest 35 TB/hour. Extrapolating linearly, Playtika can achieve this ingest rate with an 81-node cluster and no dedicated loader nodes. The solution is also scalable.
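As a rough back-of-the-envelope check (assuming linear scaling, which the measured numbers can only approximate), the quoted figures can be reproduced:

```python
# Measured: 451 GB loaded in 7 min 35 s on an 8-node cluster.
load_gb, seconds, nodes = 451, 7 * 60 + 35, 8

tb_per_hour = load_gb / 1000 * 3600 / seconds   # cluster throughput, ~3.6 TB/h
per_node = tb_per_hour / nodes                  # ~0.45 TB/h per node
nodes_for_35tb = 35 / per_node                  # roughly 80 nodes

print(f"{tb_per_hour:.2f} TB/h on {nodes} nodes; "
      f"~{nodes_for_35tb:.0f} nodes for 35 TB/h")
```

The result lands near 80 nodes, consistent with the 81-node figure quoted above.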
Jack’s session on the PSTL architecture and solution packed the room at the Big Data Conference. The session highlighted not only an impressive architecture and solution, but also the innovation in leveraging Kafka, Spark, and Vertica together. Moving away from dedicated ephemeral nodes to targeted, pre-hashed data loads leads to faster and more efficient data ingestion. Organizations will be able to test out new techniques for loading when the hash function is released as part of Excavator.
Lastly, Jack remarked that he was extremely grateful for the support from Vertica’s technical resources: “They’re amazing! Having shipped 15 products at Microsoft, I can tell you that you would be hard pressed to get this kind of love from most vendors.”
October 19, 2015
The session is now available for viewing.