Twister2 — A new data processing platform

Supun Kamburugamuve
Nov 27, 2019

Data processing is becoming part of everyday life for any organization looking to the future. Big data requires clusters of computers to store, manage, and analyze it, because of both its storage and its computational requirements. Hadoop (MapReduce) was the first widely used system to manage and process large data sets at scale, and it is fair to call it the first big data system. Then came Apache Spark, Apache Flink, Apache Beam, and Apache Storm, four of the most successful data processing systems. Many others followed, but these have been the most successful.

Before any of these systems existed, people were already running calculations on large clusters of thousands of machines with hundreds of thousands of CPUs. These systems emerged in the early 1980s and converged on the Message Passing Interface (MPI) in the 1990s. Scientists still use High-Performance Computing (HPC) systems for large-scale calculations, because none of the big data processing systems can scale to or handle such computations.

Twister2 combines the power of HPC and data management capabilities in a single framework. It is built from the ground up as a data processing system while also leveraging the MPI specification for calculations. It is an open-source project under the Apache License 2.0.

The documentation can be found in

We are up to the 0.4.0 release of Twister2. Twister2 is written primarily in Java and also has a Python API. Some of the highlights of Twister2 are:

  1. Native streaming and batch processing (not a batch engine built on streaming, or streaming built on a batch engine)
  2. High performance: can leverage advanced networking hardware to accelerate processing
  3. APIs similar to those of Spark, Flink, and Storm
  4. Python API
  5. Support for the Apache Beam API and the Apache Storm API
  6. Kubernetes-based deployments
  7. Integration of MPI-style and data processing applications
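
To make the first highlight a little more concrete, here is a minimal sketch of how the same computation looks in batch mode and in streaming mode. It is plain Python and does not use the Twister2 API; the function names `batch_sum` and `streaming_sum` are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, List


def batch_sum(records: List[int]) -> int:
    """Batch: the input is bounded, so a single final result is produced."""
    return sum(records)


def streaming_sum(records: Iterable[int]) -> Iterator[int]:
    """Streaming: the input may never end, so an updated result is emitted per record."""
    total = 0
    for value in records:
        total += value
        yield total


# Illustrative data only; a real stream would come from a message source.
data = [3, 1, 4, 1, 5]

batch_result = batch_sum(data)
stream_results = list(streaming_sum(iter(data)))

print(batch_result)    # 14
print(stream_results)  # [3, 4, 8, 9, 14]
```

The batch version waits for all the data and returns once, while the streaming version keeps a running state and emits as records arrive. An engine that treats both modes as first-class can presumably optimize each one separately instead of emulating one on top of the other.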

With the current release, we are just scratching the surface of what data processing should be. We are working on integrating with hardware acceleration frameworks such as RAPIDS and UCX. We are also planning to build a SQL interface for Twister2.

If you are an adventurous programmer looking for new ways to do data processing, please join the discussion.

twister2@googlegroups.com


Supun Kamburugamuve

Co-Author of "Foundations of Data Intensive Applications"