Large-scale data engineering has gone through a remarkable transformation over the past decade. We have seen wide adoption of Big Data frameworks, from Apache Hadoop and Apache Spark to Apache Flink. Today, Artificial Intelligence (AI) and Machine Learning (ML) have further broadened the scope of data engineering, demanding faster and more easily integrated systems that can operate on both specialized and commodity hardware.

A data science workflow is a complex, interactive process. It starts with raw data in large data stores. We create structured data sets from this raw data using ETL (Extract, Transform, Load) tools such as Hadoop or Spark…
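As a concrete sketch of that ETL step, the PySpark snippet below reads a raw CSV file, cleans it, and writes out a structured columnar data set. The file paths and column names are illustrative assumptions, not from the original article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (a cluster deployment would configure a master URL).
spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV data; path and schema inference are illustrative.
raw = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("raw/events.csv"))

# Transform: drop malformed rows and derive a cleaned date column.
cleaned = (raw.dropna(subset=["user_id"])
              .withColumn("event_date", F.to_date("timestamp")))

# Load: write a structured, columnar data set for downstream analysis.
cleaned.write.mode("overwrite").parquet("structured/events")
```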


Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movement. One goal of data engineering in HPC is to transform original data into the vector/matrix/tensor formats accepted by scientific and deep learning applications.
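A minimal sketch of that transformation, using pandas and NumPy with made-up column names: a small table becomes a 2-D feature tensor and a label vector of the kind a deep learning framework can consume.

```python
import numpy as np
import pandas as pd

# A small illustrative table; in practice this would come out of an ETL pipeline.
df = pd.DataFrame({
    "feature_a": [0.1, 0.5, 0.9],
    "feature_b": [1.0, 2.0, 3.0],
    "label":     [0,   1,   1],
})

# Turn the tabular features into a dense float matrix (a 2-D tensor)
# and the labels into an integer vector.
X = df[["feature_a", "feature_b"]].to_numpy(dtype=np.float32)
y = df["label"].to_numpy(dtype=np.int64)

print(X.shape, y.shape)  # (3, 2) (3,)
```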

There are many data structures, such as tables, graphs, and trees, for representing data. Among them, the table is a versatile and commonly used format for loading and processing data. Cylon provides a distributed-memory DataFrame API in Python for…
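The sketch below is modeled on the published PyCylon join example: a distributed join over MPI, where each worker holds a partition of the data. Exact class names and arguments may differ across Cylon releases.

```python
from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

# Initialize a distributed environment over MPI; launch with,
# e.g., `mpirun -np 4 python join.py`.
env = CylonEnv(config=MPIConfig())

df1 = DataFrame([[1, 2, 3], [10, 20, 30]])
df2 = DataFrame([[1, 2, 4], [40, 50, 60]])

# Distributed join on the first column; passing `env` makes the
# operation run across all workers instead of locally.
df3 = df1.merge(right=df2, on=[0], env=env)
print(df3)

env.finalize()
```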


Simple solutions that can solve a large number of problems ignite engineers' creativity. Hadoop is a great system, developed around a single operation called map-reduce. It is simple enough to be understood by many people and generic enough to solve many problems. For a long time, people believed that Hadoop could be good at solving many problems with different computational requirements. People wrote machine learning libraries around Hadoop, and many research papers were published that either showcased how to get good results from Hadoop or improved Hadoop to achieve better performance. All these efforts…
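To make that single operation concrete, here is a plain-Python sketch of the map-reduce pattern Hadoop generalizes, using the canonical word-count example; Hadoop itself would shard the input and run the map and reduce phases across a cluster.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big systems", "hadoop made big data simple"]

# Map phase: emit (word, 1) pairs from each input record.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for key, value in chain.from_iterable(mapper(d) for d in documents):
    groups[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, ...}
```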


Data processing is becoming part of everyday life for any organization looking into the future. Big data needs clusters of computers to store, manage, and analyze it, due to both space and computational requirements. Hadoop (map-reduce) was the first system to manage large data sets and process them at scale. It is fair to anoint Hadoop as the first BIG DATA system. Then came Apache Spark, Apache Flink, Apache Beam, and Apache Storm, four of the most successful data processing systems. Many more systems followed, but these were the most successful ones.

Before all these systems came…

Supun Kamburugamuve

High performance data analytics
