Large-scale data engineering has gone through a remarkable transformation over the past decade. We have seen wide adoption of Big Data frameworks, from Apache Hadoop and Apache Spark to Apache Flink. Today, Artificial Intelligence (AI) and Machine Learning (ML) have further broadened the scope of data engineering, demanding faster and more integrable systems that can operate on both specialized and commodity hardware.
A data science workflow is a complex, interactive process. It starts with data in large data stores. We create structured data sets from this raw data using ETL (Extract, Transform, Load) tools such as Hadoop or Spark…
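To make this step concrete, here is a minimal sketch of such an ETL job using PySpark. The file paths, column names, and schema are hypothetical, chosen only for illustration; this shows the general extract-transform-load pattern, not a prescribed pipeline.

    # Minimal PySpark ETL sketch: read raw records, clean and restructure
    # them, and write a structured data set for downstream analysis.
    # Paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Extract: load raw, loosely structured records from the data store.
    raw = spark.read.csv("hdfs:///data/raw/events.csv",
                         header=True, inferSchema=True)

    # Transform: drop malformed rows and derive a typed, structured schema.
    structured = (
        raw.dropna(subset=["user_id", "timestamp"])
           .withColumn("event_date", F.to_date("timestamp"))
           .select("user_id", "event_date", "event_type")
    )

    # Load: persist the structured data set in a columnar format.
    structured.write.mode("overwrite").parquet("hdfs:///data/structured/events")

    spark.stop()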
Simple solutions that are capable of solving a large number of problems ignite engineers' creativity. Hadoop is a great system, developed around a single operation called map-reduce. It is simple enough to be understood by many people and generic enough to solve many problems. For a long time, people believed that Hadoop could be good at solving many problems with different computational requirements. People wrote machine learning libraries around Hadoop, and many research papers were published that either showcased how to get good results from Hadoop or improved Hadoop to achieve better performance. All these efforts…
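As an illustration of how small that single abstraction is, here is a sketch of the map-reduce model in plain Python using the classic word-count example. Hadoop itself exposes this model through a Java API and distributes the phases across a cluster; the helper names below are ours, not Hadoop's.

    # A sketch of the map-reduce programming model in plain Python.
    # Hadoop runs these phases in parallel across machines; this
    # single-process version only illustrates the abstraction.
    from collections import defaultdict

    def map_phase(documents):
        # Map: emit (key, value) pairs; here, (word, 1) for every word.
        for doc in documents:
            for word in doc.split():
                yield (word, 1)

    def shuffle(pairs):
        # Shuffle: group values by key, as the framework does between
        # the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(groups):
        # Reduce: fold each key's values into a final result.
        for key, values in groups:
            yield (key, sum(values))

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(dict(reduce_phase(shuffle(map_phase(docs)))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

The appeal of the model is visible even in this toy version: the programmer writes only the map and reduce functions, and the framework owns the distribution, grouping, and fault tolerance in between.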
Data processing is becoming part of everyday life for any organization looking into the future. Big data needs clusters of computers to store, manage, and analyze it, due to both space and computational requirements. Hadoop (map-reduce) was the first system to manage large data sets and process them at scale; it is fair to anoint Hadoop as the first big data system. Then came Apache Spark, Apache Flink, Apache Beam, and Apache Storm, four of the most successful data processing systems. Many more systems followed, but these were the most successful.
Before all these systems came…
High performance data analytics