Simple solutions that are capable of solving a large number of problems ignite the creativity side of the engineers. Hadoop is a great system, developed around a single operation called map-reduce. It is simple enough to be understood by many people and generic enough to solve many problems. For a long time, people believed that Hadoop can be good at solving many problems with different computations requirements. People wrote machine learning libraries around Hadoop and many research papers were published that either showcased how to get good results from Hadoop or improve Hadoop to achieve better performance. All these efforts are mostly abandoned and people no longer view Hadoop as a computation platform. Now, its use is mostly restricted to the use of HDFS part of it to manage large amounts of data.
All this is not Hadoop’s fault, and I repeat it is a wonderfully engineered system and has been used by many to solve everyday problems. There are two points I would like to address here and they are
- Hadoop design is generic enough to solve many problems
- Hadoop design is too generic to efficiently solve many problems
These two points go hand in hand with the rise of Hadoop and the fall of Hadoop as a computing platform.
Now let’s examine the core of what map-reduce offers. It offers three things. A task called map, a network and disk-based operation called shuffle and another task called reduce. A program starts with the map task. The map task can read data and produce a [Key, Value] output. This output is sent through the shuffle operation and presented to the reduce task. The power of the map-reduce paradigm lies with the shuffle operation. Otherwise, it will be just two tasks called map and reduce and there will be no value added in a distributed environment.
In a distributed environment the network operations are the most important aspect as they define the type of programs possible. In the context of later engines such as Spark or Flink, we can say Hadoop offered two tasks with a fixed shuffle operation between them. With Spark and Flink, one can use different operations between the tasks.
Now let’s look at the semantics of the shuffle operation. Shuffle is an all-to-all operation meaning, data in one parallel task instance can be distributed to all the tasks involved in the computation. In the map-reduce case, one map instance can generate data that can potentially be distributed to all the other tasks. With map-reduce, the distribution is based on a key associated with each tuple.
In parallel computing, several operations are widely used. These are
These operations are sufficient to program a large chunk of the parallel applications. As you can see map-reduce implemented only the AllToAll from these sets of operations. In the map-reduce implementation of AllToAll (shuffle), it is generic enough to act as other operations. Let’s look at how one can simulate these operations with the shuffle operation.
- Reduce — Use the same key from all the map tasks. This will send the values to one reduce task and we can do a reduction of all values
- AllReduce — Used in iterative cases, use as in Reduce and save the value to HDFS, essentially making it available to all the tasks.
- Gather — Use the same key from all the map tasks without a reduce function.
- AllGather — Same as in the AllReduce case. Save the values to HDFS.
- Broadcast — Hadoop provides a broadcast operation. Also one can save the values to HDFS to make them available to every task.
- AllToAll — This is the shuffle operation.
- Scatter — Use the keys to scatter values.
Hadoop developers generalized the shuffle operation using the above workarounds to write rich applications.
Programmers who study parallel computing in HPC environments know that there are separate operations for a reason. Otherwise, they could also use the same AllToAll operation. These operations use different mechanisms to optimize their performance. The way reduce operation is implemented is different from the way AllToAll operation is implemented. Even different implementations are used within a single operation depending on the size of the messages. Using the same operation for all the cases, means we are using an implementation targetted towards one type of data distribution to another data distribution pattern. This makes a huge difference in performance and there is a large amount of research behind this.
Imagine we designed a well-built large truck that can carry stones, people or children. We can use it for anything, but as we know, going to work daily in a truck doesn’t make sense. Hadoop is that large truck for distributed data applications. Hadoop’s shuffle operation is generic enough to be used in many applications but it is too generic to be efficient enough to be used in all the computing problems. This is the reason for its rise and fall as a computation platform. If you are only doing pure shuffle operations Hadoop is still a good solution.