The Three Core Concepts Behind Data Frameworks

Supun Kamburugamuve
Feb 6, 2023


The field of data science has become increasingly complex, with a variety of frameworks and programming APIs emerging to solve ever more complex problems. This has resulted in a plethora of terms and jargon, such as batch and streaming data, machine learning, data lakes, deep learning, and SQL, which can be overwhelming for those new to the field.

Twenty to thirty years ago, when data science was in its infancy, developers mostly wrote code from scratch to solve data problems. This was due to the small size of data, the limited complexity of problems, or a lack of available frameworks and libraries. Today, the situation has changed significantly, and data science has evolved beyond what can be achieved by writing code from scratch. Data volumes have grown dramatically, the problems are ever more complex, and solutions must be developed quickly. It is now necessary to use frameworks and libraries, developed by experts, to solve data science problems efficiently and effectively in a reasonable amount of time.

A little bit of history

Before the advent of big data, traditional transactional databases, such as Oracle DB, MySQL, and Microsoft SQL Server, were used to store most of our data. These relational databases were highly successful and still store some of our most critical data to this day.

However, with the growth of big data, characterized by increased volume, velocity, and complexity of data, traditional transactional databases became insufficient. They were unable to scale due to the strict constraints imposed by the ACID model.

As a result, new storage systems were created to accommodate large amounts of data, with more flexible constraints and innovative ways to manage the trade-off between reliability and scalability. Databases such as MongoDB, Cassandra, HBase, and Hive emerged to store structured and semi-structured data, supported by large-scale raw storage systems such as Hadoop File System (HDFS), Amazon S3, and Azure Blob Storage.

The storage revolution, in turn, drove the need for data processing frameworks that could access the stored data. These frameworks introduced parallel programming models, such as MapReduce, which were later improved upon for data querying in Apache Spark and other similar frameworks.

Data science in three steps

Data science is a wide-ranging term that encompasses many different aspects of working with data. If we take a simplified view of data science, we can describe it in three main areas:

  1. Storage: This involves storing data in a suitable format and location for analysis.
  2. Queries: This involves retrieving data from storage and processing it to answer specific questions.
  3. Learning: This involves using algorithms and models to extract insights and information from the data.

In some cases, simply querying the data may be enough to gain insights. However, in many other cases, more advanced techniques such as machine learning and deep learning are needed to generate meaningful and actionable insights from the data.

Each of these areas can be broken down into numerous sub-steps, depending on the data and the requirements of the applications. For example, data may be streaming in and need to be cleaned before storage, which requires sub-steps such as stream data processing.

The process of converting stored data into the format required by a machine learning algorithm, either deep learning or traditional, is referred to as pre-processing. This involves extracting relevant data, verifying its accuracy, and transforming it to fit the algorithm’s requirements. Pre-processing is a time-consuming process that involves a significant amount of disk and network I/O (input-output), as well as computing.
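As a minimal sketch of this step, the snippet below cleans and transforms a hypothetical events.csv file into numeric features using pandas and scikit-learn; the file name, column names, and choice of transformations are illustrative assumptions, not a prescribed pipeline.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical raw data; in practice this would come from a storage system.
    raw = pd.read_csv("events.csv")  # assumed columns: age, country, clicked

    # Extract the relevant columns and verify/clean them.
    data = raw[["age", "country", "clicked"]].dropna()
    data = data[data["age"].between(0, 120)]  # drop obviously invalid rows

    # Transform into the numeric format most learning algorithms expect.
    features = pd.get_dummies(data[["age", "country"]], columns=["country"])
    features["age"] = StandardScaler().fit_transform(features[["age"]]).ravel()
    labels = data["clicked"].astype(int)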

After the pre-processing stage, we can run the machine learning algorithms on the processed data to create models or derive immediate insights. These algorithms can be demanding in terms of computing power and network I/O, depending on their computational complexity and the size of the data. The processing time can range from a few seconds to several days.
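Continuing the hypothetical example above, the learning step could be as small as fitting a scikit-learn model on the pre-processed features; the model choice and the train/test split are again only illustrative assumptions.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Split the pre-processed data and fit a simple model.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))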

The last step is called post-processing, where we use techniques like validation, visualization, and deployment to bring the findings to the users.

Data science workflows: When all these steps, from storage to learning and deployment, are combined to create an end-to-end application, we call it a data science workflow. There are specific tools available to facilitate this. Traditionally they have been called workflow engines, but with modern requirements they go by different names, such as dataflow systems or task systems.
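In its simplest form, such a workflow is just an ordered composition of the steps above. The sketch below chains hypothetical pre-process, train, and deploy functions in plain Python; a real workflow engine would express the same dependencies as a graph of tasks and add scheduling, retries, and monitoring around them.

    # A minimal, illustrative workflow: each step consumes the previous step's output.
    def preprocess(raw_path: str) -> dict:
        # Placeholder: read, clean, and transform the raw data.
        return {"features": [[0.1, 1.0], [0.3, 0.0]], "labels": [1, 0]}

    def train(dataset: dict) -> str:
        # Placeholder: fit a model on the pre-processed features.
        return f"model trained on {len(dataset['labels'])} rows"

    def deploy(model: str) -> None:
        # Placeholder: validate, visualize, and publish the results.
        print("deploying:", model)

    def run_workflow(raw_path: str) -> None:
        deploy(train(preprocess(raw_path)))

    run_workflow("events.csv")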

Data and Frameworks

There are many data frameworks available for data processing at the different stages of a data science workflow. As users, we are interested in two main aspects of a framework:

  1. Deploying and configuring the system
  2. Writing our data processing applications

There are SaaS frameworks with fully managed infrastructure in the cloud, as well as products that we need to deploy in our own clusters. Most systems are available both as a standalone product and as a managed cloud product offered by a vendor. Frameworks intended for smaller data volumes operate on a single computer, while those designed for larger-scale data processing can utilize multiple computers. Features such as fault tolerance, security, and monitoring are present in every data framework and need to be configured and used according to application and organizational requirements.

Data processing applications can be written using general-purpose programming languages such as Python, Java, or C++, as well as languages such as SQL. SQL is widely used in the query steps, while deep learning and machine learning algorithms are written in low-level languages such as C++ and configured using high-level languages like Python.
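For instance, the query step can often be expressed in a few lines of SQL. The sketch below runs a query against an in-memory SQLite table using Python's standard library; the table and values are made up purely to illustrate SQL in the query step.

    import sqlite3

    # An in-memory table standing in for data retrieved from a storage system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (country TEXT, clicked INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [("US", 1), ("US", 0), ("LK", 1)])

    # The query step: answer a specific question about the data with SQL.
    for country, rate in conn.execute(
            "SELECT country, AVG(clicked) FROM events GROUP BY country"):
        print(country, rate)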

The primary function of any data framework is to deploy the application written by the user (using a given programming API) and map its execution to resources such as CPU cores, GPUs, networks, and disks.

As we can see, there are many dimensions to a data processing system. Within all this complexity, there are three core concepts that truly define what a data processing system is capable of and how efficient it is.

The Three Core Concepts

The main characteristics of a framework for working with data are defined by its programming model, data types, and operators. These three key concepts have an intricate relationship with each other and cannot be discussed in isolation.

Data types

Defines the types of data that a framework is designed to handle. Different domains of data analytics require different data types. For example, multi-dimensional matrices (tensors) are the go-to data structure for deep learning applications. Tables are extensively used in structured data querying systems, while graph data structures are used by graph processing algorithms. These domain data types are represented as data structures in computer memory, which determine how the domain types are stored so that they can be manipulated efficiently.
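As a small illustration, the same idea can be seen in code: a tensor for deep learning workloads, a table for structured querying, and a graph (here just an adjacency list) for graph algorithms; the values are invented for the example.

    import numpy as np
    import pandas as pd

    # Tensor: the go-to data type for deep learning workloads.
    tensor = np.arange(24, dtype=np.float32).reshape(2, 3, 4)  # a 3-dimensional tensor

    # Table: the natural data type for structured data querying.
    table = pd.DataFrame({"user": ["a", "b", "c"], "clicks": [3, 5, 1]})

    # Graph: a simple adjacency list; graph frameworks use richer structures.
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}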

Table data structures are represented using either columnar or row-wise data formats. Dense tensors and matrices are stored in row-major or column-major layouts. Sparse matrices are stored using formats like compressed sparse row (CSR) or coordinate list (COO). It is important to note that for the core capabilities of a framework, it is irrelevant how a data type is stored in computer memory. Rather, the storage format determines how efficient a framework will be for a certain type of application.
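The sketch below shows the same small matrix stored row-major and column-major with NumPy, and in CSR and COO sparse formats with SciPy; which layout is best depends entirely on the access pattern of the application.

    import numpy as np
    from scipy.sparse import coo_matrix, csr_matrix

    dense = np.array([[1, 0, 0],
                      [0, 0, 2],
                      [0, 3, 0]])

    # Row-major (C order) vs column-major (Fortran order) layouts of the same values.
    row_major = np.asarray(dense, order="C")
    col_major = np.asarray(dense, order="F")

    # Sparse formats store only the non-zero entries.
    csr = csr_matrix(dense)  # compressed sparse row
    coo = coo_matrix(dense)  # coordinate list
    print(csr.indptr, csr.indices, csr.data)  # the three arrays behind CSR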

Operators

Encapsulates the common patterns used with the data types to support the applications of the domain. For example, relational data processing systems use operators like joins, unions, and filters, which define how to manipulate tables (relations). Tensors require operations like multiplication, addition, and inversion.
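A few of these operators, sketched with pandas for tables and NumPy for matrices; the tables and values are invented for illustration.

    import numpy as np
    import pandas as pd

    # Table operators: join, union, filter.
    users = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]})
    clicks = pd.DataFrame({"id": [1, 1, 2], "clicks": [3, 2, 5]})
    joined = users.merge(clicks, on="id")    # join
    union = pd.concat([users, users])        # union (keeps duplicates)
    filtered = joined[joined["clicks"] > 2]  # filter

    # Tensor/matrix operators: multiplication, addition, inverse.
    a = np.array([[1.0, 2.0], [3.0, 4.0]])
    b = np.eye(2)
    product = a @ b
    total = a + b
    inverse = np.linalg.inv(a)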

Depending on whether a system supports distributed execution, an operator can be distributed or local to the data present on one computer. For example, if table data is spread across multiple computers, we need a join operator that can work across those computers. This means that, underneath, the join operator needs to access the network to redistribute the data and synchronize with the other computers.
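With PySpark, for example, the join below looks like a local operation, but because each DataFrame may be partitioned across many machines, the operator shuffles matching rows over the network behind the scenes; this sketch assumes a local or cluster Spark installation is available, and the data is made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-example").getOrCreate()

    # Each DataFrame may be partitioned across many machines in a real cluster.
    users = spark.createDataFrame([(1, "ann"), (2, "bob")], ["id", "name"])
    clicks = spark.createDataFrame([(1, 3), (2, 5)], ["id", "clicks"])

    # The distributed join brings matching rows to the same machine before joining.
    users.join(clicks, on="id").show()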

Programming models

Defines how the user programs an application given the data types and the operators on top of those data types. A programming model can use different concepts to represent the data and operators. For example, a table can be represented in a programming model in many ways, including as a list of tuples, as a structure of rows, or as a set of columns. A table can be distributed across multiple nodes, and the programming model can hide this detail and present it as a single entity to the user. Another programming model might expose the fact that the table is distributed and allow users to work on the distributed pieces directly.
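Staying with PySpark as an example, the sketch below shows both styles: the DataFrame API presents the distributed table as a single entity, while dropping to the underlying RDD exposes the partitions so the user can work on each piece directly; the data is again made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("programming-models").getOrCreate()
    clicks = spark.createDataFrame([(1, 3), (1, 2), (2, 5)], ["id", "clicks"])

    # Style 1: the distributed table appears to the user as one single entity.
    clicks.groupBy("id").sum("clicks").show()

    # Style 2: the distribution is exposed; work on each partition directly.
    def count_rows(partition):
        yield sum(1 for _ in partition)

    print(clicks.rdd.mapPartitions(count_rows).collect())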

Most data frameworks are designed and optimized for solving problems in a single domain, as defined by the data types, operators, and programming model they support. Some frameworks try to serve other domains even though they were developed to solve problems in one particular domain. In most cases, such attempts are not successful because of the inefficiency of mapping a problem onto unsuitable data types and operators. For example, it is inefficient to use a table data type (data frame) to work on a three-dimensional matrix problem (tensor).

Popular Frameworks

Here are some popular examples among the hundreds of frameworks found in the data science space.

  1. PyTorch: A tensor-based framework for deep learning that provides hundreds of tensor operators to support deep learning algorithms and offers distributed execution capabilities. It presents a local view of the data and operators.
  2. Spark: A data frame-based programming model that presents a distributed table as a single entity to the user.
  3. Pandas: A single-node data frame library for data manipulation. It offers only a data type and a set of operators, as it is not a framework.
  4. NumPy: A library for vector/matrix operations on a single computer, with a focus on arrays. It offers only a data type and a set of operators, as it is not a framework.

Summary

To study and evaluate frameworks designed for data processing, we can use the three core concepts discussed in this article: data types, operators, and programming models. They will guide you to the right framework for the problem at hand by going to the heart of the issue, cutting through the clutter of deployment, execution, and service-level features such as fault tolerance, high availability, and performance.
