Cluster Resources — Job Scheduling
A computer cluster consists of many compute nodes closely coupled by a network. It runs large-scale parallel applications such as neural network training, scientific simulations, and large-scale data analytics.

The goal of a job scheduling system is to allocate resources to users or applications as efficiently as possible. In a cloud environment, the objective may be to minimize cost while maximizing performance. In a research cluster, one might aim to keep the cluster fully utilized at all times. These differing objectives present unique challenges for the algorithms that schedule workloads on such clusters.
A job scheduling system takes a set of jobs as input and outputs a job schedule, which specifies the order of the jobs and the resources assigned to each. Finding the best schedule in a reasonably complex system with many jobs competing for resources is often challenging. Most algorithms make simplifying assumptions and relax the requirements so they can produce a schedule in a bounded amount of time, as there is little point in searching for the best schedule while the resources sit idle.
A scheduling system comprises a scheduling policy, objective functions, and scheduling algorithms. The scheduling policy states high-level rules to which the system must adhere. The objective function is used to determine a particular job schedule’s effectiveness. The algorithms incorporate the policies and objective functions to create the schedule.
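As a rough sketch of how these three parts fit together (all names below are invented for illustration), a policy can be expressed as an ordering over jobs, an objective function as a score over a schedule, and the algorithm as the piece that applies both:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Job:
    name: str
    priority: int    # policy input: higher-priority jobs run first
    runtime: float   # estimated execution time

# Policy: a high-level rule, expressed here as an ordering over jobs.
def priority_policy(jobs: List[Job]) -> List[Job]:
    return sorted(jobs, key=lambda j: -j.priority)

# Objective function: scores a schedule; lower average wait is better.
def average_wait(schedule: List[Tuple[Job, float]]) -> float:
    return sum(start for _, start in schedule) / len(schedule)

# Algorithm: applies the policy and assigns start times on one resource;
# the objective function can then compare alternative schedules.
def make_schedule(jobs: List[Job]) -> List[Tuple[Job, float]]:
    t, schedule = 0.0, []
    for job in priority_policy(jobs):
        schedule.append((job, t))
        t += job.runtime
    return schedule

plan = make_schedule([Job("a", 1, 4.0), Job("b", 5, 2.0)])
print(average_wait(plan))  # b starts at 0.0, a at 2.0 -> average 1.0
```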
Scheduling Policy
A scheduling policy dictates high-level requirements such as the types of resources accessible to users and the priorities of different applications. For instance, consider a data analytics cluster that runs daily jobs to compute an organization's critical information. The cluster is also shared with data scientists experimenting with their newest machine-learning models. The organization can define a policy stating that the essential jobs must run no matter what state the cluster is in. Under this policy, an experimental job run by a data scientist may be halted to make room for a higher-priority job, or it may have to wait until the higher-priority jobs are complete.
So, a scheduling policy contains a set of rules specifying what actions to take when resource requests exceed the available resources.
Objective Functions
An objective function determines whether a resource schedule is good or bad. Objective functions are created according to the policies specified by the organization. A cluster can run many jobs at any given time, with varying resource requirements (number of processors, nodes, memory) and execution times. At any given time, the free resources of a cluster can be scattered across many racks of nodes.
Throughput and Latency
The throughput and latency of the system are two main metrics used to determine the quality of a scheduling system. Throughput is the number of jobs completed within a given period. Latency is the time it takes to complete a job, measured from submission to completion.
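As a simple illustration, both metrics can be computed from per-job submission and completion timestamps (the dictionary fields below are invented for this sketch):

```python
def throughput(jobs, window_start, window_end):
    """Jobs completed per unit time within [window_start, window_end]."""
    done = [j for j in jobs if window_start <= j["finish"] <= window_end]
    return len(done) / (window_end - window_start)

def average_latency(jobs):
    """Mean time from submission to completion."""
    return sum(j["finish"] - j["submit"] for j in jobs) / len(jobs)

jobs = [
    {"submit": 0, "finish": 4},
    {"submit": 1, "finish": 3},
    {"submit": 2, "finish": 9},
]
print(throughput(jobs, 0, 10))  # 0.3 jobs per time unit
print(average_latency(jobs))    # (4 + 2 + 7) / 3 = 4.33...
```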
Priorities
Algorithms can take the priorities of applications into account when scheduling. Usually, priorities are configured with different job queues, and jobs submitted to higher-priority queues are served first. For example, when combined with gang scheduling, higher-priority jobs can receive more execution time than lower-priority ones.
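A minimal sketch of priority queues using Python's standard heapq module (the submit and next_job helpers are hypothetical, not from any particular scheduler):

```python
import heapq
import itertools

# Model multiple priority queues with a single heap: lower tuples sort
# first, so priorities are negated to serve higher priorities first.
counter = itertools.count()  # tie-breaker preserves submission order

def submit(heap, job_name, priority):
    heapq.heappush(heap, (-priority, next(counter), job_name))

def next_job(heap):
    _, _, job_name = heapq.heappop(heap)
    return job_name

heap = []
submit(heap, "experiment-1", priority=1)
submit(heap, "daily-report", priority=10)
submit(heap, "experiment-2", priority=1)

print(next_job(heap))  # daily-report: served first despite arriving later
print(next_job(heap))  # experiment-1: FIFO order within the same priority
```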
Lowering Distance Among the Processes
A scheduling algorithm can reduce the network distance between the various parts of an application to improve its performance. Slurm uses a Hilbert space-filling curve[5] to order nodes and achieve good locality in 3D space. The curve transforms the multi-dimensional task allocation problem into a one-dimensional one, assigning related tasks to positions close to each other. Reducing the network distance can significantly increase the efficiency of network I/O intensive programs.
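Slurm applies the curve in three dimensions; the two-dimensional sketch below uses the classic coordinate-to-index conversion on a hypothetical 4x4 node grid to illustrate the idea:

```python
def hilbert_index(n, x, y):
    """Map (x, y) on an n-by-n grid (n a power of two) to a 1D
    Hilbert-curve index, so nearby indices are nearby in space."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the nodes of a hypothetical 4x4 grid along the curve; jobs
# placed on consecutive nodes in this order stay physically close.
nodes = [(x, y) for x in range(4) for y in range(4)]
nodes.sort(key=lambda p: hilbert_index(4, *p))
print(nodes[:4])  # [(0, 0), (1, 0), (1, 1), (0, 1)]
```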
Data Locality
Data locality-aware scheduling is typical in big data schedulers like YARN. It applies to clusters where computing and storage live on the same nodes; a prime example is an HDFS cluster. The scheduler tries to minimize the distance between the data and the computation. If the nodes holding the data are available, the application is placed directly on them. If the scheduler cannot run the application on those nodes, it can instead place it on the same racks as the data.
When many jobs run on a cluster, it is impossible to guarantee data locality completely. The data may be spread across more nodes than the application requested, the application may request more nodes than there are nodes holding the data, or other applications may already be running on the nodes with the data.
Data locality was much more critical in the early days of big data computing, when networks were vastly slower than reading from a local hard disk. Nowadays, networks can be much faster, and in cloud systems data is often stored in separate services such as S3. As such, data locality is not as important as it once was.
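A minimal sketch of that preference order, node-local first, then rack-local, then anywhere, with invented structures for free nodes and rack membership:

```python
def place(task_data_nodes, free_nodes, rack_of):
    """Pick a node for a task: data-local first, then rack-local, then any."""
    # 1. Prefer a free node that actually holds the task's data.
    for node in free_nodes:
        if node in task_data_nodes:
            return node
    # 2. Otherwise prefer a free node on the same rack as the data.
    data_racks = {rack_of[n] for n in task_data_nodes}
    for node in free_nodes:
        if rack_of[node] in data_racks:
            return node
    # 3. Fall back to any free node at all.
    return next(iter(free_nodes), None)

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(place({"n1"}, ["n2", "n3"], rack_of))  # n2: rack-local to the data
```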
Completion Deadline
Some jobs need to finish within a given deadline to be useful. To achieve this, the resource scheduler needs the user to provide an approximate completion time, which is usually known from prior experience running the application with a given set of parameters and resources. Resource schedulers do not try to allocate more resources (increase parallelism) to finish a job more quickly, as doing so can decrease performance in some applications.
Scheduling Algorithms
The goal of a scheduling algorithm is to select which jobs to run when demand is higher than the available resources. The algorithm should choose jobs so that the resulting schedule scores well on the defined objective functions. If every job can run immediately, no scheduling algorithm is needed.
There are many scheduling algorithms available in the literature. In practical systems, these algorithms are adapted to support various objective functions according to the applications they support.
FIFO (First in First Out)
This is the most basic scheduler, available in almost all systems. As the name suggests, jobs are executed in the order they are submitted. It is simple to understand and works well when the workload is small relative to the available resources, but it can leave resources idle and lower throughput. We will see an example later with the backfill scheduling algorithm.

Gang Scheduling
Gang scheduling is a technique for running multiple applications on the same resources simultaneously by time-sharing between them. The applications are oversubscribed to the resources and must take turns executing, much like OS threads sharing a CPU core. For gang scheduling to work, every resource that cannot be time-shared must have enough capacity for all the applications placed on it. Temporary hard disk space is one example: if the combined disk requirement of the applications running simultaneously is greater than the disk's capacity, gang scheduling will fail. RAM is another such resource.
Gang scheduling allows a scheduler to improve responsiveness and utilization by letting more jobs begin execution sooner. Shorter jobs can finish without waiting in queues behind longer jobs, increasing the responsiveness of the overall cluster.
Gang scheduling is frequently used in data analytics applications.
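The sketch below illustrates the idea under a simplifying assumption: jobs on one node take turns in fixed time slices, and admission is gated on RAM because it cannot be time-shared (all names and numbers are invented):

```python
def gang_schedule(jobs, ram_capacity, slice_len=1.0):
    """Round-robin time slices among co-located jobs.

    jobs: list of dicts with 'name', 'ram', and 'remaining' time.
    Admission check: RAM is not time-shareable, so the combined
    footprint of co-scheduled jobs must fit in ram_capacity.
    """
    if sum(j["ram"] for j in jobs) > ram_capacity:
        raise ValueError("combined RAM exceeds capacity; gang scheduling fails")
    t = 0.0
    while any(j["remaining"] > 0 for j in jobs):
        for j in jobs:
            if j["remaining"] > 0:
                run = min(slice_len, j["remaining"])
                j["remaining"] -= run
                t += run
                if j["remaining"] == 0:
                    print(f"{j['name']} finished at t={t}")

jobs = [{"name": "short", "ram": 2, "remaining": 2.0},
        {"name": "long", "ram": 4, "remaining": 6.0}]
gang_schedule(jobs, ram_capacity=8)
# "short" finishes at t=3.0 instead of waiting behind "long"
```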
List Scheduling
This is a classic and straightforward scheduling technique: the scheduler simply runs the next job in the list that fits the available resources. Because of its simplicity, it can be implemented efficiently, and it often produces good schedules in practical systems.
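A minimal sketch, assuming jobs declare only a node count:

```python
def list_schedule(queue, free_nodes):
    """Start the first job in the list whose request fits what is free."""
    for job in queue:
        if job["nodes"] <= free_nodes:
            queue.remove(job)
            return job
    return None  # nothing fits; wait for a running job to finish

queue = [{"name": "big", "nodes": 8}, {"name": "small", "nodes": 2}]
print(list_schedule(queue, free_nodes=4))  # picks "small": first that fits
```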
Backfill Scheduling
The backfilling algorithm executes jobs from further back in the queue when it cannot execute the job at the head of the queue because sufficient resources are unavailable. When it picks a job lower in the queue, it makes sure not to postpone the job at the head of the queue. This requires knowledge of each job's execution time.

Consider the difference between FIFO and backfill scheduling with two resources (r) and four jobs. The jobs require the following resource and time pairs: Job 1 {r=1, t=2}, Job 2 {r=1, t=4}, Job 3 {r=2, t=3}, and Job 4 {r=1, t=2}, submitted in order from 1 to 4. If we strictly follow the FIFO schedule, Job 4 must wait until Job 3 is completed. With backfill scheduling, we can start Job 4 as soon as Job 1 completes. Note that Job 3's start time does not change because of this.
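The following sketch simulates both schedules for this example. It reserves a start time only for the job at the head of the queue, in the spirit of EASY backfilling, and is an illustration rather than any production scheduler's implementation:

```python
def reservation_time(running, free, need):
    """Earliest time at which 'need' resources become free, given the
    currently running jobs and the resources already free."""
    avail, when = free, 0.0
    for end, r in sorted(running):
        if avail >= need:
            break
        avail += r
        when = end
    return when

def simulate(jobs, capacity, backfill):
    """Return {job name: start time}. jobs are given in submission order
    as {'name', 'r' (resources), 't' (runtime)}; each fits the capacity."""
    queue, running, starts, t = list(jobs), [], {}, 0.0
    while queue:
        running = [(end, r) for end, r in running if end > t]
        free = capacity - sum(r for _, r in running)
        progress = True
        while progress and queue:
            progress = False
            head = queue[0]
            if head["r"] <= free:
                # The head of the queue starts as soon as it fits.
                starts[head["name"]], progress = t, True
                running.append((t + head["t"], head["r"]))
                free -= head["r"]
                queue.pop(0)
            elif backfill:
                # Reserve a start time for the head, then let a later job
                # jump ahead only if it cannot delay that reservation.
                head_start = reservation_time(running, free, head["r"])
                for job in queue[1:]:
                    if job["r"] <= free and t + job["t"] <= head_start:
                        starts[job["name"]], progress = t, True
                        running.append((t + job["t"], job["r"]))
                        free -= job["r"]
                        queue.remove(job)
                        break
        if queue:
            t = min(end for end, _ in running)  # jump to the next completion
    return starts

jobs = [{"name": "Job1", "r": 1, "t": 2}, {"name": "Job2", "r": 1, "t": 4},
        {"name": "Job3", "r": 2, "t": 3}, {"name": "Job4", "r": 1, "t": 2}]
print(simulate(jobs, capacity=2, backfill=False))
# {'Job1': 0.0, 'Job2': 0.0, 'Job3': 4.0, 'Job4': 7.0}
print(simulate(jobs, capacity=2, backfill=True))
# {'Job1': 0.0, 'Job2': 0.0, 'Job4': 2.0, 'Job3': 4.0}
```

Under FIFO, Job 4 starts only at t=7, after Job 3 completes; with backfilling it slots into the idle resource at t=2, while Job 3 still starts at t=4 in both schedules.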