Fair Que

Introduction to the scheduling system 'fair que'

This document is intented to give an overview of the 'fair que' scheduler. The ideas behind the design are presented and motivated, and the current implementation is discussed. The author assumes that you are familiar with basic UNIX concepts such as commands and files.

Motivation

The capacity of the current computer hardware is large but not unlimited, so not everyone can run their favourite programs at the same time. The main hardware resources that the machine provides to the users are processors, memory and local disk space. Different programs (called 'jobs' from now on) can have very different needs; e.g. a serial job may be able to run with very little memory, whereas a large parallel jobs may need several GB of memory. Also the execution time varies a lot, from a few minutes to several weeks. Because the computer hardware is expensive, we want to keep all available resources busy most of the time. However we must avoid overloading the machine, because that can reduce the performance considerably for all users.

Finding the best schedule, given a mix of jobs to run, is a tough problem. To start with, you must be able to foresee the future, so that you know how long each job will run. Also each user has his own opinion about the "best schedule", i. e. the schedule that allowes his jobs to run first. Third, new jobs enter the job pool continuously, so the schedule has to be dynamic.

Design

Because the scheduler itself cannot predict the future, the design of 'fair que' is based on the assumption that the user can at least estimate how long his or her job will run. Also the number of processors, the amount of memory and the local disk quota needed must be specified when the job is submitted. The scheduler maintains a time ordered list of events which make up the current schedule. Events are

configuration changes (adding or removing processors and/or memory, disk space)
job start events
job end events (estimated from the specified execution time)
accounting summary dates

Each job start event removes the reserved resources from the pool of free resources. Job end events releases them so that other waiting jobs can acquire them and become runnable.

The scheduler is run periodically. Each time it runs the currently running jobs are added to an empty schedule. Then the list of waiting jobs is traversed in priority order. In this process each waiting job is inserted in the schedule as early as possible, subject to the constraint that the available resources are never over utilized. Note that a small low priority job may be started before higher priority jobs if there happen to be a gap in the schedule that is large enough for the small job, but not so large that the other jobs can run. In this context 'small' means short execution time and/or low resource requirements.

Priority model

As stated above, waiting jobs are considered for execution in priority order, but we never defined how we should sort the jobs. The simplest model (and the one chosen at PDC in Stockholm) is to sort the jobs according to submit dates. However this model has some shortcomings in our environment. KTB (konsortiet för tunga beräkningar) is a collaboration between several departments at CTH, and each department should be guaranteed a fair share of the machine. Also within each department it might be desirable to limit how much cpu time each user can use, to prevent a single user from "stealing" the machine from other group members.

The priority model which was designed and adopted by 'fair que' is hierarchical. A typical user name (called account from now on) is composed of the department name, an optional group name and the UNIX user name. The account name /fy/tfy/urban for example says that urban belongs to the tfy group, and tfy is a member of the fy (physics) department. Each group and each user is assigned a portion of the available resources (cpu hours, memory MB hours and disk MB hours) and as long as he/she/the group has not used up these resources during the current accounting period, the user or the group is said to have high priority. Otherwise the group/user has low priority. Note that a high priority user may belong to a low priority group, and vice versa.

The procedure for finding the highest priority waiting job is as follows:

Each user "votes" for one of his jobs (if he has any.) The one chosen is the first submitted one.
Each group finds out which of its member users have high priority. Among the votes the job with the earliest submit date is chosen. If there are no prioritized votes, votes from low priority users are considered.
The process (2) above is repeated at the department level. This means that jobs belonging to high priority groups will be scheduled for execution before jobs from low priority groups.
Finally, the process (3) above is repeated at the highest level (the "root" level.) Now departments are compared to each other. Jobs belonging to users belonging to groups belonging to high priority departments have precedence. (I guess a figure could help here.)

This procedure is repeated until there are no waiting jobs left to schedule.

Accounting system

'Fair que' keeps track of the resource usage for all users during the current accounting period. Finished jobs as well as currently running jobs are included. There are two purposes of the accounting system: 1) to allocate a fair share of the machine to each user/group and 2) register statistics which is used later when Lars Andersson calculates how much each group should pay for their share. It is estimated that the cost per cpu hour will be 1 to 2 SEK. The cost will probably be based on memory and disk consumption too.

Currently each accounting period lasts for one month. It is not possible to transfer unused cpu (mem, disk) hours between different accounting periods.

User commands

The following user commands are described in their manual pages:

qsub: This command is used to submit a new job.
qdel: This command is used to remove or cancel a job.
qshow: This command shows various information about jobs and users.

Notes:

The priority field in the 'qshow' output can only be fully understood if you have read and digested the "priority model" section above. qshow outputs the full account name of the job written with mixed upper and lower case, for example "/FY/tfy/TFYLB". User and group names written with capital letters have high priority, whereas lower case letters are used for users/groups with low priority. In the example the user tfylb is within his limits. Also the physics department has not yet used up its share. tfylb has however, together with other members of the 'tfy' group, used up tfy's share, so jobs belonging to other physics groups have precedence.