Running all jobs at the same time, even when there is enough physical memory to support it, is not a good way to use the SGI servers because of their distributed memory system. It also makes the average running time longer; worse, running times are not additive: executing two jobs "simultaneously" on the same CPU takes more time than executing the same jobs one after the other.
The Physics Computer undertook to introduce a scheduling system on the SGI system. The goal was to schedule both large-scale parallel jobs and serial jobs efficiently without overloading the machine; the scheduler should also allocate the available resources fairly between different users and groups. We evaluated several alternatives: NQE (provided by SGI), LSF (another commercial queueing system), QUE (the queueing system used on the SUN stack) and a home-grown solution. It became obvious that none of the existing products fulfilled our needs without extensive modifications, so Urban Engberg (who wrote QUE) and Lennart Bengtsson designed the scheduling system 'fair que'.
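The core idea of fair-share scheduling can be illustrated with a small sketch. This is a hypothetical simplification, not the actual 'fair que' implementation: each user has a share weight and an accumulated CPU-time usage, and the next job is always taken from the user whose usage-to-share ratio is lowest.

```python
from collections import deque

class FairQueue:
    """Toy fair-share scheduler (illustrative only; names are invented)."""

    def __init__(self):
        self.queues = {}   # user -> deque of pending jobs
        self.usage = {}    # user -> accumulated CPU seconds
        self.share = {}    # user -> share weight

    def submit(self, user, job, share=1.0):
        self.queues.setdefault(user, deque()).append(job)
        self.usage.setdefault(user, 0.0)
        self.share[user] = share

    def next_job(self):
        # Pick the user with the lowest usage/share ratio who has work queued.
        candidates = [u for u, q in self.queues.items() if q]
        if not candidates:
            return None
        user = min(candidates, key=lambda u: self.usage[u] / self.share[u])
        return user, self.queues[user].popleft()

    def account(self, user, cpu_seconds):
        # Charge consumed CPU time back to the user after a job runs.
        self.usage[user] += cpu_seconds
```

With two users submitting jobs, the scheduler alternates between them as their charged CPU time grows, rather than draining one user's queue first. A real scheduler would additionally age usage over time and account per group as well as per user.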
More information about 'fair que'
Manual pages:
UNICC/HPCC traditionally does not provide local home directories for its users. This means that the scratch disk /stor on unicorn may only be used to store data belonging to currently running jobs. When a job has finished, the disk space it used should be freed at once, so that the next job does not fail because of a full disk. The scheduling system supports disk space reservation, but there is no quota system running yet that can enforce disk usage limits. Even with disk quotas turned on, files do not disappear by themselves; only the user can decide which files are safe to remove.
Lennart Bengtsson/Andy Polyakov
Last updated 1998-03-05