TURNAROUND TIMES AT A SUPERCOMPUTING CENTRE

Ann-Marie Mårtensson-Pendrill
Dept. of Physics
S-412 96 Gothenburg, Sweden
Int. J. Supercomputing Applications and High Performance Computing, 9:4, pp. 312-314 (1995)

ABSTRACT

"Computer-dynamic laws" are suggested that severely affect the turnaround time at a computing facility available to a large group of users. A dynamical equilibrium may be established, whereby users migrate to whatever computer available to them which gives them the best possible turnaround time, as long as it continues to do so. Unless actions are taken to prevent this equilibrium, the effective performance of a Supercomputing Centre may thus be limited to the performance of alternative computers available.

A national Supercomputing Centre is expected to enable users with large-scale computing needs to perform tasks that would otherwise be out of reach. To this end, the Centre provides computers with much better performance than would otherwise be available to the user. The actual processing speed of the computer is given priority, even if this comes at the cost of a reduction in total capacity, since supercomputers are an expensive technology (see e.g. Saini and Bailey, 1995). However, due to the mechanisms discussed below, the better turnaround may, in fact, not be achieved, in spite of the sacrifice of total capacity.

Of interest to the user is, of course, not primarily the processing speed but the actual time from submission of a job until its completion. The ratio between the turnaround time and the CPU time used depends on several factors, including the number of active users and the total load on the system, in a way that seems difficult to predict. Some insight into this problem may, however, be gained by studying statistics from existing Centres.

Available statistics from the Swedish National Supercomputer Centre (NSC), which may be typical, show that the ratio between the turnaround time and the CPU time is around 6 ± 2, remarkably independent of job size (Mats S. Andersson, msa@nsc.liu.se, private communication). The error bar reflects the spread between different queues, whereas the ratios for individual jobs, of course, vary much more. Obviously, the system is overloaded. (More funds are now being made available for Swedish High-Performance Computing through the Council for High Performance Computing (HPDR).) A more interesting observation is that the resulting turnaround time is comparable to the time that would be required to run a job (when possible) on a single-user HP-9000/735 workstation (Dongarra 1995, Koski et al. 1995), which is among the most powerful workstations available to individual Swedish research groups. This could, of course, be a mere coincidence. On the other hand, it may indicate underlying mechanisms, which can be summarized in two "computer-dynamic laws" (similar to the laws of thermodynamics).
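
Before turning to these laws, the arithmetic behind the coincidence can be made concrete with a small sketch in Python. The numbers are purely illustrative and not taken from the NSC statistics: the queue expansion factor is the ratio of about 6 quoted above, and the supercomputer-to-workstation speed ratio is an assumption chosen to match the observation that the two elapsed times are comparable.

# Illustrative only: QUEUE_EXPANSION is the observed turnaround/CPU-time ratio
# quoted above; SPEED_RATIO is an assumed supercomputer-to-workstation speed ratio.
QUEUE_EXPANSION = 6.0
SPEED_RATIO = 6.0

def turnaround_at_centre(cpu_hours: float) -> float:
    """Expected elapsed time (hours) for a job needing cpu_hours of CPU at the loaded centre."""
    return QUEUE_EXPANSION * cpu_hours

def turnaround_on_workstation(cpu_hours: float) -> float:
    """Elapsed time (hours) for the same job on a dedicated single-user workstation."""
    return SPEED_RATIO * cpu_hours

for cpu in (1.0, 10.0, 100.0):
    print(f"{cpu:6.1f} h CPU: centre turnaround ~{turnaround_at_centre(cpu):6.1f} h, "
          f"workstation elapsed ~{turnaround_on_workstation(cpu):6.1f} h")

Whenever the queue expansion factor matches the speed advantage over the alternative machine, the two elapsed times coincide, which is precisely the coincidence noted above.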

Increasing Entropy:

Each computer available to a large group of users is in a dynamic equilibrium with other computers available to the users.

The equilibrium is established if a sufficient part of the user community has a choice of several options for running their codes. These users are then likely to migrate to whichever available computer gives them the best turnaround time - but only as long as it continues to do so. If the goal were to establish a national "Metacomputer", this equilibrium would be highly desirable, ensuring that each submitted job would be run on whatever available hardware could be expected to finish it in the shortest possible time after submission. To achieve such an equilibrium automatically would demand quite detailed knowledge about how well each job would perform on the various architectures available. An equilibrium involving an MPP, for example, would depend very much on the degree of parallelism of the different jobs. The equilibrium observed here, however, arises solely through the choices of each individual user, who will have acquired at least some knowledge about the performance of the relevant program on the various architectures on which it has been used. The willingness of a user to migrate depends on several factors, including an adequate interactive environment - program development and testing place particularly strong demands on the response time for interactive work. For some applications, migration is effectively ruled out: the software may be strongly biased toward a particular architecture, or the code may require too much memory or storage to fit in a workstation environment. For many applications, migration is a slow process, whereas for some jobs a choice can be made at the time of submission.
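
As a toy illustration of this migration mechanism (a minimal sketch in Python, not part of the original analysis; the Machine class, the greedy submission rule and all numbers are assumptions made here), mobile users submit each job to whichever machine currently promises the shortest turnaround:

# Toy model: each machine has a relative speed and a backlog of queued work.
class Machine:
    def __init__(self, name: str, speed: float):
        self.name = name        # label only
        self.speed = speed      # work units processed per time step
        self.backlog = 0.0      # queued work, in work units

    def turnaround(self, job: float) -> float:
        """Expected wall-clock time (in steps) for a new job of the given size."""
        return (self.backlog + job) / self.speed

def simulate(machines, steps=200, jobs_per_step=8, job_size=1.0):
    for _ in range(steps):
        # Mobile users pick the machine with the best expected turnaround.
        for _ in range(jobs_per_step):
            best = min(machines, key=lambda m: m.turnaround(job_size))
            best.backlog += job_size
        # Each machine works off part of its backlog.
        for m in machines:
            m.backlog = max(0.0, m.backlog - m.speed)
    return machines

supercomputer = Machine("supercomputer", speed=6.0)   # six times faster, but shared
workstation = Machine("workstation", speed=1.0)       # slower, but lightly loaded
for m in simulate([supercomputer, workstation]):
    print(f"{m.name:13s}: expected turnaround ~{m.turnaround(1.0):5.1f} steps")

With the assumed demand (8 work units per step) slightly exceeding the combined capacity (7 units per step), the backlogs grow and the greedy choice keeps the two expected turnaround times nearly equal, anticipating the conservation law stated below.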

This migration of users ensures that the time required on a workstation is an upper limit on the turnaround time at a supercomputer centre. If uncontrolled, however, the migration may drive turnaround times at the supercomputer all the way to this limit, making them similar to those at a workstation, which, in general, was not the intention when the centre was established. Further, the equilibrium may make average turnaround times at a supercomputer centre independent of how many processors are assigned to each job. We can express these pessimistic predictions in terms of a conservation law:

Conservation of Elapsed Time:

In equilibrium, the turnaround time at a facility is comparable to and limited by the turnaround time at alternative computers available to a significant fraction of the users.

The laws suggested here would have serious consequences for the effective performance of a computing centre, unless actions are taken to prevent the establishment of an equilibrium. This is a challenge to anyone running such a Centre, and only a few possible actions are discussed here.

Probably the most efficient tool would be a demonic Program Committee, acting as a "Maxwell demon". (Maxwell devised his demon as a way to manipulate molecular velocity distributions and decrease entropy, thereby violating the second law of thermodynamics. The demon would inspect the molecules and allow only the fast ones to enter a region, whereas only slow molecules would be allowed to leave it.)

In addition to the obvious criteria for reviewing applications for computing time - such as scientific quality, the need for fast computing and/or large memory and storage, and suitability for the computer architecture at hand - program committees may thus want to add the applicant's access to adequate alternatives as an important factor to consider. (A few users without fast alternatives can, of course, be accepted without serious consequences for the equilibrium.)

Groups with access to computers of their own, or shared with only a few others, can establish a local equilibrium at a higher performance level than elsewhere. Directing computing funding to specific groups (always popular among the researchers themselves) may therefore be very effective.

This short note is written in the hope that it will stimulate further investigations into the nature of the dynamical equilibria established at various Supercomputing Centres. Such investigations could also provide a basis for shared insights into viable strategies for program committees. However, even in the absence of data from other centres, the mechanisms suggested here lead to the conclusion that balanced funding of computers at various levels of performance is crucial if peak facilities are to provide peak performance!

Acknowledgments

I would like to express my thanks to Mats S Andersson, NSC, for providing and clarifying the data that sparked the ideas presented here. These ideas grew out of discussions in which several people were very patient in listening to still inarticulate thoughts, and I would like to thank, in particular, Hans Wallberg, Leslie Pendrill, Axel Ruhe, Gustaf Söderlind and Anders Ynnerman for helpful comments and stimulating discussions. Superimposed on these discussions, I have also recalled Pat Sandars' recurring disbelief in coincidence as an explanation for similar numbers.

Dongarra, J.J. 1995.
Performance of Various Computers Using Standard Linear Equations Software, available e.g. at ftp://netlib2.cs.utk.edu/benchmark/performance.ps
Koski, K., Saarinen, S., and Serimaa, O. 1995.
"Supercomputer Benchmarking", CSC Research Reports R01/95, CSC, Helsinki, Finland
NSC, 1994.
National Supercomputer Centre at Linköping University, Biannual activity report July 1992-June 1994. The Swedish National Supercomputer Centre was founded in 1983 and at present runs production jobs on a five-processor Cray Y-MP. The system runs mostly in single-processor mode, in order to maximize the total throughput. About 75% of the capacity is available to the academic community, with about 50 projects including 130 users currently active.
Saini, S. and Bailey, D.H. 1995.
NAS Parallel Benchmark Results 3-95, available at links found in http://www.nas.nasa.gov/NAS/NPB/

