Extracting Information from Job Statistics

(Ann-Marie Pendrill) for HPDR, Sept 1996

(This text is intended as explanation for the figures provided on separate pages)

The job statistics from a computing center contain enormous amounts of information. The data can easily be used to produce graphs with overwhelming detail, without providing further insight. On the other hand, quoting only a few simple parameters leaves both systems people and users with a feeling that "This is not really how it is". Would it be possible to find a limited set of parameters that can summarize the job statistics in a satisfactory way?

Obviously, the total load on the system should be monitored. However, the desire to maximize utilization must be balanced against the users' need to get acceptable turnaround times. The "turnaround ratio" between the time from submission to completion and the time to run a job on a single processor seems to be a reasonable quantity to provide. For a parallel system, such as the SP2 Strindberg at PDC, we can write the ratio as

[(queue time)+(connect time)] / [(connect time)*(number of nodes)]
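As a minimal sketch of the ratio above, the following Python function evaluates it for one job. The function and argument names are illustrative assumptions, not taken from the PDC accounting system; all times are assumed to be in the same unit (minutes here).

```python
def turnaround_ratio(queue_time, connect_time, nodes):
    """Turnaround time divided by the estimated single-processor run time.

    Assumes perfect parallelization, so the single-processor time is
    taken to be connect_time * nodes.
    """
    if connect_time <= 0 or nodes <= 0:
        raise ValueError("connect_time and nodes must be positive")
    return (queue_time + connect_time) / (connect_time * nodes)


# Hypothetical job: 30 min in queue, then 60 min on 4 nodes.
example = turnaround_ratio(30, 60, 4)  # (30 + 60) / (60 * 4) = 0.375
```

A job that starts immediately on a single node gets a ratio of exactly 1 with this definition.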

The first figure shows this ratio versus the "connect time" for all jobs run in the batch queue at PDC during August 1996. Each dot in the figure represents one individual job. This graph is rather hard to interpret. In general, we can expect large ratios for jobs that run for a significantly shorter time than estimated at submission. The second figure therefore shows the turnaround ratio versus requested time instead. To limit the complexity of the plot, only jobs requesting 4 nodes and less than 240 minutes (the limit for daytime runs) were included, a total of 232 jobs. Even so, the situation remains rather complex. The last figure on p. 1 shows the distribution of queueing times for all jobs requesting 4 nodes: about C=65% of the jobs never waited at all, and only 15% of the jobs waited more than 100 minutes. Two attempts at parametrizing the data are shown, with

Wq(t)=(1-0.20*exp(-t/90)-0.15*exp(-t/240))

giving a reasonably good fit.
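The fitted expression can be checked numerically against the percentages quoted in the text. The sketch below reads Wq(t) as the fraction of 4-node jobs whose queue time was at most t minutes; that reading is an interpretation, and the function name `wq` is chosen here for illustration.

```python
import math


def wq(t):
    """Fitted fraction of jobs with queue time <= t (t in minutes)."""
    return 1 - 0.20 * math.exp(-t / 90) - 0.15 * math.exp(-t / 240)


no_wait = wq(0)          # 1 - 0.20 - 0.15 = 0.65, the jobs that never waited
long_wait = 1 - wq(100)  # about 0.16, the jobs waiting more than 100 minutes
```

Under this reading the fit reproduces the 65% of jobs that never waited and gives roughly 16% waiting more than 100 minutes, close to the 15% read off the figure.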

Distributions of connect times, requested times and queueing times

(Second page of plots)

The second page of plots shows the distributions of connect (run) times, requested (max) times and queueing times. Here, connect times and max times were multiplied by the number of nodes to show the total capacity used. The first plot shows the distribution for all jobs. (Each distribution has been sorted separately by increasing time, so the same position in two sorted lists does not correspond to the same job.) The next two plots separate the jobs running for more than 10 minutes from those running for less. The shorter jobs have a queueing time comparable to the running time. In general, these jobs can be assumed to have "failed" in some way, since there are also interactive queues for short jobs.

The last plot on the second page shows the situation for jobs requesting 16 nodes and running for more than 10 minutes. Of these jobs, about half go into the computer without waiting at all.
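The preparation of these distributions can be sketched as follows. The record layout (queue minutes, connect minutes, requested minutes, nodes) and all function names are assumptions made for illustration; the actual PDC accounting format is not described here.

```python
def sorted_distributions(jobs):
    """Return independently sorted lists for plotting.

    jobs: list of (queue_min, connect_min, requested_min, nodes) tuples.
    Connect and requested times are scaled by the node count to give
    node-minutes. Each list is sorted on its own, so index i in one
    list need not refer to the same job as index i in another.
    """
    connect = sorted(c * n for _, c, _, n in jobs)
    requested = sorted(r * n for _, _, r, n in jobs)
    queue = sorted(q for q, _, _, _ in jobs)
    return connect, requested, queue


def split_by_run_time(jobs, limit_min=10):
    """Separate jobs running for more than limit_min from the rest."""
    longer = [j for j in jobs if j[1] > limit_min]
    shorter = [j for j in jobs if j[1] <= limit_min]
    return longer, shorter


def fraction_without_wait(jobs):
    """Fraction of jobs with zero queue time."""
    return sum(1 for j in jobs if j[0] == 0) / len(jobs)


# Hypothetical records, not PDC data.
jobs = [(0, 120, 240, 16), (45, 5, 60, 4), (0, 30, 60, 16), (300, 200, 240, 16)]
longer, shorter = split_by_run_time(jobs)
connect, requested, queue = sorted_distributions(longer)
```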

Distributions of turnaround ratios

(Third page of plots)

One possibility to simplify the data would be a direct evaluation of the mean of the turnaround ratio, as defined above. However, this puts unreasonably large emphasis on short jobs. (From analytical queueing theory, one finds, in fact, that the expectation value of the turnaround ratio is divergent for a FCFS (first-come, first-served) queueing system, due to the large ratios for infinitesimal jobs.) A possible alternative is to quote the ratio of the average values, rather than the average of the ratios. As an example of the difference, I quote results from the jobs during a week in August (Week 34) at NSC (the "t15hm12" queue). The ratio between the average CPU time and the average turnaround time was 0.8, whereas the average turnaround ratio was 0.6 and the average inverse turnaround ratio was 31. For the 40 shortest jobs in this queue, the difference is even more striking: the ratio between the averages is 0.45, the average turnaround ratio is 0.4 and the average inverse ratio is 55.
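The difference between the three summary numbers can be illustrated with a small sketch. It follows the orientation of the NSC figures (CPU time divided by turnaround time, so that values lie at or below 1); the job data and function names are hypothetical and chosen only to show how one short job that waited long dominates the inverse ratio.

```python
from statistics import mean


def ratio_of_averages(cpu, turnaround):
    """Average CPU time divided by average turnaround time."""
    return mean(cpu) / mean(turnaround)


def average_of_ratios(cpu, turnaround):
    """Mean of the per-job ratios CPU time / turnaround time."""
    return mean(c / t for c, t in zip(cpu, turnaround))


def average_inverse_ratio(cpu, turnaround):
    """Mean of the per-job ratios turnaround time / CPU time."""
    return mean(t / c for c, t in zip(cpu, turnaround))


# Hypothetical jobs (minutes): a 1-minute job that spent 49 minutes in
# the queue, and a 100-minute job that started at once.
cpu = [1, 100]
turnaround = [50, 100]

r_avg = ratio_of_averages(cpu, turnaround)      # 50.5 / 75, about 0.67
avg_r = average_of_ratios(cpu, turnaround)      # (0.02 + 1) / 2 = 0.51
avg_inv = average_inverse_ratio(cpu, turnaround)  # (50 + 1) / 2 = 25.5
```

The ratio of averages weights each job by its size, whereas the average inverse ratio is controlled almost entirely by the short job.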

The third page of plots shows the distribution of turnaround ratios for the same sets of jobs as on the second page. A ratio of 1 corresponds to immediate running on one node. A smaller ratio indicates a speedup (assuming perfect parallelization). We see that of the jobs requesting at least 10 minutes, 80% got a ratio less than 1, and about 70% a ratio less than 0.5. For the larger jobs, running on 16 nodes, a large fraction obtained a factor of 1/16, and 75% a factor of 1/5 or better. These jobs had an average time in queue of 393 minutes (with σq=762 min) and an average "connect" time of 227 min (σc=274 min), where σ denotes the standard deviation. (We note that for a random (exponential) distribution σ=<T>, and for an "M/M/c" queueing system σq/<Tq> = ((2-C)/C)^(1/2), where C is the probability of finding all servers occupied. See also my notes on queueing theory.)
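For an M/M/c queue the ratio of the standard deviation of the queueing time to its mean is ((2-C)/C)^(1/2), with C the probability of finding all servers occupied. As a rough sketch, this relation can be inverted to see which C would correspond to the observed numbers for the 16-node jobs. Treating the real batch system as an M/M/c queue is an assumption made only for this comparison, and the function names are illustrative.

```python
import math


def wait_cv(c_busy):
    """Std. deviation over mean of the queueing time in an M/M/c queue.

    c_busy is the probability C that an arriving job finds all
    servers occupied and has to wait.
    """
    if not 0 < c_busy <= 1:
        raise ValueError("C must lie in (0, 1]")
    return math.sqrt((2 - c_busy) / c_busy)


def busy_probability_from_cv(cv):
    """Invert cv**2 = (2 - C) / C, which gives C = 2 / (1 + cv**2)."""
    return 2 / (1 + cv ** 2)


# Values quoted in the text for the 16-node jobs (minutes).
cv_observed = 762 / 393
c_implied = busy_probability_from_cv(cv_observed)  # about 0.42
```

The implied value of about 0.42 is of the same order as the observation on the second page that about half of these jobs started without waiting.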


Addition, 20 Sept 1996: Miscellaneous parameters from PDC for August (only a limited number of requested nodes have been studied).
Comparison of queuing times and probabilities to wait less than 1, 2, 3, minutes


Ann-Marie.Pendrill@fy.chalmers.se, Sept 1996
http://fy.chalmers.se/~f3aamp/hpdr/stat.html