Darwin latest news

March

24/3-
popcorn has crashed three times in a row with "Multiple fan outage." As the service contract has expired, it will take time to deal with the problem...

2002

November

25-26/11
Due to a power outage both back-end servers, unicorn and popcorn, were rendered unbootable (again:-). After hardware service the servers are back up. I took this opportunity to upgrade the system software on all servers.

June

18-20/6
Due to a thunderstorm and the following power outage on the evening of the 18th, both unicorn and popcorn ran into problems. By the afternoon of the 20th all visible problems seem to be solved (some hardware needs to be replaced, but that does not interfere with normal operation).
unicorn lost 2 CPUs and 4 supervisor units. The CPUs have been replaced.
popcorn lost contact with several hard disk drives and needed a manual restart. One supervisor unit was broken.
The supervisor units will be replaced later this summer.

February

13/2
Now it was unicorn's turn to fall victim to the same kind of attack. Yes, that's what it basically is: an attack! Do monitor your jobs and book resources correctly.
4/2
We had to reboot popcorn to break a memory allocation deadlock. The condition was caused by overbooking [memory] resources. All users are (and always were) urged to book resources correctly.

January

25/1
Power failure throughout the whole campus. The opportunity was taken to upgrade the back-end servers to 6.5.14f.

2001

August

14/8
unicorn is back up and accepting batch jobs. I took the chance to upgrade the OS and the Message Passing Toolkit on it.
12/8
unicorn has crashed because of a hardware fault. The matter is being investigated.

March

22/3
unicorn has crashed with the /hw/module/2/slot/n1/node/cpubus/0/b: Cache Error (Transient scache error) message. After cycling the power and flashing the PROM everything was normal and the machine survived an intensive memory-thrashing test... So it was taken back into the queue...

2000

December

27/12
The OS was upgraded to 6.5.10 and the compilers to 7.3.1.2m; both are bug-fix upgrades. Mail (most notably from qdaemon:-) was erroneously delivered locally on newton over the holidays (meaning that you could read it on newton, e.g. with the mailx command), but it now functions as it used to again.
20/12
unicorn is back up and running. popcorn is scheduled for OS upgrade for this Friday, December 22nd.
19/12
Both back-ends (unicorn and popcorn) were rebooted because of software problems with the automounter. Sadly enough unicorn failed to come up and therefore stays down for hardware maintenance. I'm taking this opportunity to upgrade the OS (which will fix the problem it suffered from in the first place). popcorn (and eventually newton, the front-end machine) will be upgraded later on (next week?).

July

30/7
popcorn has crashed with somewhat spiritual error message. It's up and running now...
11/7
unicorn is "back" but further maintenance is needed. The service is scheduled for Monday, 17th of July.
10/7
The crash was caused by the failure of a power supply unit. unicorn is back now with 56 CPUs and is booked for hardware service tomorrow morning. The latter means that only short jobs are accepted.
7/7
unicorn is unreachable, presumably crashed. Unfortunately we won't have access to the computer room till Monday and therefore won't be able to provide details until then. The queueing system runs on popcorn only.

June

20/6
popcorn is up and running again. One switch had a circuit failure and stopped all network traffic to/from the computer.
19/6
unicorn and popcorn crashed. (This time it was two power supplies on unicorn and one on popcorn that malfunctioned. All three started after flipping the switch.) There also seems to be some problem with the network to popcorn, so we had to disable popcorn until we have a connection again.

May

23/5
unicorn has crashed again:-( This time it's a memory module on another node... Unfortunately modules of that size were not available here in Gothenburg, so the machine stays down till tomorrow morning.
19/5
unicorn has crashed with "PANIC: /hw/module/7/slot/n1/node/cpu/b: Cache Error (too many primary instruction cache error exceptions) Eframe = 0x8b8." All jobs running on it were naturally terminated. It's back accepting jobs...

February

24/2
unicorn has crashed with "PANIC: /hw/module/2/slot/n2/node/cpu/a: Write error. PhysAddr 0x54451e000". It's back accepting shorter jobs until SGI determines whether something needs to be fixed.
15/2
unicorn has finally got the missing 6GB of primary memory back and is accepting jobs after this morning's downtime.

January

7/1
We now run IRIX 6.5 on all servers (front- and back-ends). The compilers were upgraded as well, to version 7.3. The previous compiler version is still available if you issue 'setenv TOOLROOT /opt/MIPSpro/721' at the command prompt. Online documentation isn't available just yet; in the meantime, consult the SGI online library.
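For anyone who needs to rebuild with the old toolchain, selecting it for the current session can be sketched as follows. The csh 'setenv' line is the one from the note above; the sh form is an assumed equivalent, on the assumption that the compilers honour TOOLROOT regardless of shell:

```shell
# csh/tcsh, as described above:
#   setenv TOOLROOT /opt/MIPSpro/721
# sh/ksh equivalent (assumption: same TOOLROOT mechanism):
TOOLROOT=/opt/MIPSpro/721
export TOOLROOT
echo "compilers taken from $TOOLROOT"
```

Unsetting TOOLROOT (or starting a fresh shell) should return you to the default 7.3 compilers.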
7/1
Today newton, the front-end machine, will be taken down for software upgrades at 17:00. Batch jobs running on the back-end servers won't be affected. I don't expect the downtime to be longer than 2 hours, but after unicorn's meltdown I keep my fingers crossed. Availability information will be posted here.

1999

December

29/12
unicorn was scheduled for an OS upgrade on the 27th of December. As a result the machine was unfortunately rendered unbootable. It's the second meltdown, which makes SGI support scratch their heads. But the machine is back up now running IRIX 6.5.6, though with less primary memory (9 instead of 15GB). Further service is scheduled for the 11th of January next year.
13/12
popcorn (new 40x250MHz R10000, 20GB RAM) is accepting queue jobs.

February

5/2
As of today we have started to automatically kill all jobs that overrun their scheduled time by more than 10%.

January

25/1
unicorn is booked for hardware service on January 26th, 9:00. Jobs that won't finish by then will be held in the queue.
25/1
unicorn crashed this weekend. We are investigating possible causes. Until we know whether any hardware service needs to be ordered or performed, unicorn is taking only short jobs.

1998

September

9/9
Another faulty PSU was replaced today and unicorn is back accepting (even long) queue jobs.
1/9
Another power supply unit at unicorn has failed and we were forced to cycle the power. Strangely enough the unit came back... The fault has been reported to SGI and unicorn is preliminarily booked for service, i.e. it accepts only short jobs for the moment.

August

31/8
unicorn hung for no visible reason and was therefore rebooted. We'll have a closer look at the crash dumps later today.
26/8
unicorn is back in the 64 CPU configuration. As most of you probably noticed, we had severe problems with the queue system after the reboot, but it's fixed now. We apologize for the inconvenience.
24/8
The 5th power supply unit at unicorn just died. We've brought the machine up in a 56 CPU/8 GB RAM configuration for now...
4/8
Compilers have been upgraded to MIPSpro Compilers, Version 7.2.1. The biggest news is Fortran compliance with the OpenMP specification.

July

28/7
We provide some statistics on the net. See the SGI main page.
2/7
Due to a power failure in 'fysikforskarhus', newton was down. This, however, didn't affect the batch jobs on unicorn. The power is back and everything should work now.
The switch at MDC had some trouble and unicorn was unreachable. Rebooting the switch made it work again.

May

25/5
unique, a mail relay to UNICC, had a break-in. The OS was upgraded to Solaris 2.6.
20/5
unicorn has crashed, the problem is being worked on.
Patches were applied to unicorn and newton. newton has been upgraded to 1024 MB of RAM.

April

27/4
Because of a power outage the front-end system will be shut down today at 17:00 and brought back up tomorrow morning. Batch jobs are not affected.
22/4
A fire outside of "Forskarhus fysik" caused a power outage in the server room belonging to the Physics Computer. Therefore newton will be down tonight 22/4 and probably on Thursday too.
8/4
"Misalign" bug strikes again? Yesterday one program compiled with the -align32 flag managed to hang one CPU and became immortal. Today more processes became immortal and some I/O functions, e.g. file system sync, stopped working. We had to reboot the whole system, and had to do it ungracefully. With the main power switch, I mean. Last year SGI claimed that the bug was worked around in the kernel...
2/4
There actually was more "damage" done. The queue daemon died as a result of broken communication with unicorn:-( The problem was discovered surprisingly late! As we have a lot of things to do, we actually count on you, guys, reporting problems. I believe it could have been brought to our attention much earlier!
2/4
Due to unexpected power outage newton was unavailable for about 1/2 hour today. In case you wonder...

March

27/03
newton was moved to another location today and now is back up again. It has now also got 512MB of RAM.
19/3
unicorn will be taken down for hardware service. Unfortunately all running jobs will be terminated:-(
18/3
unicorn went down because of a broken memory module. It is up again, but the module must be replaced.
13/3
unicorn is back after hardware service accepting your batch jobs.
3/3
unicorn won't get any hardware service this week, but next week. Wednesday, 11th of March, SGI will take the system down for 3 days in order to perform in-depth diagnostics.
1/3
unicorn has crashed with <0>PANIC: /hw/module/7/slot/n1/node/cpu/b: Cache Error (too many secondary cache error exceptions). It's back up again taking jobs, but the machine is preliminarily booked for hardware service this Wednesday, so only short jobs are accepted till it's clear when it gets service.

February

9/2
unicorn is back after hardware service, but still in the 62 CPU configuration. newton is still missing 128MB.
5/2
This time newton has crashed:-( One memory bank has blown out. It's back up with 384MB RAM; we'll take it down next week for proper hardware maintenance.
3/2
unicorn has crashed again... With a different panic message (<0>PANIC: /hw/module/8/slot/n2/node/cpu/b: Cache Error (Transient scache error) if you have to know:-), but it originated from the same node card as last time. I've disabled the corresponding node card and brought the system up in a 62 CPU configuration. It's back in the queue under the old conditions, namely "jobs that are due to finish by the beginning of next week". This might change without much notice if SGI manages to deliver spare parts before the end of this week. Cheers, if you find anything cheerful about it...
3/2
newton is back in a 4 CPU/512MB RAM configuration. unicorn is accepting long jobs that are due to finish by the beginning of next week...
2/2
unicorn crashed, presumably on Saturday, with Uncorrectable data array error and then Error while holding previous error. The system was brought up and SGI was notified about the fault. Long jobs are being held in the waiting queue until it's clear whether we have to take the system down for hardware maintenance this week. nelson still runs at half capacity, namely 2 CPUs and 256MB RAM, so take it easy on it...

January

30/1
newton crashed around 13:00 with a TLBMISS: I/STHREAD FAULT panic message and never came up... We had to disable half of the machine in order to bring it up. So it runs at half CPU/RAM capacity and will be taken down some time next week for proper hardware service.
22/1
The system is back after maintenance and upgrades. unicorn has now reached its maximum thickness, which makes it the first 64 CPU shared-memory computer in Sweden. nelson has finally died... The Origin 200-based newton was just born in a 4 CPU, 512MB RAM configuration. If you can't log in on newton.unicc.chalmers.se or access your home directory, contact your system administrator and ask him/her to include newton.unicc.chalmers.se in the NFS export group.
22/1
The whole system is down for hardware maintenance and upgrades. Coming back later today after 16:00...
20/1
unicorn went down and did not come up again! The crash was caused by a hardware fault in the CrayHub. This Thursday (980120) the whole system will be taken down for hardware service and further upgrades.
17/1
unicorn went down and did not come up. The crash was caused by an IO board hardware fault. It's back with less swap space. After cycling the power one CPU disappeared. Some time next week the whole system will be taken down for service and further upgrades. Namely, nelson's CPUs are going to unicorn, which should give a 64 CPU configuration. nelson in turn will get 4 x 180MHz R10000 CPUs (1MB cache) of its own.

1997

December

11/12
unicorn is back after service. We hope it stays up longer now:-) So far (3 hours uptime) there is no indication of problems, at least not similar to yesterday's. Give it a shake! nelson is missing one CPU now...
10/12
Well, unicorn crashed at 17:30 after no more than 4.5 hours of uptime:-( The problem is presumably caused by a malfunctioning router card. But not by one of those exchanged recently! In case you wonder why routers "break" so often lately: they probably don't! We are simply starting to use ports that have never been used before. As most waiting jobs in the queue are long and it's rather late, I have decided to leave the queueing system down till unicorn gets proper service tomorrow morning.
10/12
There is no popcorn left anymore:-( Well, unicorn got substantially "thicker"... It's back taking queue requests in a 56 CPU/9GB RAM configuration from 15:00. Files from popcorn:/stor will be made available shortly...
3/12
unicorn is back after hardware upgrades. popcorn was rebooted as well in order to replace a malfunctioning router card. All 56 CPUs are now back in the batch queue. Later this month we plan yet another major downtime: the plan is to join unicorn and popcorn into one 56 CPU machine.

November

26/11
unicorn's crash yesterday night was caused by a malfunctioning Cray router card. We had to borrow a similar card from popcorn and proceeded with the upgrade. As a result popcorn comes back in a 20 CPU configuration. We have scheduled unicorn's upgrade for next week. By then we'll probably have to reboot popcorn as well in order to replace the router card and get those 4 missing CPUs back.
25/11
unicorn just crashed (20:45) and never came up...
25/11
nelson is back after hardware upgrades, but popcorn isn't:-( Tomorrow SGI will proceed with the popcorn upgrade.
25/11
nelson and popcorn are taken down for hardware upgrades. nelson is back at ~12:00, popcorn is back in queue around 17:00.
21/11
Because of an unexpected major power failure, all SGI servers went down at around 14.00.
14/11
Between 13.00 and 17.00 unicorn, popcorn and nelson will be unavailable. Version 7.2 of the compilers will be installed, as well as the latest patch set (required for systems with more than 32 processors.)

October

31/10
At 18.21 the NFS server term17.tfd went down, causing kernel hangs on unicorn. Therefore no new batch jobs can be started on unicorn or popcorn until the server is brought up again. term17 is not maintained by the UNICC staff.
14/10
Support for interactive jobs has been added to the queue system. For more information, read about the -i option in the qsub manual page.
It is now possible to run long jobs (up to two weeks of execution time) on seven nodes on popcorn.
9/10
It is no longer allowed to log in interactively on popcorn, because its 24 processors are now controlled by the scheduling system 'fair que'. Batch jobs submitted from nelson will be run on either unicorn or popcorn, depending on which machine becomes available first. With the -q option in qsub it is possible to force the job to be run on a particular host.
2/10
Batch jobs should be submitted from nelson from now on. There are currently two queues: popcorn and unicorn. Unicorn is the default queue; the other queue is accessed using the "-q popcorn" flag. It is likely that the two batch servers will be merged later this year.
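Putting the October queue changes together, typical submissions might look as follows. The script name is a hypothetical placeholder, and only the -i and -q options mentioned in the entries above are shown; check the qsub manual page for the exact syntax:

```shell
# "myjob.sh" is a hypothetical placeholder job script.
qsub myjob.sh             # goes to the default queue (unicorn)
qsub -q popcorn myjob.sh  # forces the job onto popcorn
qsub -i                   # requests an interactive job
```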

September

30/9
nelson's OS/application pool has been synched with the rest of the system. The plan is to have it as the only machine available for interactive login. popcorn will be moved under queueing system control within a week.
23/9
popcorn had to be rebooted after a system hang. The hang was probably caused by a failed SCSI bus. Around 19.00 popcorn will be brought down so that the swap disks can be moved to another bus.
22/9
The scheduling system on unicorn now gives exclusive access to processors for serial jobs. It also tries to match the memory reserved against the amount of free memory on the nodes, choosing the "best" cpu for the job. The benefits are faster execution of serial jobs and more predictable execution times.
10/9
popcorn has had its bad memory modules replaced. One node card was moved to nelson with the intention of running some enhanced diagnostics within a couple of days. So don't bet on nelson too much:-)
5/9
"Real" queue system accounts were added for a few users. From now on the use of unicorn and popcorn will be charged according to the amount of cpu hours and memory GB hours consumed. The estimated cost is 1-2 kr/cpu hour + 8 kr/GB hour.
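A hypothetical worked example of this charging model (the job size, duration, and the choice of the 2 kr rate are illustrative assumptions, not real bookings):

```shell
# Estimated charge for a hypothetical job:
# 16 CPUs for 10 hours at 2 kr/cpu-hour, plus 4 GB held for 10 hours at 8 kr/GB-hour.
awk 'BEGIN {
  cpu_hours = 16 * 10                    # 160 cpu-hours
  gb_hours  = 4 * 10                     # 40 GB-hours
  cost = cpu_hours * 2 + gb_hours * 8    # 320 + 320
  print cost " kr"                       # prints "640 kr"
}'
```

Note that memory GB hours are charged whether or not the memory is actively used, which is one more reason to book resources correctly.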

August

28/8
The scheduling system on unicorn currently reserves 16 processors and 2.5 GB of memory for short jobs Monday-Friday 08:00-17:00. Short jobs are jobs with execution time less than four hours. In addition one processor is reserved for serial test jobs with execution time up to 15 minutes. The background is that users from a number of departments have complained about the long turn-around time for short jobs. After having considered a number of possible solutions, the technical group of UNICC decided to test-run the new queue system configuration (which resembles the EASY queue system setup at PDC).
15/8
popcorn was rebooted because the root file system needed to be rebuilt. Also nelson had to be rebooted.
14/8
The scheduling system 'fair que' has been updated. It is now possible to check how much memory a job is using and the cpu utilization of the job. A number of environment variables have been added, and there is support for multiple execution servers. For more information, see the manual pages for qsub, qdel and qshow.

July

31/7
2 node cards at unicorn were replaced, so it got all 32 CPUs back. popcorn got new memory modules and a disk. SGI service was kind enough to leave the old disk in place till we shuffle the data. It might be possible to preserve /stors! /users/phc is being moved to popcorn. unicorn is getting a 25GB /stor.
29/7
popcorn crashed with "<0>PANIC: CPU 7: TLBMISS: I/STHREAD FAULT". SGI support claims that patch #1984 fixes the problem. After having a look at popcorn, I found unicorn in a somewhat sad state: some immortal processes hanging around (a kind of bad omen), and mount claims no local file systems are mounted... I decided to reboot and patch up both systems. During patching something went wrong with popcorn. It's not possible to reboot; one of the /etc/rc*.d/K* procedures never completes. Configuration of the new kernel fails with the following message:
ld: ERROR 26: Jump out of 256 megabyte memory area. 156-th relocation entry, file failover.o jumping to .rel.text (hwgraph_traverse)
ld: INFO 60: Output file removed because of error.
lboot: ld returned 512--failed
Looks messy, huh? For the time being I've simply copied unicorn's kernel to popcorn. The queueing system didn't work after the reboot. Investigating the matter... Looks like it's working now!

On Thursday both systems will be taken down for hardware maintenance (bad CPU on unicorn, bad memory modules and disk on popcorn). By then we'll rearrange the disk storage. All files in all /stors will vanish! Take backups!!!

17/7
On-line documentation for the 'fair que' scheduling system written, available here.
16/7
The scheduling system 'fair que' was started on unicorn. From now on new jobs should be submitted from popcorn using 'qsub' and interactive logins on unicorn should be avoided.

June

29/6
"Getting started" information was added to the web site. Please send comments and suggestions on the contents of this page to help@fy.chalmers.se.
26/6
It seems that it was actually HUB0 that failed. It was not possible to boot the computer with that node card in, because it didn't manage to perform CrayLink discovery. The card is disconnected for the moment. In addition, one CPU (in another node!) failed to pass the power-on diagnostics ("scache" failure). The corresponding node card is disconnected as well for safety's sake. This means that unicorn is running with 28 CPUs.
23/6
A processor on unicorn stopped working at about 23.00. It was not possible to boot the machine afterwards, so we are waiting for service personnel.
12/6
According to the syslog file there are problems with one of the disks that make up the /stor partition on popcorn. If your program reads or writes data on /stor and fails unexpectedly, please send an e-mail to 'help' and tell us about it.
11/6
popcorn went down at 18.00. The reason for the failure was probably a bad memory module. The module has caused panics twice before, so it should be replaced.
10/6
SGI says that the 'kiva' bug is caused by an unfortunate instruction sequence that the compiler generates when the -align flags are used. Therefore you should avoid using these flags (-align8, -align16 or -align32) for the time being. A few programs may not work correctly without them, however, so make sure you check the results.
5/6
popcorn got its broken node card replaced. While being patched it simply hung. After cycling the power it failed to come up, crashing early in the startup sequence with
<0>PANIC: CPU 6: XFS aborting dirty transaction 0xa800000300410f00
At the very same time (due to presence effect:-) unicorn crashed with
<0>PANIC: /hw/module/3/slot/n3/node/cpu/b: Cache Error (unrecoverable error, too many cache error exceptions) Eframe = 0x8d0
After being brought up both systems were patched up with latest mumbo-jumbo patches from SGI.
2/6
unicorn crashed at 22.20 with segmentation violation in NFS server code, presumably handling NFSv3 request from nelson.

May

29/5
The two broken processors on unicorn are replaced.
27/5
popcorn is up again, but one node card is disabled. SGI has ordered a replacement.
26/5
popcorn went down, and after cycling the power it fails early in the start-up sequence. The reason for the failure is unknown so far.
SGI thinks the 'kiva' bug is caused by a chip failure. Engineering is working on a workaround.
25/5
unicorn is back up again. However, only 30 processors are working. SGI support is informed.
24/5
unicorn went down due to a problem with a PCI interrupt (this error had not been seen before.)
19/5
popcorn crashed due to a memory module failure.
5/5
unicorn had to be rebooted. The reason for the failure is unknown so far. Sometimes popcorn feels hopelessly slow. It looks like there are problems with the kernel scheduler.

April

28/4
unicorn and popcorn will be down from 9.00 to 12.00 due to planned service.
23/4
There are processors hanging on both unicorn and popcorn, so the machines have to be rebooted today. The reason for the failure is known, but SGI has no fix for it yet.
19/4
popcorn is considered fully operational now, although it has only 24 processors at the moment. If you still encounter bugs or deficiencies in system software, please send a notice to help@unicc.chalmers.se
17/4
Today SGI support personnel have been looking at unicorn and popcorn trying to understand what causes the problems with hanging compiler components and unreliable NFS access. We have applied a few more patches and changed the configuration of NFS, so hopefully the machines will be more stable.
9/4
popcorn, and possibly even unicorn, will be inaccessible while support personnel from SGI try to analyse what causes the frequent OS failures (compiler problems, processes hanging etc.)
4/4
As many of you have noticed, there are problems with NFS on unicorn and popcorn. It also happens that processes hang and get immortal. The only way to recover is to reboot the system. SGI knows about our problems.
Currently popcorn has only 24 processors, because the node boards that were responsible for the hardware faults last month are now running on their own.
There are /stor disks at both unicorn and popcorn. They may be used for storing temporary data that is needed by active jobs. Please clean up when the data is no longer needed.

March

17/3
The acceptance test period has started. unicorn and popcorn should now be fully operational, so please report any deficiencies in hardware or system software to help@unicc.chalmers.se
14/3
Two bad node boards on popcorn were replaced, and more disks were installed. Both unicorn and popcorn are now fully equipped with respect to cpus and disks.
12/3
popcorn went down today due to a hardware failure on one of the bad node boards.
10/3
popcorn, the other 32-cpu machine, is now available for interactive use.
Users at the School of Chemistry can now get an account on unicorn. Contact Björn Sandell for more information.
7/3
unicorn will be rebooted at 13.30 because of board swaps (we are going to gather the bad node boards on popcorn in order to see if this makes unicorn feel better.)
6/3
So far we have had five similar hardware failures on unicorn and popcorn (the second 32 cpu machine.) The pattern seems fairly clear now; all failures have affected node boards with revision J (four boards out of 32 have this revision.) SGI support will have to come up with a solution soon, because the acceptance test period will start next week. Today I put two cpus off-line in order to keep the machine running over the weekend.
5/3
unicorn went down at 22.10.
4/3
unicorn crashed at 20.20 because of a hardware error within node 0. Unfortunately no logs were created, so the exact reason for the failure is unknown.
2/3
unicorn went down in the morning because of a hardware fault on one of the cpu boards. It looks like the cpu received data that had been corrupted during the transfer from a remote memory to the local cache.
1/3
unicorn is available for interactive use.

February

During January and February we cannot make any promises about uptime and availability. We are running a prerelease of the operating system, and not all hardware has been delivered yet.

28/2
unicorn will be rebooted today after 13.30 because more disk space will be installed. Meanwhile you can listen to a seminar given by Fredrik Hedman, PDC, about "Parallel Manybody Algorithms for Molecular Dynamics." Read more about the talk here.
27/2
unicorn is available for interactive logins again. Some programs are missing; emacs and mathematica are the first that come to mind. They'll be back shortly.
26/2
Today unicorn is down for system upgrades. The failing cpu has been replaced, and we have got licences for the "Workshop" package (development and profiling tools.)
24/2
unicorn went down several times due to hardware errors within one of the nodes. Interactive logins have been disabled again. Test runs with one of the cpus offline indicate that many of the system crashes during the last few weeks can actually be blamed on the failing cpu, i.e. good news!
20/2
unicorn is open for interactive use again. A kernel patch has been installed, so the system should be more stable now.
13/2
unicorn crashed several times today, so interactive logins are disabled again. Diagnostic reports were available, and SGI support says it is a known problem related to SCSI disks that will be fixed next week when the new version of the OS arrives.
12/2
The patches have been installed as well as MPI and PVM software. See 'man 5 mpi' or 'man pvm_intro' for more information. The happy hour continues...
11/2
SGI will send patches today that should make the machine more stable. unicorn will not be available for interactive use until after they have been installed.
8/2
unicorn crashed at 20.34 and did not recover automatically. So far the reason for the failure is unknown.
7/2
The Power C compiler has been installed (see 'man pca').
7/2
Reboot due to scheduler problems again.
6/2
The machine had to be rebooted due to process scheduler problems. Newly created processes stayed in the run queue forever.
3/2
Unicorn went down due to problems with NFS version 3 client software. In order to stabilize the machine all home directories will be mounted using NFS version 2 instead.

January

A machine with 32 cpus has been started up and the compilers (C, f77, f90) are installed. Everybody with a UNICC account and an NFS-exported home directory can log in and run programs interactively. There are no passwords on unicorn (sending unencrypted passwords over the Chalmers network is not a good idea anyway), so users must have access to 'ssh' in order to log in. MPI is not installed yet.

This page is maintained by Lennart Bengtsson, Fysikdatorn