Status

There is a ganglia server running at http://www.cs.unc.edu/ganglia that shows the current load on the nodes.

Outstanding Issues

InfiniBand bandwidth for larger packets on connections originating on comp nodes is running at about 40% of what it could be. We have a ticket open with QLogic to resolve this.

When MPI jobs are deleted via qdel, they sometimes keep running on the grid nodes.

--Eddale 12:09, 29 June 2009 (EDT)
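Until the qdel issue above is resolved, one way to check a node for leftover processes is sketched below. This is only an illustration under stated assumptions: it assumes you are able to log in to the node in question and that a POSIX `ps` is available there. The program name `my_mpi_app` and the helper names are hypothetical. The sketch only lists process IDs; it does not kill anything.

```python
import subprocess


def parse_ps(ps_output):
    """Parse lines of the form 'PID COMMAND...' into (pid, command) tuples."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip blank or malformed lines that do not start with a numeric PID.
        if len(parts) == 2 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1]))
    return rows


def leftover_pids(ps_output, name_fragment):
    """Return the PIDs whose command line contains name_fragment."""
    return [pid for pid, cmd in parse_ps(ps_output) if name_fragment in cmd]


def ps_for_user(user):
    """Return 'PID COMMAND' lines (no header) for all processes owned by user."""
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=False,  # ps exits non-zero when the user has no processes
    )
    return result.stdout


if __name__ == "__main__":
    import getpass

    # 'my_mpi_app' is a hypothetical program name; substitute your own.
    print(leftover_pids(ps_for_user(getpass.getuser()), "my_mpi_app"))
```

Run on a node after the job has been deleted, an empty list suggests nothing was left behind; any PIDs printed are candidates to inspect by hand.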

Previous Issues

MPI is now running fine through the grid engine, even when running jobs larger than 100 nodes in size.

Planned Downtimes

The first Wednesday of every month is tentatively planned as a maintenance day. A posting will be made to the mailing list detailing the anticipated downtime in advance of the maintenance.

Upgrades

Tentative update schedule

Wednesday November 11th, 7:30am

We will be re-installing file server bass-thor, which holds the /stage and /molecules data partitions. Host bass-thor will be upgraded from Red Hat 5.3 to 5.4 and configured to use the native Red Hat OFED-based InfiniBand drivers. The /stage and /molecules partitions will be preserved. Remember that these partitions are NOT backed up! Please make sure you copy your data somewhere else in case there is an issue with the re-installation. We will be making an offline copy of this data, but this is a reminder that this space is not backed up on a regular basis. The file server upgrade is expected to take about 3 hours.
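As one possible way to make your own copy before the maintenance window, the sketch below copies a directory tree and checks that every file arrived. It is a minimal illustration, not a recommended procedure: the function name `backup_tree` and both example paths are hypothetical, and it assumes the destination is on a different file server with enough free space.

```python
import shutil
from pathlib import Path


def backup_tree(src, dest):
    """Copy the directory src to a new directory dest and return the file count.

    Raises FileExistsError if dest already exists, so an earlier copy is
    never silently overwritten.
    """
    src, dest = Path(src), Path(dest)
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; choose a new destination")

    shutil.copytree(src, dest)

    # Compare the relative paths of all regular files in both trees.
    src_files = sorted(p.relative_to(src) for p in src.rglob("*") if p.is_file())
    dest_files = sorted(p.relative_to(dest) for p in dest.rglob("*") if p.is_file())
    if src_files != dest_files:
        raise RuntimeError("copy is incomplete: file lists differ")
    return len(dest_files)


if __name__ == "__main__":
    # Both paths are hypothetical placeholders; substitute your own.
    count = backup_tree("/stage/your_directory", "/path/to/other/storage/stage_copy")
    print(f"copied {count} files")
```

The check compares file names only, not contents; for large or critical data sets a checksum comparison would be a stronger test.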

Wednesday December 9th, 7:30am

We will be re-installing file server bass-files, which holds the /home and /nanoscratch data partitions. Host bass-files will be configured to use the native Red Hat OFED-based InfiniBand drivers. These data spaces are backed up daily. The file server upgrade is expected to take about 4 hours.

GridEngine 6.2u4 should be available as an RPM by this time and will be installed.
