Daily Bulletin Archive

Aug. 27, 2013

NCAR researchers and eligible university researchers can now request "small" Janus allocations of up to 200,000 core-hours at any time, an increase from the previous limit of 50,000 core-hours for small allocations.

University researchers can request allocations of more than 200,000 core-hours as part of the semi-annual large allocation process. The next deadline is Sept. 16. See University Large Allocation Request Form. For small allocations, use the University Small Allocation Request Form.

NCAR staff can also request both small and large allocations on Janus via the Janus allocation request form. Large allocations require a brief write-up describing the project's technical readiness and justifying the computational request. NCAR researchers should use the Alternative Allocation Request Form.

More information is available here: http://www2.cisl.ucar.edu/resources/janus/allocations
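As a rough illustration of the thresholds above, the following Python sketch estimates a request in core-hours and checks it against the 200,000 core-hour ceiling for small allocations. The function names and the example job sizes are hypothetical and are not part of any CISL or Janus tool; only the 200,000 core-hour limit comes from this bulletin.

```python
# Ceiling for "small" Janus allocations, as stated in this bulletin.
SMALL_LIMIT_CORE_HOURS = 200_000


def estimate_core_hours(cores: int, wallclock_hours: float, runs: int = 1) -> float:
    """Estimate total core-hours as cores x wall-clock hours x number of runs."""
    if cores <= 0 or wallclock_hours <= 0 or runs <= 0:
        raise ValueError("cores, wallclock_hours, and runs must be positive")
    return cores * wallclock_hours * runs


def allocation_kind(core_hours: float) -> str:
    """Return 'small' for requests up to the limit, 'large' for anything above it."""
    if core_hours <= 0:
        raise ValueError("core_hours must be positive")
    return "small" if core_hours <= SMALL_LIMIT_CORE_HOURS else "large"


if __name__ == "__main__":
    # Hypothetical campaign: 50 runs, each using 96 cores for 12 hours.
    request = estimate_core_hours(cores=96, wallclock_hours=12, runs=50)
    print(request, allocation_kind(request))
```

A request that comes out as "large" under this estimate would go through the semi-annual large allocation process rather than the small allocation form.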

Aug. 26, 2013

CISL staff will be conducting a two-part update to key Yellowstone system software components August 20 and August 27. As part of this update, Yellowstone, Geyser, and Caldera will be taken out of service August 27, from 6 am MT until 6 pm MT. The updates include fixes for a number of issues experienced by users.

Although not technically required, CISL's consultants strongly recommend that users recompile their codes following the August 27 downtime.

Of most interest to users, the updates to LSF and the IBM Parallel Environment (PE) include:

* corrected wrappers for the PGI compiler;

* a fix for a bug with MPI_IN_PLACE in MPI_Allgather that some users have encountered;

* a fix that will allow Fortran codes with "USE MPI" statements to compile correctly under PGI and GNU compilers; and

* the LSF and PE versions needed to complete integration of the Pronghorn Xeon Phi cluster into the environment.

On August 20, CISL will perform the first part of the update, upgrading the xCAT administration software, which is a prerequisite to the LSF and PE updates. No outage will be needed if the upgrade process goes as planned. However, users should be aware of the slight chance that CISL staff may need to take the system down should they encounter problems.

On August 27, CISL staff will take the system down to upgrade LSF to version 9.1.1 and the IBM PE to version 1.3.0.4. The downtime is necessary since all the nodes must be rebooted to propagate all the changes.

During this period a number of other system firmware and software components will be brought up to date, but these will largely be invisible to users.

GLADE and HPSS will not be affected by the update process and are expected to remain in service throughout this period.

Aug. 25, 2013

An XSEDE training session for beginning and intermediate Linux/Unix users will be webcast from 1 to 4 p.m. Central time on Friday, September 6.

The Texas Advanced Computing Center will present the training session “Linux/Unix Basics.” XSEDE described it as an interactive lecture that will emphasize common strategies for interacting with clusters and HPC resources. It will include hands-on exercises. There are no prerequisites.

To register, see https://www.xsede.org/web/xup/course-calendar

Aug. 16, 2013

Users are asked to plan around the 2013 Community Earth System Modeling (CESM) Tutorial schedule, August 12 to 16, to reduce potential contention for Intel compiler licenses.

Tutorial participants will be using Yellowstone’s six login nodes and four Caldera nodes for compilation between these hours:

  • 2:30 and 5 p.m. Mountain time on Monday, Tuesday, and Thursday

  • 1 and 3 p.m. on Friday

During these windows, 80 attendees will work in two-person teams, compiling and submitting CESM jobs. They will not be using PGI, GNU, or PathScale compilers, so those will not be affected.

The results of the tutorial compilations on most days will be small, short compute jobs that should have minimal impact on the availability of batch nodes for other users.

Aug. 13, 2013

Starting Friday and over the weekend, users may have experienced issues with interactive sessions on Yellowstone due to problems on two of the six login nodes.

Yslogin2 will be taken out of service today, Monday, August 12, 2 p.m. to 4 p.m., so that IBM can replace the system board on the node. The other five login nodes will remain available.

Yslogin4 was taken out of service Friday evening through Saturday morning to replace a failing InfiniBand adapter. User sessions were interrupted to complete the fix, and the node has been returned to service.

Aug. 8, 2013

HPSS: Downtime Tuesday, August 13, 7:00am-9:00am

No Scheduled Downtime: Yellowstone, Geyser, Caldera, GLADE, Lynx

Aug. 4, 2013

Registration is open for the third annual Front Range HPC Symposium, which is August 13-15, 2013, in Laramie, Wyoming. The Front Range Consortium for Research Computing (FRCRC) also is accepting submissions for the annual poster competition and technical papers.

See http://www.frcrc.org/hpcsymposium for conference, registration, and submission details. Registration closes August 2.

The FRCRC is a group of universities and government labs, including NCAR, located near a region of the Rocky Mountains known as the Front Range. The partnership enables its member institutions to promote high-performance computing (HPC) and to share ideas for further collaboration.

Aug. 4, 2013

This week, CISL staff are performing a rolling upgrade to the Yellowstone, Geyser and Caldera systems to bring the GPFS client software on the clusters up to version 3.5.

Sets of nodes have been placed under several system reservations and will be taken out of service and restarted with the new client software. After passing health checks, the nodes will be returned to service.

Users should not be affected by the updates, other than perhaps slightly longer queue waits as the reservations and upgrade process reduce the number of nodes available to jobs. Users should consult CISL's documentation on backfill windows to maximize their throughput around the reservations; see http://www2.cisl.ucar.edu/resources/yellowstone/using_resources/runningjobs#bslots

These updates complete the transition to the most recent version of GPFS, which provides Yellowstone and GLADE with a number of features to improve the management of the disk resource.

UPDATE, Aug. 1, 11:00 am MT: The upgrades to the login nodes have been completed and the nodes returned to users.

Two of the six Yellowstone login nodes have already been upgraded. The remaining four login nodes are scheduled to be updated between 10 a.m. and noon on Thursday, August 1. We will issue a screen message before bringing those nodes down. We recommend that you log in to yslogin3.ucar.edu or yslogin5.ucar.edu instead of yellowstone.ucar.edu on Thursday morning to avoid this disruption.

Aug. 4, 2013

CISL experienced some issues with receiving and delivering email starting Tuesday afternoon, July 30, through mid-morning Wednesday, July 31. No email was lost, but incoming email, including messages sent to the CISL Help Desk, was queued, and outgoing mail, including email from the Daily Bulletin and Notifier, was delayed.

We apologize for any delays in our response to user tickets. We will be working to catch up after email service is restored.

Jul. 29, 2013

CISL, IBM, and Mellanox have begun preparations for a multi-week downtime, starting in early October 2013, during which the Yellowstone InfiniBand cables will be replaced. At this time, users should plan for Yellowstone to be out of service for approximately three weeks in October.

After considering the various options, CISL and its vendor partners determined that replacing the cables was the most efficient and effective course of action to improve the ability of the system to support large-scale jobs over its lifetime. A large team from CISL, IBM, and Mellanox is engaged in extensive discussions and logistics planning to minimize the downtime needed. IBM and Mellanox staff visited NWSC last week to better understand the site-specific issues.

The current plan has the outage following these general phases:

  • A one- or two-day full downtime will be required to remove the existing cables. The system will be powered down to prevent debris from cut cables from being sucked into the fans in the nodes. CISL also plans to conduct some preventive maintenance in the NWSC central utility plant during this period.
  • Barring complications, we plan to return GLADE and the Geyser and Caldera clusters to service after the cable removal to permit users to conduct analysis, visualization, and data access tasks. (The InfiniBand cables for Geyser and between GLADE and the Geyser and Caldera clusters will be replaced in July and August, with limited downtimes.)
  • The recabling of the Yellowstone batch nodes will take approximately two weeks, with staff from Mellanox, IBM, and CISL contributing to the effort.
  • After recabling is complete, CISL expects to restore the full system to service without additional downtime.

Note that HPSS will remain available during the entire period, except during the NWSC utility maintenance.

We will update the user community with more details as the schedules solidify, but we are providing this early information so that users can adjust their computing plans for the fall. As usual, we will use the Daily Bulletin and Notifier messages for updates.
