Daily Bulletin Archive

June 5, 2013

The series of steps undertaken thus far at Juniper's instruction has not resolved the compilation issues that users have been experiencing on Yellowstone.

Compilation and license checkouts are currently working across the system, but users may experience less than optimal performance. Two login nodes, yslogin1 and yslogin2, are showing the best performance. Users doing significant compilations may want to log in to those nodes directly. Users not using licensed software may want to move to the other login nodes (yslogin3-yslogin6).

The next fix will require a firmware upgrade of the central 6000-port Juniper switch for Yellowstone's management network. CISL is working with Juniper and IBM on this Severity 1 issue and will be scheduling a time to perform this work.

We apologize for the inconvenience.

June 5, 2013

HPSS: Downtime Tuesday, June 4, 7:00am - 9:00am

No Scheduled Downtime: Yellowstone, Geyser, Caldera, GLADE, Lynx

May 30, 2013

To complete the fixes for the compilation and software license problems Yellowstone has been experiencing, CISL will reboot the Yellowstone license server at 2 p.m. today, May 29. While the node is rebooting, users will not be able to check out new software licenses for approximately 10 minutes. Neither existing license checkouts nor running jobs should be affected.

This step should be the last needed to restore compilation across the Yellowstone environment to its normal state.

To prevent the network issues that led to the compilation problems, CISL will be updating the firmware on the management network's central Juniper switch during a future scheduled downtime.

May 29, 2013

No Scheduled Downtime: Yellowstone, Geyser, Caldera, HPSS, GLADE, Lynx

May 29, 2013

CISL staff rebooted the main Ethernet switch for the Yellowstone management network this morning, May 28, at 8 a.m. MT. The reboot of the switch completed at approximately 9:20 a.m., which restored traffic to a more normal state, though CISL is continuing to follow up with Juniper.

CISL has confirmed that compilation and license access have improved on all Yellowstone login, Geyser, and Caldera nodes.

Users continuing to experience problems with licensed software should contact cislhelp@ucar.edu.

May 28, 2013

We have a workaround in place in our license setup. Compilation will now work on login nodes 3 through 6, Geyser, and Caldera, though at a slower speed than on yslogin1 and yslogin2.

As the final fix for the compilation problems, we will be rebooting the main Ethernet switch for the Yellowstone management network on Tuesday morning, May 28, at 8 a.m. MT.

The reboot should not affect any running jobs or any ongoing user sessions with licenses already checked out. However, during the reboot, users will not be able to check out new licenses or start new compile tasks. Since a workaround is available, we are waiting until after the holiday weekend to ensure CISL staff can be on hand to monitor the system during and after the reboot.

May 24, 2013

The UCAR Software Engineering Assembly (SEA) and CISL Consulting Services Group (CSG) are offering High Performance Computing and Software Carpentry workshops from Tuesday, May 21, through Friday, May 24, to help participants acquire essential knowledge and skills for working with supercomputers. The workshops will be presented by CSG members and Alex Viana and Ted Hart from Software Carpentry. See CSG Summer Training for details.

May 23, 2013

The CISL Resource Status web page now shows near real-time activity in the Yellowstone environment’s job queues to help users identify opportunities to submit jobs and determine which queue to use when submitting jobs. Updated every three minutes, it displays the number of running and pending jobs for each queue, the number of nodes being used, and the number of active users. Also see Queues and charges for help with queue selection.

May 23, 2013

Storing large files in your GLADE file spaces is more efficient than storing numerous small files. This is because the system allocates a minimum amount of space for each file, no matter how small. On /glade/scratch, for example, the smallest amount of space the system can allocate to a file is 128 KB. Any files smaller than 128 KB are still allocated 128 KB, so they require more space than you might expect.
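The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, not a CISL tool: the function names are hypothetical, the 128 KB figure is the /glade/scratch value quoted above, and the sketch assumes that space is allocated in whole multiples of that minimum, which the bulletin does not state.

```python
# Hypothetical illustration of minimum-allocation overhead.
# Assumption: space is allocated in whole units of BLOCK_SIZE.
BLOCK_SIZE = 128 * 1024  # 128 KB, the /glade/scratch minimum quoted above


def allocated_bytes(file_size, block_size=BLOCK_SIZE):
    """Return the space a file of file_size bytes would occupy."""
    # Round up to a whole number of units; even a tiny file gets one unit.
    units = max(1, -(-file_size // block_size))
    return units * block_size


def total_allocated(file_sizes, block_size=BLOCK_SIZE):
    """Return the total space occupied by a collection of files."""
    return sum(allocated_bytes(size, block_size) for size in file_sizes)


if __name__ == "__main__":
    small_files = [4096] * 1000          # 1,000 files of 4 KB each
    combined = sum(small_files)          # the same data stored as one file
    print("data stored:       ", combined, "bytes")
    print("as 1,000 files:    ", total_allocated(small_files), "bytes")
    print("as one large file: ", allocated_bytes(combined), "bytes")
```

Under these assumptions, 1,000 files of 4 KB each occupy about 131 MB, while the same data combined into a single file occupies about 4.2 MB.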

See CISL best practices for more details and to learn how to make the best use of your computing and storage allocations.

May 17, 2013

The /glade/p file system was returned to service and remounted on the Yellowstone nodes at 8:30 p.m. May 16. Pending jobs resumed running and most completed overnight.

During the afternoon, while /glade/p was down, users may have noticed jobs switching repeatedly from a RUN to a PEND state and back again. This was due to LSF proactively working to keep jobs off nodes that did not have the /glade/p file system mounted. All such jobs should have run successfully once /glade/p was remounted. We will be looking at ways to allow jobs to run safely with only some of the GLADE spaces mounted.