Daily Bulletin Archive

June 14, 2013

Our High Performance Storage System team reports that HPSS has been exceptionally busy with file transfer requests. The highest levels of activity tend to be between noon and 5 p.m. on weekdays, peaking on Wednesdays. During such periods, HPSS may hit its global transfer limit and, as a result, users are increasingly likely to experience “EIO” errors. These error messages indicate that either the user's individual transfer limit or the global transfer limit has been reached. 

We continue to monitor the system closely and are actively working to identify ways to improve the system’s performance and throughput.

Please see our HPSS documentation for information about the global and individual transfer limits. These web pages may also help you use the system more efficiently:

To provide some perspective, HPSS has seen an average of almost 240 TB of new data stored each week for the past five weeks, reflecting the impact of the Yellowstone system. In comparison, when Bluefire was in production, new data stored was closer to 50 to 70 TB per week.

Thank you for your patience as we continue to monitor HPSS throughput and performance in order to identify ways to balance the system's reliability and usage.

June 6, 2013

To complete a workaround to address the ongoing issues with compilation and the license server, CISL consultants will be updating a number of compiler modules today.

Most users will not notice any impact. However, if you have saved a customized set of modules to load when you log in, you will get an error message regarding a problem with your modules configuration.

This is a known limitation of the current modules software, and affected users will need to re-create their default modules settings according to the instructions at http://www2.cisl.ucar.edu/resources/software/modules, under Customized Environments.

If you have questions, please contact cislhelp@ucar.edu.

June 5, 2013

The series of steps undertaken thus far at Juniper's instruction have not resolved the compilation issues that users have been experiencing on Yellowstone.

Compilation and license checkouts are currently working across the system, but users may experience less-than-optimal performance. Two login nodes, yslogin1 and yslogin2, are showing the best performance. Users doing significant compilations may want to log in to those nodes directly. Users not using licensed software may want to move to the other login nodes (yslogin3-yslogin6).

The next fix will require a firmware upgrade of the central 6000-port Juniper switch for Yellowstone's management network. CISL is working with Juniper and IBM on this Severity 1 issue and will be scheduling a time to perform this work.

We apologize for the inconvenience.

June 5, 2013

HPSS: Downtime Tuesday, June 4, 7:00am - 9:00am

No Scheduled Downtime: Yellowstone, Geyser, Caldera, GLADE, Lynx

May 30, 2013

To complete the fixes for the issues Yellowstone has been experiencing related to compilation and software licenses, CISL will reboot the Yellowstone license server at 2 p.m. today, May 29. While the node is rebooting, users will not be able to check out new software licenses for approximately 10 minutes. Neither existing license checkouts nor running jobs should be affected.

This step should be the last needed to restore compilation across the Yellowstone environment to its normal state.

To prevent the network issues that led to the compilation problems, CISL will be updating the firmware on the management network's central Juniper switch during a future scheduled downtime.

May 29, 2013

No Scheduled Downtime: Yellowstone, Geyser, Caldera, HPSS, GLADE, Lynx

May 29, 2013

CISL staff rebooted the main Ethernet switch for the Yellowstone management network this morning, May 28, at 8 a.m. MT. The reboot of the switch completed at approximately 9:20 a.m., which restored traffic to a more normal state, though CISL is continuing to follow up with Juniper.

CISL has confirmed that compilation and license access have improved on all Yellowstone login, Geyser, and Caldera nodes.

Users continuing to experience problems with licensed software should contact cislhelp@ucar.edu.

May 28, 2013

We have a workaround in place in our license setup. Compilation will now work on login nodes 3 through 6, Geyser, and Caldera, though at a slower speed than on yslogin1 and yslogin2.

As the final fix for the compilation issues, we will be rebooting the main Ethernet switch for the Yellowstone management network on Tuesday morning, May 28, at 8 a.m. MT.

The reboot should not affect any running jobs or any ongoing user sessions with licenses already checked out. However, during the reboot, users will not be able to check out new licenses or start new compile tasks. Since a workaround is available, we are waiting until after the holiday weekend to ensure CISL staff can be on hand to monitor the system during and after the reboot.

May 24, 2013

The UCAR Software Engineering Assembly (SEA) and CISL Consulting Services Group (CSG) are offering High Performance Computing and Software Carpentry workshops from Tuesday, May 21, through Friday, May 24, to help participants acquire essential knowledge and skills for working with supercomputers. The workshops will be presented by CSG members and by Alex Viana and Ted Hart of Software Carpentry. See CSG Summer Training for details.

May 23, 2013

The CISL Resource Status web page now shows near real-time activity in the Yellowstone environment’s job queues to help users identify opportunities to submit jobs and decide which queue to use. Updated every three minutes, it displays the number of running and pending jobs for each queue, the number of nodes being used, and the number of active users. Also see Queues and charges for help with queue selection.