Daily Bulletin Archive

June 14, 2013

Allocations for some non-university projects are subject to 30- and 90-day thresholds as explained in our Allocation use and thresholds documentation. CISL will begin enforcing that policy on Monday, June 17.

The thresholds apply to several NCAR divisional allocations and a small number of projects that have very large allocations. No university projects are affected.

When usage exceeds the thresholds that apply to an allocation, LSF notifies users who submit jobs and redirects those jobs to the low-priority “standby” queue. The message includes the project code (for example, P12345678) and the statement: “Warning: Project group exceeds a 30/90 threshold.”
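The redirect behavior described above can be pictured with a short Python sketch. This is an illustration only, not LSF's actual implementation: the function name, the usage and limit parameters, and the "regular" queue name are hypothetical, and the sketch assumes a job is redirected when either the 30-day or the 90-day threshold is exceeded. The real message also includes the project code, which is omitted here because its exact format is not shown.

```python
# Statement quoted in the bulletin; the real message also carries the project code.
WARNING = "Warning: Project group exceeds a 30/90 threshold."


def choose_queue(requested_queue, used_30, limit_30, used_90, limit_90):
    """Return (queue, message) for a submitted job.

    Assumption: exceeding EITHER threshold triggers the redirect.
    All parameters are hypothetical usage figures in the same units.
    """
    over_threshold = used_30 > limit_30 or used_90 > limit_90
    if over_threshold:
        # Job is redirected to the low-priority queue and the user is warned.
        return "standby", WARNING
    # Within thresholds: the job stays in the queue the user asked for.
    return requested_queue, None
```

For example, choose_queue("regular", 120, 100, 200, 300) returns the standby queue along with the warning, because the hypothetical 30-day usage of 120 exceeds its limit of 100.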

To check on the status of an allocation, log in to https://apps.weg.ucar.edu/reports with your Yellowstone username and your UCAS password. Select “Divisional Reports” and then the appropriate division.

Contact cislhelp@ucar.edu if you have questions.

June 14, 2013

CISL documentation regarding the Intel Math Kernel Library (MKL) of optimized math routines now includes OpenMP and MPI usage examples. In addition to the new parallel examples, the MKL documentation presents sample batch job scripts and procedures for accessing the numerous Intel examples on the Yellowstone system. See MKL: Math Kernel Library and contact cislhelp@ucar.edu if you have questions.

June 14, 2013

CISL, IBM, and Mellanox are currently discussing the possibility of a major downtime to Yellowstone to significantly improve the ability of the system's FDR InfiniBand interconnect to support large-scale jobs. As part of those discussions, we have been working to understand time-critical user needs for the Yellowstone system.

At this time, however, CISL is still gathering information and assessing risks and rewards. We have not decided on a particular course of action and are working with IBM and Mellanox on competing strategies and the logistics of each. One scenario would entail a complete downtime of two weeks or more. Another would reduce the amount of time the full system is unavailable, but would lengthen the time needed to make the changes and might increase the risk of system instability in the interim.

CISL is considering both the immediate user impacts of the downtime and the scientific productivity of Yellowstone over its lifetime as part of the decision-making process. We are evaluating all possible options and alternatives to minimize the disruption to users.

We will provide more information as soon as we have decided how we will proceed and have a tentative time frame for the downtime.

June 14, 2013

Our High Performance Storage System team reports that HPSS has been exceptionally busy with file transfer requests. The highest levels of activity tend to be between noon and 5 p.m. on weekdays, peaking on Wednesdays. During such periods, HPSS may hit its global transfer limit and, as a result, users are increasingly likely to experience “EIO” errors. These error messages indicate that either the user's individual transfer limit or the global transfer limit has been reached. 
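Because these errors reflect a temporary limit rather than a failed file, one common way to cope is to retry the transfer after a pause. The Python sketch below shows that general pattern and is not a CISL-provided tool. It assumes the transfer is wrapped in a hypothetical callable that raises OSError with errno.EIO when a limit is hit; how an actual HPSS client reports the error may differ. The attempt count and delays are illustrative, not recommended values.

```python
import errno
import time


def retry_on_eio(transfer, attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call transfer() and retry with exponential backoff on EIO errors.

    transfer   -- hypothetical zero-argument callable that performs one transfer
    attempts   -- maximum number of tries (illustrative default)
    base_delay -- seconds to wait after the first failure; doubles each retry
    sleep      -- injectable so the waiting can be replaced in tests
    """
    for attempt in range(attempts):
        try:
            return transfer()
        except OSError as exc:
            # Re-raise anything that is not EIO, or EIO on the final attempt.
            if exc.errno != errno.EIO or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Spacing retries out, rather than resubmitting immediately, avoids adding load during the busy periods described above.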

We continue to monitor the system closely and are actively working to identify ways to improve the system’s performance and throughput.

Please see our HPSS documentation for information about the global and individual transfer limits and for guidance on using the system more efficiently.

To provide some perspective, HPSS has seen an average of almost 240 TB of new data stored each week for the past five weeks, reflecting the impact of the Yellowstone system. In comparison, when Bluefire was in production, new data stored was closer to 50 to 70 TB per week.

Thank you for your patience as we continue to monitor HPSS throughput and performance in order to identify ways to balance the system's reliability and usage.

June 6, 2013

To complete a workaround to address the ongoing issues with compilation and the license server, CISL consultants will be updating a number of compiler modules today.

Most users will not notice any impact. However, if you have saved a customized set of modules to load when you log in, you will get an error message regarding a problem with your modules configuration.

This is a known limitation of the current modules software, and affected users will need to re-create their default modules settings according to the instructions at http://www2.cisl.ucar.edu/resources/software/modules, under Customized Environments.

If you have questions, please contact cislhelp@ucar.edu.

June 5, 2013

The series of steps undertaken thus far at Juniper's instruction has not resolved the compilation issues that users have been experiencing on Yellowstone.

At this time, compilation and license checkouts are working across the system, but users may experience less than optimal performance. Two login nodes, yslogin1 and yslogin2, are currently demonstrating the best performance. Users doing significant compilations may want to log in to those nodes directly. Users not using licensed software may want to move to the other login nodes (yslogin3-yslogin6).

The next fix will require a firmware upgrade of the central 6000-port Juniper switch for Yellowstone's management network. CISL is working with Juniper and IBM on this Severity 1 issue and will be scheduling a time to perform this work.

We apologize for the inconvenience.

June 5, 2013

HPSS: Downtime Tuesday, June 4, 7:00 a.m. - 9:00 a.m.

No Scheduled Downtime: Yellowstone, Geyser, Caldera, GLADE, Lynx

May 30, 2013

To complete the fixes for the issues Yellowstone has been experiencing related to compilation and software licenses, CISL will reboot the Yellowstone license server at 2 p.m. today, May 29. While the node is rebooting, users will not be able to check out new software licenses for approximately 10 minutes. Neither existing license checkouts nor running jobs should be affected.

This step should be the last needed to restore compilation across the Yellowstone environment to its normal state.

To prevent the network issues that led to the compilation problems, CISL will be updating the firmware on the management network's central Juniper switch during a future scheduled downtime.

May 29, 2013

No Scheduled Downtime: Yellowstone, Geyser, Caldera, HPSS, GLADE, Lynx

May 29, 2013

CISL staff rebooted the main Ethernet switch for the Yellowstone management network this morning, May 28, at 8 a.m. MT. The reboot of the switch completed at approximately 9:20 a.m., which restored traffic to a more normal state, though CISL is continuing to follow up with Juniper.

CISL has confirmed that compilation and license access have improved on all Yellowstone login, Geyser, and Caldera nodes.

Users continuing to experience problems with licensed software should contact cislhelp@ucar.edu.