Machine learning (ML) algorithms have been successfully applied to a wide variety of commercial applications. However, adoption of ML in traditional high performance computing (HPC) applications has lagged other segments because of differences in software ecosystems and cluster management tools. Specifically, these applications are typically written in Fortran, C, and C++ and leverage OpenMP and/or MPI parallelization to extract peak performance on HPC platforms. On the other hand, ML toolkits are generally Python-based. Moreover, HPC systems use advanced workload managers to schedule and manage bare-metal jobs. The SmartSim library (https://craylabs.org) was developed by Cray and HPE to overcome these challenges and help users create novel HPC workflows that mesh modern data science with machine learning capabilities.
SmartSim is an open-source software package developed by Hewlett Packard Enterprise (HPE) that helps users combine traditional engineering, scientific, and high-performance computing applications with machine learning tools and visualization packages. Internally, SmartSim consists of two libraries: an infrastructure library and a client library. The infrastructure library’s API enables users to configure scientific applications and to define and launch experiments (potentially including ensembles of simulations). As part of the machine learning infrastructure, SmartSim launches an in-memory database accessible across all applications in the workflow. Machine learning models can be executed inside the database or data can be streamed out of it for visualization and training. Machine learning infrastructure can be launched on CPU, GPU, or multi-GPU compute nodes, and the SmartSim library has features to help users achieve maximum workload throughput with all of these systems. SmartSim supports launching user-described workflows on several HPC workload managers including Slurm, PBS, LSF, and Cobalt. SmartSim also allows users to launch workflows locally on MacOS or Linux systems.
The SmartSim client library, for its part, provides interfaces that can be embedded into any Fortran, C, C++, or Python application. With these clients, applications can transfer data to and from the in-memory database and they can execute ML tasks (such as inference) on this data. Also, because the database is accessible by all component applications in the workflow, data can be streamed into and out of the database for online analysis and ML model training. Further, the client library offers optimized aggregation methods that retrieve data from the database for tasks such as online visualization and online training. The SmartSim client library needs only a TCP/IP connection to communicate with the database so it is highly portable.
Within the oceanographic community, the behavior of the water in oceans is simulated via ocean general circulation models (OGCMs). OGCMs use the Navier-Stokes partial differential equations (PDEs) to represent the laws of conservation of mass, momentum, and energy. Parameterizations added to these PDEs account for physical processes that occur at scales smaller than the model grid, but their unconstrained use can violate the conservation laws. Modern ocean modeling theory considers eddy energy (EKE) to be a critical variable in these parameterizations. The current state of the art numerically solves a PDE that describes EKE; however, the estimation or omission of a number of terms leads to significant errors. To improve the quality of OGCMs, collaborators from HPE, the Canadian Centre for Climate Modelling and Analysis, and the National Center for Atmospheric Research (NCAR) used a data-driven approach with machine learning to replace the EKE PDE. SmartSim is the key technology that allowed this ML model to be meshed with existing OGCMs. As a result of these efforts, we were able to improve the simulation accuracy without impeding performance in large-scale climate simulations.
In this work, we evaluated the results of our ML approach to the numerical approach using the Modular Ocean Model version 6 (MOM6). We compared the EKE estimated by each approach to the baseline EKE calculated by a third ‘eddy-resolving’ (ER) simulation with a spatial resolution of about 10 kilometers. This resolution is sufficient to resolve mesoscale eddies (turbulence) in the equator and mid-latitudes, but it is computationally expensive. The other two configurations coarsen the spatial resolution to 25 kilometers, which places them into an ‘eddy-permitting’ regime where modeling the impact of eddies on the rest of the simulation requires turbulence parameterization. The coarser configurations differ only in their eddy parameterization models: one uses the Mesoscale Eddy Kinetic Energy (MEKE), which represents the current state-of-the-art. The other uses “SmartSim-EKE”, which is based on a neural network trained to infer the eddy kinetic energy.
In our demonstration, we used SmartSim’s ensemble capabilities to run 12 parallel SmartSim-EKE simulations thereby demonstrating the scalability and performance of SmartSim. In so doing we were able to evaluate the accuracy of EKE parameterizations, and estimate overall computational cost at scale. We computed each ensemble member for 10 simulated years on an aggregate total combining 10,920 Intel Skylake and Cascadelake processors cores along with 16 Nvidia P100 GPUs dedicated to the shared in-memory database. Figure 2 summarizes the MOM6 ensemble workflow and the translation of SmartSim workflow descriptions to launched jobs on the hardware cluster.
We compare SmartSim-EKE and MEKE to the “true” EKE calculated from ER in the images below. As shown in Figures 3 and 4 below, SmartSim-EKE tends to overestimate the extent of the equatorial (0 latitude) EKE, but otherwise provides a much more accurate predictions of large-scale trends compared to the MEKE state-of-the-art predictions. Importantly, these gains in accuracy from the SmartSim-EKE ML model did not incur significant computational penalties relative to the state-of-the-art. Specifically, we observed only a 1.5% slowdown in the simulation – negligible on modern systems with heterogeneous compute nodes and higher GPU counts. These results show that machine learning infused into traditional HPC workloads can provide significant improvements in accuracy and/or performance. Moreover, SmartSim enabled these ML-driven gains without extensive changes to the HPC application source code or requiring maintenance of a one-of-a-kind ML infrastructure. More detailed information on the accuracy and performance of this SmartSim use case can be found in the Journal of Computational Science (https://doi.org/10.1016/j.jocs.2022.101707).
To stay up-to-date on new SmartSim features and release, please favorite (star) the SmartSim GitHub repository: https://github.com/CrayLabs/SmartSim.