Complex workflows have increasingly high computational and I/O demands and are composed of many communicating orchestrators, services, and simulations. The ECP ecosystem has seen workflow systems become increasingly pervasive for handling these classes of complex, distributed applications. The interplay of workflow technologies and HPC has been challenged by the rapid rise of ML technologies: ML needs to be integrated within workflows to ease their development and portability; workflows must benefit from ML to scale and to increase execution efficiency; HPC must embrace workflows to democratize access to its resources, and workflow systems should better exploit the power of HPC systems; ML must adapt to HPC architectures to scale to real-world large settings, and HPC must adjust to ML requirements to provide the computational resources needed to scale up to extremely large data sets. Furthermore, HPC resource allocation policies and scheduler designs typically provide a simple "job" abstraction instead of workflow-aware abstractions. As a result, it is difficult to run exascale workflows efficiently and conveniently on HPC systems without extending resource management and scheduling approaches. In this BoF, we will bring together researchers and DOE facility representatives from the workflows, HPC, and AI/ML communities who work on scientific research questions requiring large-scale, distributed, and AI-heavy computing at exascale, as well as on workflows that integrate (physical) experiments with computational facilities. The session will discuss challenges, opportunities, and future pathways, and will seek input for a post-ECP community roadmap focused on the sustainability of HPC and AI workflow software and applications.
Specifically, the BoF session discussions will focus on the following questions:
- In the era of exascale and ML/AI, what are the emerging and future crucial challenges for post-ECP workflows?
- What are the key constraints on, and opportunities for, achieving workflow sustainability?
- Which software and technologies are essential for enabling sustainable workflows post-ECP?
Rafael Ferreira da Silva – Oak Ridge National Laboratory
1:40pm-2:15pm — Panel Discussion
Shantenu Jha (Moderator) - Brookhaven National Laboratory
Katie Antypas - Lawrence Berkeley National Laboratory
Todd Gamblin - Lawrence Livermore National Laboratory
Andrew Gallo - GE Research
Arjun Shankar - Oak Ridge National Laboratory