Thursday - Jan 19, 2023
Complex workflows have increasingly high computational and I/O demands and are composed of many communicating orchestrators, services, and simulations. The ECP ecosystem has observed an increasing pervasiveness of workflow systems to handle these classes of complex, distributed applications. The interplay of workflow technologies and HPC has been challenged by the fast rise of ML technologies: ML needs to be integrated within workflows to ease their development and portability; workflows must benefit from ML to scale and increase the efficiency of executions; HPC must embrace workflows to democratize access to its resources, and workflow systems should better exploit the power of HPC systems; ML must adapt to HPC architectures to scale to real-world settings, and HPC must adjust to ML requirements to provide the computational resources needed to scale up to extremely large data sets. Furthermore, HPC resource allocation policies and scheduler designs typically provide a simple "job" abstraction instead of workflow-aware abstractions. As a result, it is difficult to run exascale workflows efficiently and conveniently on HPC systems without extending resource management and scheduling approaches. In this BoF, we will bring together researchers and DOE facility representatives from the workflows, HPC, and AI/ML communities who work on scientific research questions requiring large-scale, distributed, and AI-heavy computing at exascale, as well as workflows that allow the integration of (physical) experiments and computational facilities. The session will discuss challenges, opportunities, and future pathways, and will seek input for a post-ECP community roadmap focused on the sustainability of HPC and AI workflows software and applications.
Specifically, the BoF session discussions will focus on the following questions:
Rafael Ferreira da Silva - Oak Ridge National Laboratory
Shantenu Jha (Moderator) - Brookhaven National Laboratory
Katie Antypas - Lawrence Berkeley National Laboratory
Todd Gamblin - Lawrence Livermore National Laboratory
Andrew Gallo - GE Research
Arjun Shankar - Oak Ridge National Laboratory
- Managing dynamic workflows at scale
- Near term, we need to put in place services and capabilities in data centers that enable data management and movement in a portable way, and de-emphasize filesystems as the main orchestration mechanism.
- (1) Support for diverse coupling of HPC and AI/ML components, and integration scenarios without wholesale refactoring of applications; flexible frameworks that enable a range of coupling modes are needed. (2) Simplicity of deployment of ML/AI on HPC
- Data continuum to/from ML/AI workflows and traditional workflows
- Data, data, and data. Locality, provenance, movement and cost
- Cross-facility integration; robust standards for APIs and interfaces
- Challenge problems that integrate HPC simulations with measurements/observations include: digital twins, validation and verification, and experiment design or steering. What are the workflows that support these use cases? What are their differences?
- From Tuesday's talk: different tasks in a workflow may need to be submitted to different systems
- Leveraging ML/AI to reduce the turnaround time of completing a workflow.
- Interactive, human-in-the-loop model evolution/iteration is a must. How that fits into existing workflow engine designs is an interesting question
- Difficulty providing infrastructure for workflows that require dependable real-time turnaround
- Interconnection of systems; Standardized APIs; Data management
- One issue is that training data is often generated by physics models. These models take a lot more time and resources than a typical ML model for either training or inference. So there is a “length scale” problem.
- Reproducibility - retain how and where the code was run (environment, parameters, configuration, dependencies, etc)
- Dealing with heterogeneity of compute; Analysis of performance; Finding the shortest path to a solution; Reproducibility of workflows; Policy driven behavior vs technology behaviors
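The reproducibility point above (retaining how and where the code was run) can be sketched as a small provenance-capture step. This is a minimal illustration only, assuming a Python-based workflow step; `capture_run_metadata` and its field names are hypothetical, not the API of any particular workflow system.

```python
# Hedged sketch: record enough about a workflow step's run (environment,
# parameters, pinned dependencies) that someone can later reproduce it.
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def capture_run_metadata(parameters: dict, outfile: str = "run_metadata.json") -> dict:
    """Record where and how a workflow step ran, and save it as JSON."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "platform": platform.platform(),
        "python_version": sys.version,
        "parameters": parameters,
        # Pinned dependency versions, so the environment can be rebuilt later.
        "dependencies": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=False,
        ).stdout.splitlines(),
    }
    with open(outfile, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```

A real system would also record the code version (e.g. a git commit hash) and scheduler/job details; the point is that provenance is captured automatically at run time, not reconstructed afterward.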
- Funding specific software vs applications vs workflows
- If we cannot have a unified funding stream for sustainability, we should urge agencies and program managers to encourage participation and collaboration in common software infrastructures and processes.
- Building blocks that existing and future developers can use to customize/extend/develop their own workflow capabilities.
- A cognizant and consistent workforce that understands the workflow
- Reaching critical mass
- Workflows that cross institutional or organizational domains of ownership/control. We need policies and support from DOE
- Future funding
- Availability of no-code or low-code tools for defining and managing a workflow that lowers the barrier of leveraging workflows.
- The key opportunity is to create a wealth of knowledge and reusable capabilities. With a critical mass of workflows, I foresee a path to autonomous AI/ML-driven exploration of scientific questions. One constraint is that we need to think about how to sustain workflows at scale, at a population level.
- Opportunities - containerized and cloud technology for resilience
- Standardized formats; Portability
- If workflows are complicated enough, then managing the dependencies becomes a challenge. You need to test all the interdependencies to ensure that the whole workflow keeps working.
- Opportunity: define benchmarks around workflows (and hope such a benchmark endures the way LINPACK did). Constraint: to some extent, a scientific workflow is often being implemented for the first time (a never-done-before experiment); the tools required for production workflows and for experimental workflows are not the same.
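The interdependency-testing concern above can be made concrete with a small check: if each step declares its inputs and outputs, a test can verify before anything runs that every input is produced by some upstream step. A minimal sketch; the `Step` class and `check_dependencies` helper are illustrative, not from any workflow engine.

```python
# Hedged sketch: declare step interfaces explicitly and test them, rather than
# discovering a broken dependency mid-run on an expensive allocation.
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def check_dependencies(steps: list) -> list:
    """Return (step_name, missing_input) pairs; empty list means the wiring is sound."""
    produced = set()
    problems = []
    for step in steps:  # steps are assumed to be listed in execution order
        for needed in sorted(step.inputs):
            if needed not in produced:
                problems.append((step.name, needed))
        produced |= step.outputs
    return problems


# Example wiring: a simulate -> train -> infer pipeline where "infer" also
# expects labels that no upstream step produces.
simulate = Step("simulate", set(), {"trajectories"})
train = Step("train", {"trajectories"}, {"model"})
infer = Step("infer", {"model", "labels"}, {"predictions"})
```

Running `check_dependencies([simulate, train, infer])` flags the missing `labels` input, the kind of interface mismatch that interdependency tests are meant to catch.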
- Workflow systems and data management systems
- Cloud and classic-HPC interoperability and portability
- Depends upon workflow applications; in addition to application kernels, it includes libraries, workflow and resource management middleware, automation capabilities, data management software etc.
- Collaboration, which is far more important than point solutions.
- Containers; standardized APIs and interfaces with a clear governance model
- Automated builds, tools like containers and Spack
- Package manager, build tools, Python, resource manager, GUI/CLI/API
- There's a need for cyberinfrastructures to enable the curation and maintenance of workflows, and policies for governance for workflows as an enduring institutional resource. There is ongoing work at Sandia on this front.
- Containers, cloud-like services, systems with failovers or ability to burst to cloud
- Support for Resource/Workload Managers; Workflow-aware schedulers
- Ultimately it would be nice if the scale could be flexible. Some steps may need only a couple of nodes; others could use thousands. To avoid wasting resources, it would be nice if different phases could use different resource allocations.
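The per-phase allocation idea above can be illustrated with a back-of-the-envelope comparison: right-sizing each phase versus holding one allocation sized for the widest phase for the whole run. The `Phase` class and the numbers are hypothetical, not taken from any real workflow.

```python
# Hedged sketch: compare node-hours for per-phase allocations vs. one fixed
# allocation sized for the largest phase.
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    nodes: int          # nodes this phase actually needs
    walltime_min: int   # estimated wall time in minutes


def node_hours(phases):
    """Total node-hours if each phase gets a right-sized allocation."""
    return sum(p.nodes * p.walltime_min / 60 for p in phases)


phases = [
    Phase("preprocess", nodes=2, walltime_min=30),
    Phase("simulate", nodes=2048, walltime_min=120),
    Phase("analyze", nodes=16, walltime_min=45),
]

# One fixed allocation must hold the widest phase's nodes for the entire run.
fixed_node_hours = max(p.nodes for p in phases) * sum(p.walltime_min for p in phases) / 60
flexible_node_hours = node_hours(phases)
```

With these made-up numbers the fixed allocation charges 6656 node-hours while per-phase sizing charges 4109, which is the waste the comment above is pointing at.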