2023 ECP Annual Meeting BoF

Exascale Workflows Community: A Post-ECP Roadmap for Workflow Systems and Applications

Thursday - Jan 19, 2023
1:30pm-2:15pm CST
Founders 3+4

Complex workflows have increasingly high computational and I/O demands and are composed of many communicating orchestrators, services, and simulations. The ECP ecosystem has observed an increasing pervasiveness of workflow systems to handle these classes of complex, distributed applications. The interplay of workflow technologies and HPC has been challenged by the fast rise of ML technologies: ML needs to be integrated within workflows to ease their development and portability; workflows must benefit from ML to scale and increase execution efficiency; HPC must embrace workflows to democratize access to its resources, as workflow systems should better exploit the power of HPC systems; ML must adapt to HPC architectures to scale to real-world large settings, and HPC must adjust to ML requirements to provide the computational resources needed to scale up to extremely large data sets. Furthermore, HPC resource allocation policies and scheduler designs typically provide a simple "job" abstraction instead of workflow-aware abstractions. As a result, it is difficult to run exascale workflows efficiently and conveniently on HPC systems without extending resource management/scheduling approaches. In this BoF, we will bring together researchers and DOE facility representatives from the workflows, HPC, and AI/ML communities who work on scientific research questions requiring large-scale, distributed, and AI-heavy computing at exascale, as well as on workflows that allow the integration of (physical) experiments and computational facilities. The session will discuss challenges, opportunities, and future pathways, and will seek input for a post-ECP community roadmap focused on the sustainability of HPC and AI workflow software and applications.

Specifically, the BoF session discussions will focus on the following questions:

In the era of exascale and ML/AI, what are the emerging and future crucial challenges for post-ECP workflows?
Considering workflows sustainability, what are the key constraints and opportunities to attain sustainability?
Which software/technology are essential for enabling sustainable workflows post-ECP?

Agenda

1:30pm-1:40pm - Opening remarks

Rafael Ferreira da Silva - Oak Ridge National Laboratory

1:40pm-2:15pm - Panel Discussion

Shantenu Jha (Moderator) - Brookhaven National Laboratory

Katie Antypas - Lawrence Berkeley National Laboratory

Todd Gamblin - Lawrence Livermore National Laboratory

Andrew Gallo - GE Research

Arjun Shankar - Oak Ridge National Laboratory

Participant Contributions

In the era of exascale and ML/AI, what are the emerging and future crucial challenges for post-ECP workflows?

Managing dynamic workflows at scale

In the near term, we need to put in place services and capabilities in data centers that enable data management and movement in a portable way and that de-emphasize filesystems as the main orchestration mechanism.

(1) Support for diverse coupling of HPC and AI/ML components, and for integration scenarios, without wholesale refactoring of applications. Flexible frameworks that enable a range of such scenarios are needed. (2) Simplicity of deployment of ML/AI on HPC.

Data continuum to/from ML/AI workflows and traditional workflows

Data, data, and data. Locality, provenance, movement, and cost

Cross-facility integration; Robust standards for APIs and interfaces

Challenge problems that integrate HPC simulations with measurements/observations include digital twins, validation and verification, and experiment design or steering. What are the workflows that support these use cases? How do they differ?

From Tuesday's talk: different tasks in a workflow may need to be submitted on different systems

Leveraging ML/AI to reduce the turnaround time of completing a workflow.

Interactive, human-in-the-loop model evolution/iteration is a must. How that fits into existing workflow engine designs is an interesting question

Difficulty providing infrastructure for workflows that require dependable real-time turnaround

Interconnection of systems; Standardized APIs; Data management

One issue is that training data is often generated by physics models. These models take a lot more time and resources than a typical ML model for either training or inference. So there is a “length scale” problem.

Reproducibility - retain how and where the code was run (environment, parameters, configuration, dependencies, etc.)

Dealing with heterogeneity of compute; Analysis of performance; Finding the shortest path to a solution; Reproducibility of workflows; Policy-driven behavior vs technology behaviors
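
As an illustration of the reproducibility point above (retaining the environment, parameters, configuration, and dependencies of a run), here is a minimal Python sketch. It is not taken from any participant's tooling: the function name `capture_run_record`, the field names, and the example parameters and `run.yaml` path are all hypothetical, and it records only what the Python standard library can see.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_run_record(parameters, config_path=None):
    """Collect a minimal, JSON-serializable record of how and where a task ran.

    Hypothetical helper for illustration; field names are assumptions.
    """
    # Snapshot of installed Python distributions and their versions.
    dependencies = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            continue  # skip distributions with unreadable metadata
        if name:
            dependencies[name] = version

    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        "command": list(sys.argv),
        "parameters": dict(parameters),
        "config_path": config_path,
        "dependencies": dict(sorted(dependencies.items())),
    }


# Example usage with made-up parameters and a made-up configuration path.
record = capture_run_record(
    {"timestep": 0.01, "ensemble_size": 8}, config_path="run.yaml"
)
serialized = json.dumps(record, indent=2)
print(f"captured {len(record['dependencies'])} dependencies on {record['platform']}")
```

A record like this could be written next to each task's output. A real workflow system would likely also need container image identifiers, loaded modules, and scheduler job metadata, which this sketch does not attempt to capture.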

Considering workflows sustainability, what are the key constraints and opportunities to attain sustainability?

Funding specific software vs applications vs workflows

If we cannot have a unified funding stream for sustainability, we should encourage agencies and program managers to promote participation and collaboration in common software infrastructures and processes.

Building blocks that existing and future developers can use to customize/extend/develop their own workflow capabilities.

A cognizant and consistent workforce that understands the workflow

Reaching critical mass

Workflows that cross institutional or organizational domains of ownership/control. We need policies and support from DOE

Future funding

Availability of no-code or low-code tools for defining and managing a workflow, which lower the barrier to leveraging workflows.

The key opportunity is to create a wealth of knowledge and reusable capabilities. With a critical mass of workflows, I foresee a path to autonomous AI/ML-driven exploration of scientific questions. One constraint is that we need to think about how to sustain workflows at scale, at a population level.

Opportunities - containerized and cloud technology for resilience

Standardized formats; Portability

If workflows are complicated enough, then managing the dependencies becomes a challenge. You need to test all the interdependencies to ensure that the whole workflow keeps working.

Opportunity: define benchmarks around workflows (and hope such a benchmark lives on like LINPACK did); Constraint: to some extent, a scientific workflow is often implemented to do something for the first time (a never-done-before experiment), and the tools required for production workflows vs experimental workflows are not the same

Which software/technology are essential for enabling sustainable workflows post-ECP?

Workflow systems and data management systems

Cloud and classic-HPC interoperability and portability

Depends upon workflow applications; in addition to application kernels, it includes libraries, workflow and resource management middleware, automation capabilities, data management software etc.

Containers

Collaboration: far more important than point solutions.

Containers; Standardized APIs and interfaces with a clear governance model

Automated builds, tools like containers and Spack

Package manager, build tools, Python, resource manager, GUI/CLI/API

There's a need for cyberinfrastructures to enable the curation and maintenance of workflows, and policies for governance for workflows as an enduring institutional resource. There is ongoing work at Sandia on this front.

Containers, cloud-like services, systems with failovers or ability to burst to cloud

Support for Resource/Workload Managers; Workflow-aware schedulers

Ultimately, it would be nice if the scale could be flexible. Some steps may need only a couple of nodes, while others could use thousands. To avoid wasting resources, it would be nice if different phases could use different resource allocations.
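
To make the last point concrete, the following Python sketch compares the node-hours consumed by one allocation sized for the largest phase against allocations sized per phase. The phase names, node counts, and runtimes are invented purely for illustration and do not describe any real workflow or system.

```python
# Hypothetical workflow phases: (name, nodes needed, runtime in hours).
phases = [
    ("preprocess", 2, 1.0),
    ("simulate", 1000, 4.0),
    ("analyze", 16, 2.0),
]


def node_hours_fixed(phases):
    """One allocation sized for the largest phase, held for the whole workflow."""
    peak_nodes = max(nodes for _, nodes, _ in phases)
    total_hours = sum(hours for _, _, hours in phases)
    return peak_nodes * total_hours


def node_hours_per_phase(phases):
    """Each phase holds only the nodes it needs, for only as long as it runs."""
    return sum(nodes * hours for _, nodes, hours in phases)


fixed = node_hours_fixed(phases)          # 1000 nodes * 7.0 hours
per_phase = node_hours_per_phase(phases)  # 2*1.0 + 1000*4.0 + 16*2.0
idle = fixed - per_phase                  # node-hours reserved but unused

print(f"fixed allocation:     {fixed:.0f} node-hours")
print(f"per-phase allocation: {per_phase:.0f} node-hours")
print(f"idle under fixed:     {idle:.0f} node-hours ({idle / fixed:.0%})")
```

With these made-up numbers, the fixed allocation reserves 7000 node-hours while the phases use 4034, so roughly 42% of the reservation sits idle. The sketch ignores queue wait time between phases, which is one reason workflow-aware scheduling is harder in practice than this arithmetic suggests.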
