The Elephant in the Data Center – Energy Consumption of Workflow Executions

The execution of scientific workflows consumes electric energy – and often quite a lot, as workflow systems are usually employed for large data sets, complex analysis tasks, and distributed execution over sizeable compute clusters. While techniques to reduce this consumption have been researched for decades, they have now reached a new level of urgency, as energy costs skyrocket (at least in Europe) and the world increasingly acknowledges that the era of low-cost energy production from carbon-heavy fuels will have to come to an end. Thus, it is high time for every workflow researcher and workflow user to ask themselves: what can we do to save energy? The answer to this question has multiple facets.

The classical response is “increasing efficiency of software and hardware”, and research in the past has almost exclusively focused on this idea. Efficiency in terms of energy can intuitively be understood as processing output per watt; increasing efficiency thus means producing more output for the same amount of energy or producing the same output with less energy (or, in an ideal world, both). Techniques typically focus on the latter, for instance by developing more energy-efficient algorithms (e.g. JouleSort) or using more energy-efficient hardware (e.g. GPUs or FPGAs) for specific problems. More recently, the same idea has been followed by work in machine learning that tries to achieve the same prediction accuracy with much smaller models, which require less energy for training and application.

However, from the perspective of a large data center, these approaches all too often fail to actually save energy, simply because a data center typically runs at full load – no matter how many efficiency-enhancing tricks are applied. When codes get more efficient, a given analysis is finished in less time (thus requiring less energy), yet the freed time is typically not used to power down the cluster, but rather to run some other analysis that otherwise might not have been prioritized; in commercial clouds, such idle times are sold over the spot market. From a purely economic point of view, this historically made a lot of sense, as acquiring a cluster was much more expensive than running it (ignoring the cost of human administration, which must be paid at a constant rate anyway), and thus all clusters should be used to their maximum degree to achieve a good return on investment. At this point we must note the important difference between “increasing energy efficiency” and really “saving energy”: more efficient algorithms achieve the former but not the latter when the point of view is widened from a single workflow execution to a data center perspective.

A similar effect can be observed with more energy-efficient hardware. Clearly, a data center built with modern low-energy chips can perform the same computation as one built with older high-energy chips with less energy, yet this effect is typically outweighed by the trend that data centers simply become bigger and bigger and thus consume more energy again – an instance of the Jevons paradox. The same pattern exists with cars, where the average fuel consumption has been reduced only marginally over the last 20 years, despite the fact that more and more efficient engines were developed, simply because these are built into ever larger and heavier cars.
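
The rebound effect described above can be illustrated with a little arithmetic. The following is only a sketch with made-up numbers (the efficiency gain and the growth factor are illustrative assumptions, not measurements): total energy is the work performed divided by the efficiency, so consumption only falls if the workload grows more slowly than efficiency improves.

```python
def total_energy_kwh(work_units: float, work_units_per_kwh: float) -> float:
    """Energy needed to process `work_units` at a given efficiency."""
    return work_units / work_units_per_kwh


# Hypothetical numbers, chosen only to illustrate the effect.
old_energy = total_energy_kwh(work_units=1_000_000, work_units_per_kwh=100)

# The hardware becomes twice as efficient, but the data center grows
# and now processes three times as much work.
new_energy = total_energy_kwh(work_units=3_000_000, work_units_per_kwh=200)

print(f"before: {old_energy:.0f} kWh, after: {new_energy:.0f} kWh")
```

In this toy setting, efficiency doubles and yet total consumption rises by half, because the workload tripled.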

The local perspective – developing individual workflows

Now – what can workflow research do to save energy and cost? From a single-workflow perspective, there are a few obvious measures along the lines we just discussed. We can save energy by using more efficient codes and by running the workflow on more efficient hardware. We can save cost by energy-aware scheduling, i.e., running the workflow at times when energy is cheaper, thus exploiting price differences offered by the local energy provider; in the future (and partly already at present), it is quite conceivable that energy will become cheaper whenever it is produced in abundance, for instance by solar panels on sunny days or wind turbines on windy days, or when consumption is low, for instance at night and in summer. Even if such scheduling may not save lots of money, it would certainly help to reduce carbon emissions by trading “bad” energy for “better” energy.
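
As a minimal sketch of such energy-aware scheduling, the snippet below picks the start hour that minimizes the summed price over the duration of a run. It rests on several assumptions not made in the text: an hourly price forecast is available, the workflow can be delayed freely, and it draws roughly constant power while running. The function name `cheapest_start_hour` and the price values are hypothetical.

```python
from typing import Sequence


def cheapest_start_hour(prices: Sequence[float], duration_hours: int) -> int:
    """Return the start index of the contiguous window with the lowest total price."""
    if not 0 < duration_hours <= len(prices):
        raise ValueError("duration must be between 1 and the number of price slots")
    # Slide a window of fixed length over the forecast and keep the cheapest one.
    window_cost = sum(prices[:duration_hours])
    best_cost, best_start = window_cost, 0
    for start in range(1, len(prices) - duration_hours + 1):
        window_cost += prices[start + duration_hours - 1] - prices[start - 1]
        if window_cost < best_cost:
            best_cost, best_start = window_cost, start
    return best_start


# Hypothetical hourly prices (cents per kWh) for one day, cheapest around midday.
hourly_prices = [30, 28, 25, 24, 24, 26, 32, 38, 40, 35, 28, 20,
                 15, 12, 14, 22, 33, 42, 45, 44, 40, 36, 33, 31]
start = cheapest_start_hour(hourly_prices, duration_hours=3)
print(f"start the 3-hour workflow at hour {start}")
```

The same function could be fed a forecast of carbon intensity instead of prices if the goal is to favor “better” energy rather than cheaper energy.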

However, there are also less obvious measures for saving energy (and money), which revolve around the concept of “output” of a workflow. So far, we implicitly assumed that a workflow is a fixed set of computations turning some input data into a computational result. However, that is a very computer-science-focused point of view – most users are much more interested in the scientific result than in the specific computation that was performed to produce it. This leads to the question: can we have the same, or almost the same, scientific result with less computation? The answer very often is: yes. One example was already mentioned before, i.e., the usage of smaller and thus lower-energy models in machine learning. However, in many types of analysis, many more ideas can be followed. For instance, classification or regression tasks are often evaluated in a k-fold cross-validation. The larger the k, the more energy is required, as k-fold cross-validation performs k rounds of training and testing. Thus, choosing a smaller k typically saves energy; the price might be a less robust result. In NLP, training and fine-tuning have already become so expensive in many tasks that typical evaluations have dropped the idea of cross-validation and use a single test partition instead; apparently, the “price” has been accepted by the community. Bootstrapping or Monte-Carlo-style analysis methods perform the same type of computations tens of thousands of times to achieve robust results; reducing the number of repetitions leads to a linear decrease in energy consumption, again at the cost of less robust results. Hyperparameter optimization also performs model training and evaluation repetitively with slightly different inputs (e.g. using grid search or Bayesian methods); again, reducing the number of runs leads to a linear decrease in energy consumption, at the peril of not finding a better parameterization and thus not having the best possible result.
Iterative model improvements are pushed to their extremes in projects around AutoML, which often boil down to a staggering number of trial-and-error computations with varying algorithms, input sets, hyperparameters, etc. Finally, sampling is a ubiquitous technique to reduce the computational cost of studying very large data sets; of course, smaller sample sizes also reduce energy costs. Note that none of these ideas leads to a provably “non-optimal” solution; rather, solutions typically improve (slightly) with more iterations or more samples, and a workflow developer thus always has to decide at which point to stop. Probably, energy consumption should and will soon be a strong argument for stopping earlier, and the community should start accepting this argument when reviewing papers or project grants. An important first step could be that we all report energy consumption figures with our analyses.
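
The linear relationship between the number of repetitions and energy consumption can be sketched as a back-of-the-envelope estimate. The snippet assumes that every repetition takes about the same time and that the machine draws a roughly constant average power; the function name `energy_kwh` and all numbers are hypothetical and only serve as an illustration.

```python
def energy_kwh(runs: int, seconds_per_run: float, avg_power_watts: float) -> float:
    """Energy of `runs` similar repetitions, assuming constant average power."""
    joules = runs * seconds_per_run * avg_power_watts  # watt-seconds
    return joules / 3_600_000  # 1 kWh = 3.6 million joules


# Hypothetical setting: one training-and-testing round takes 30 minutes at 400 W.
SECONDS_PER_RUN = 1800.0
AVG_POWER_WATTS = 400.0

ten_fold = energy_kwh(10, SECONDS_PER_RUN, AVG_POWER_WATTS)
five_fold = energy_kwh(5, SECONDS_PER_RUN, AVG_POWER_WATTS)

# A grid search over 12 hyperparameter settings, each evaluated with 5 folds.
grid_search = energy_kwh(12 * 5, SECONDS_PER_RUN, AVG_POWER_WATTS)

print(f"10-fold: {ten_fold} kWh, 5-fold: {five_fold} kWh, grid search: {grid_search} kWh")
```

An estimate of this kind is also a cheap way to attach an approximate energy figure to a reported analysis when no direct measurement is available.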

The global perspective – data centers running multiple workflows

What can a data center do to save energy or reduce energy costs? The most obvious measure is to reduce services, for instance by switching off a cluster partly or entirely. This clearly is an extreme action that probably no data center manager will favor (unless forced to after comparing the electricity bills with the available funding); however, it will undoubtedly and immediately save energy and energy costs at the same time. Less drastic and more popular measures are to use dynamic voltage and frequency scaling (DVFS), which possibly costs some performance, or to switch to more energy-efficient hardware. Especially the latter, however, again carries the danger of growing pressure to invest the energy savings in extensions of the cluster (see above), as long as the bill can still be paid, which would cancel out the savings. Furthermore, one must not forget the resources necessary to build the hardware of the cluster; using existing hardware (and not replacing it) for as long as feasible can actually be a proper means to save energy, even if a new generation of hardware might be more energy-efficient than the present one. Finding the sweet spot within this trade-off is the daily business of data center managers, and it is currently being shaken up by the changing proportions of costs for acquisition, maintenance, and operation.
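
The trade-off between keeping old hardware and buying more efficient hardware can be sketched as a simple break-even calculation. This is an illustration under strong assumptions: the energy needed to manufacture the new hardware is known, the workload stays constant (no rebound effect), and all figures below are hypothetical.

```python
def breakeven_years(embodied_kwh: float, old_annual_kwh: float, new_annual_kwh: float) -> float:
    """Years of operation until the new hardware has paid back its manufacturing energy."""
    annual_saving = old_annual_kwh - new_annual_kwh
    if annual_saving <= 0:
        # The new hardware never pays back if it does not use less energy.
        return float("inf")
    return embodied_kwh / annual_saving


# Hypothetical figures for a single node.
years = breakeven_years(embodied_kwh=20_000, old_annual_kwh=10_000, new_annual_kwh=6_000)
print(f"replacement pays off energetically after {years} years")
```

If the expected remaining lifetime of the new hardware is shorter than this break-even time, keeping the existing hardware is the lower-energy choice in this simplified model.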

Another option is to put more effort into avoiding redundant computation across multiple workflows. In many fields of science, certain reference data sets exist that are used by many groups; examples in the life sciences are large genome data sets, e.g. The Cancer Genome Atlas; examples in remote sensing are series of images from satellites like Landsat; examples in astronomy are the images from large sky observations like the Sloan Digital Sky Survey. These data sets are typically published in a raw format, yet many downstream analyses require some form of pre-processing that is often very similar or even completely identical across multiple workflows. Performing such processing only once can save substantial energy, but it requires either careful consultation with users or automatic means to identify identical computations across workflows; in a long-term vision, such pre-processing could even be shared across multiple data centers, not just multiple workflows. However, it must not be forgotten that keeping results for later access also costs energy (as long as they are not written to tape archives) to keep memory powered, SSDs responding, or disks spinning. Intelligent research data management and advanced workflow optimization thus can play an important role in saving energy.
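
One possible automatic means to identify identical computations is to derive a key from the data set, the processing step, and its parameters, and to reuse a stored result whenever the same key shows up again. The snippet below is only a sketch of this idea: the in-memory dictionary stands in for a shared result store, and the names `task_key`, `run_once`, and the data set identifier are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict

# In-memory stand-in for a shared result store (assumption: a real deployment
# would use a persistent store accessible to all workflows).
_results: Dict[str, Any] = {}


def task_key(dataset_id: str, step: str, params: Dict[str, Any]) -> str:
    """Derive a stable key; sorting keys makes it independent of parameter order."""
    payload = json.dumps(
        {"dataset": dataset_id, "step": step, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_once(dataset_id: str, step: str, params: Dict[str, Any],
             compute: Callable[[], Any]) -> Any:
    """Run `compute` only if no result for the same key has been stored yet."""
    key = task_key(dataset_id, step, params)
    if key not in _results:
        _results[key] = compute()
    return _results[key]


calls = {"count": 0}


def expensive_preprocessing() -> str:
    calls["count"] += 1
    return "preprocessed-data"


# Two workflows request the same pre-processing, with parameters in different order.
first = run_once("survey-images-v1", "normalize",
                 {"tile": 256, "bands": [1, 2]}, expensive_preprocessing)
second = run_once("survey-images-v1", "normalize",
                  {"bands": [1, 2], "tile": 256}, expensive_preprocessing)
print(f"pre-processing executed {calls['count']} time(s)")
```

Such a key only detects computations that are declared identically; it says nothing about how long a stored result should be kept, which is where the storage energy mentioned above enters the trade-off.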

Putting it all together – turning every screw

The necessity of saving energy and energy cost is no longer a topic of the future or important only in small niches – it has become an urgent need for all of us now. This also affects workflow research and research performed with workflows, as executing workflows is particularly energy-hungry. We discussed several approaches to these problems that differ in their focus (saving energy versus saving money) and their scope (individual workflows versus data centers). There are surely many more ideas than presented in this short commentary; we would be eager to learn about them, for instance by personal mail (leser@informatik.hu-berlin.de).

From a 10,000-foot perspective, the approaches can all be considered variations of a common theme, extending the classical trade-off in high-performance computing (i.e., compute time versus compute costs) with a crucial third factor: energy consumption. We believe that this more complex landscape requires new lines of research in diverse areas such as scheduling, research data management, heterogeneous architectures, monitoring, and workflow optimization – now.

Jan 13, 2023

Ulf Leser (Humboldt-Universität zu Berlin)

Lauritz Thamsen (University of Glasgow)