How much CO2 is generated by deep neural networks and AI?
Even as governments and industries seek to limit greenhouse gas emissions, through standards and legislation, there are very few studies of carbon footprint at the individual systems level.
Why is this important? Effective change means altering end-user behaviour, and that implies understanding our personal, local behaviour. Managing climate change starts with us—as a collective, and as individuals.
As a group, data scientists are responsible for deploying significant new compute capacity. In the AI world, data scientists train their systems on huge quantities of data, with almost zero thought about the carbon footprint.
How are data scientists informed about the CO2 impact of their actions, and how do they enforce CO2 limits at the AI data center?
If you want to manage it, measure it
The logical first step is to measure the energy consumed by systems where deep neural networks are trained. These systems can run for several months, offering good insight into usage and emissions.
To measure CO2 emissions during deep learning training, I chose ResNet50 v1.5, a popular model that scales well across multiple systems. To scale the model I used Horovod with the NCCL backend, together with the IBM DDL run utility for seamless integration with the workload manager.
To complete the tests, the MIT-IBM Watson AI Lab ran the model on its new Satori cluster, one of the largest AI clusters at MIT, which is based on IBM AC922 compute nodes. We modified the cluster slightly so that we could measure the energy for each training job while still supporting normal operations.
We ran the training job starting with 4x V100 GPUs (32 GB HBM2, SXM2) and increased the GPU count in four-fold steps until we reached ¼ of the cluster capacity, which equals one compute POD (16x AC922s). Each training job ran in ‘exclusive’ mode on systems dedicated to that job, so that power drawn by other workloads on the same compute node was excluded from the measurement.
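As an illustration of the per-job measurement step, here is a minimal sketch of how sampled power readings could be turned into an energy figure. It is not the instrumentation used on Satori: the function name `energy_kwh`, the sample format, and the readings are all hypothetical, and the integration uses a simple trapezoidal rule.

```python
from typing import Sequence, Tuple


def energy_kwh(samples: Sequence[Tuple[float, float]]) -> float:
    """Integrate (time_in_seconds, power_in_watts) samples into kWh.

    Hypothetical helper: uses the trapezoidal rule between consecutive samples.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3.6e6  # 1 kWh = 3.6 million joules


# Hypothetical readings: one node drawing a constant 1500 W for one hour.
readings = [(0.0, 1500.0), (1800.0, 1500.0), (3600.0, 1500.0)]
print(f"{energy_kwh(readings):.3f} kWh")
```

Running jobs in exclusive mode matters for a calculation like this, because every watt sampled on the node can then be attributed to the one training job.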
What is the AI and deep learning carbon footprint?
Essentially, there are no surprises. Systems consume more power as we request more GPUs (from 8 to 64), and training jobs take longer when fewer GPUs are used. For ResNet50 v1.5, performance scales almost linearly with power consumption. At higher scales, energy consumption rises slightly faster as internode communication increases, and efficiency per watt decreases.
The real surprise is the total amount of energy involved – a combination of the very large cluster and the length of time required for training.
Because the CO2 is emitted at the source, where the electricity is generated, the useful figure for our purposes is the CO2 emission intensity, expressed in ‘kg CO2/kWh’. This unit tells us how many kilograms of CO2 are emitted for each kWh of electricity generated, and allows us to say how much carbon must be offset.
Using 2016 data supplied by the European Environment Agency, the European Union averages 0.296 kg CO2/kWh. Note that some countries report higher values, for example Romania 0.306, Germany 0.441, and Malta 0.648 kg CO2/kWh. In other words, those countries produce more CO2 for each kWh of electricity generated.
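The conversion from energy to emissions is a single multiplication, sketched below with the intensity values quoted above. The function name `co2_kg` and the 1,000 kWh job are hypothetical and chosen only to make the arithmetic visible; they are not measurements from the Satori runs.

```python
# 2016 emission intensities in kg CO2 per kWh, as quoted in the text.
INTENSITY_KG_PER_KWH = {
    "EU average": 0.296,
    "Romania": 0.306,
    "Germany": 0.441,
    "Malta": 0.648,
}


def co2_kg(energy_kwh: float, region: str = "EU average") -> float:
    """Return the kilograms of CO2 emitted to generate `energy_kwh` in `region`."""
    if energy_kwh < 0:
        raise ValueError("energy must be non-negative")
    return energy_kwh * INTENSITY_KG_PER_KWH[region]


# Hypothetical training job that consumed 1,000 kWh.
job_kwh = 1000.0
for region in INTENSITY_KG_PER_KWH:
    print(f"{region:<10} {co2_kg(job_kwh, region):7.1f} kg CO2")
```

The same job therefore emits roughly 2.2 times more CO2 on Malta's grid than on the EU average, purely because of where the electricity comes from.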
Using the lower EU figure, we estimate that training a model such as BigGAN for one month on 32x V100 GPUs (8x AC922s) implies a carbon offset of approximately 20 trees per month. (If you take your power from Malta, this rises to more than 40 trees per month!)
Put another way, training ResNet50 v1.5 using 4x V100 GPUs generates up to 3 times more CO2 than a typical EU household’s 10 kWh of daily electricity consumption.
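To make the household comparison concrete, the short calculation below works through the numbers. It rests on two assumptions that the text does not state explicitly: that the comparison is per day of training, and that "3 times more" means a plain factor of three. The variable names are illustrative only.

```python
HOUSEHOLD_KWH_PER_DAY = 10.0     # typical EU household, from the text
TRAINING_FACTOR = 3.0            # assumption: "up to 3 times" read as a factor of three
EU_INTENSITY_KG_PER_KWH = 0.296  # EU 2016 average, from the text

# Upper-bound daily energy for the 4-GPU training job under these assumptions.
training_kwh_per_day = HOUSEHOLD_KWH_PER_DAY * TRAINING_FACTOR

# Average power draw implied by that daily energy figure.
average_power_kw = training_kwh_per_day / 24.0

household_co2_kg = HOUSEHOLD_KWH_PER_DAY * EU_INTENSITY_KG_PER_KWH
training_co2_kg = training_kwh_per_day * EU_INTENSITY_KG_PER_KWH

print(f"training energy: {training_kwh_per_day:.0f} kWh/day "
      f"(about {average_power_kw:.2f} kW on average)")
print(f"household CO2:   {household_co2_kg:.2f} kg/day")
print(f"training CO2:    {training_co2_kg:.2f} kg/day")
```

Under these assumptions a single 4-GPU job would draw a little over 1 kW around the clock, which is why even small training runs are worth measuring.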
What did we learn and what can be done?
- For large AI systems, it is clearly very important to inform users of the CO2 generated during training. Based on this information, data scientists can change their behavior;
- Because training AI systems requires large amounts of data, often processed over many weeks, energy consumption is significant, even at the individual model level. Renewable energy is the preferable source;
- To enforce a maximum on CO2 emissions, system workload managers can play a key role by providing continuous measurement and enabling managers to impose a yearly limit or target for each group of researchers;
- Any solution that reduces AI training times should be favored. For example, transfer learning can decrease training times considerably, and therefore cut CO2 emissions;
- Model checkpointing should be standard practice: save the learnable parameters (i.e. weights and biases) of the model being trained every x epochs, regardless of performance, so that an interrupted job can resume rather than restart from scratch. For example, the tf.keras.callbacks.ModelCheckpoint callback can save the model continually during training and again at the end; PyTorch supports a similar approach by saving the model's state_dict;
- For accelerated clusters with a high-speed interconnect (e.g. InfiniBand EDR), datasets should be converted to HDF5 or TFRecord format, so that large image or video datasets can be moved much faster from central storage to the compute nodes.
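The yearly limit per research group suggested in the list above could look something like the following sketch. It is not the interface of any real workload manager: the `Co2Budget` class, its method names, the group name, and the limit are all hypothetical, and the default intensity is simply the EU 2016 average quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class Co2Budget:
    """Hypothetical yearly CO2 allowance for one research group."""

    group: str
    yearly_limit_kg: float
    intensity_kg_per_kwh: float = 0.296  # EU 2016 average quoted in the article
    used_kg: float = 0.0

    def remaining_kg(self) -> float:
        """CO2 allowance still available this year, never below zero."""
        return max(self.yearly_limit_kg - self.used_kg, 0.0)

    def admits(self, estimated_kwh: float) -> bool:
        """Check whether a job with this estimated energy fits the remaining budget."""
        return estimated_kwh * self.intensity_kg_per_kwh <= self.remaining_kg()

    def record(self, measured_kwh: float) -> float:
        """Charge a finished job's measured energy to the budget; return kg CO2 added."""
        emitted_kg = measured_kwh * self.intensity_kg_per_kwh
        self.used_kg += emitted_kg
        return emitted_kg


# Hypothetical group with a 1,000 kg CO2 yearly limit.
budget = Co2Budget(group="vision-lab", yearly_limit_kg=1000.0)
first_ok = budget.admits(2000.0)   # would this 2,000 kWh job fit?
budget.record(2000.0)              # the job ran; charge its measured energy
second_ok = budget.admits(2000.0)  # would an identical second job still fit?
print(first_ok, second_ok, f"{budget.remaining_kg():.0f} kg left")
```

Separating the admission check from the recording step reflects the two roles described above: continuous measurement feeds `record`, while the limit itself is enforced before a job starts.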
I am grateful for the invaluable support and assistance of the MIT-IBM Watson AI Lab and would like to especially thank Chris Hill, John Cohn, and my colleague Farid Parpia for their help and advice throughout the project.