Reduce Linux power consumption, Part 2: General and governor-specific settings

Learn what you can set and how the settings control power usage

This three-part series is your starting point for tuning your system for power efficiency. In Part 2, follow a step-by-step guide on the general settings of the Linux CPUfreq subsystem and get more details on the five in-kernel governors -- performance, powersave, userspace, ondemand, and conservative -- and their settings.

Jenifer Hopper, Software Engineer, IBM  

Jenifer Hopper is a software engineer for the IBM Linux performance group in Austin, Texas. Her current focus is on High Performance Computing (HPC) and energy workloads, as well as system profilers and data analysis tools.



23 September 2009

About this series

In this series, learn how to tune your Linux-based IBM System x® server for power efficiency. You'll learn about the in-kernel governors and their settings and how to use them; you'll also see the effects of the tuned governors on a power performance and e-commerce workload. The examples are based on a System x server running Red Hat Enterprise Linux version 5.2 (RHEL 5.2), but the same guidelines apply to any of the 2.6.x kernels, as well as any processor type that supports frequency scaling.

Part 1 introduces the components and concepts you'll need to tune your system for power efficiency, including the Linux CPUfreq subsystem, C and P states, and the five in-kernel governors.

Part 2 gives more details on the general settings of the Linux CPUfreq subsystem and the five in-kernel governors—performance, powersave, userspace, ondemand, and conservative—and their settings.

Part 3 compares the performance of the five in-kernel governors in both a tuned and an untuned state to show you what results you can achieve by power tuning your system.

CPUfreq general settings

Let's look at how easy it is to get started with the Linux CPUfreq subsystem by detailing its usage settings and providing some interface options. We'll start with the general interfaces:

  • The /sys interface
  • The cpuspeed settings file
  • cpufreq-utils

Using the /sys interface

The /sys filesystem provides a user interface for CPUfreq, starting at /sys/devices/system/cpu/. Some of these files are writable (by root) and others are read-only.

First, take a look at /sys/devices/system/cpu/. Here you will find a directory for each logical CPU, the sched_mc_power_savings tunable, and, if available on your system, the sched_smt_power_savings tunable. I discuss both tunables later.

Listing 1. Checking the contents of /sys/devices/system/cpu/
[root@systemx ~]# cd /sys/devices/system/cpu/
[root@systemx cpu]# ls
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 sched_mc_power_savings

Inside each processor's directory is a cpufreq directory, which contains the CPUfreq interface:

Listing 2. Checking the cpufreq directory
[root@systemx cpu]# cd cpu0/cpufreq/
[root@systemx cpufreq]# ls -l
total 0
-r--r--r-- 1 root root 4096 Oct 31 14:53 affected_cpus
-r-------- 1 root root 4096 Oct 31 14:53 cpuinfo_cur_freq
-r--r--r-- 1 root root 4096 Oct 31 14:53 cpuinfo_max_freq
-r--r--r-- 1 root root 4096 Oct 31 14:53 cpuinfo_min_freq
-r--r--r-- 1 root root 4096 Oct 31 14:53 scaling_available_frequencies
-r--r--r-- 1 root root 4096 Oct 31 14:53 scaling_available_governors
-r--r--r-- 1 root root 4096 Oct 31 14:53 scaling_cur_freq
-r--r--r-- 1 root root 4096 Oct 31 14:53 scaling_driver
-rw-r--r-- 1 root root 0 Nov 5 11:44 scaling_governor
-rw-r--r-- 1 root root 4096 Oct 31 14:53 scaling_max_freq
-rw-r--r-- 1 root root 4096 Oct 31 14:53 scaling_min_freq

If the governor is set to conservative or ondemand, you will also see a directory of the governor's name here. We will discuss how to change the governor later.

These files are available for every governor. We'll talk about what each of the settings means and how to change some of them; then we will discuss some governor-specific settings beyond this interface. Note that the settings under the cpufreq directory can be different for each processor, so to get a uniform policy across processors, you must change the setting's value for each processor as described in the following sections.

First, affected_cpus shows which processors are affected by a frequency change. The reason is that some processors are frequency-dependent on each other due to a coordination of hardware or software or both, and must change frequencies at the same time. For example, you might see this type of setup:

Listing 3. Checking which processors are affected by frequency change
[root@systemx ~]# cd /sys/devices/system/cpu
[root@systemx cpu]# grep . cpu*/cpufreq/affected_cpus
cpu0/cpufreq/affected_cpus:0 1
cpu1/cpufreq/affected_cpus:0 1
cpu2/cpufreq/affected_cpus:2 3
cpu3/cpufreq/affected_cpus:2 3

Next, cpuinfo_cur_freq shows the processor's current operating frequency. The file scaling_cur_freq lists the current scaling frequency that the governors are using.

Listing 4. Checking frequencies
[root@systemx cpufreq]# cat cpuinfo_cur_freq
2997000
[root@systemx cpufreq]# cat scaling_cur_freq
2997000

All frequencies listed in this interface are in kHz.

Next are some files that provide information about the available processor frequencies. The files cpuinfo_max_freq and cpuinfo_min_freq hold the max and min frequencies available to the system; scaling_available_frequencies shows all available frequencies.

Listing 5. Checking max, min, and available frequencies
[root@systemx cpufreq]# cat cpuinfo_max_freq
2997000
[root@systemx cpufreq]# cat cpuinfo_min_freq
1998000
[root@systemx cpufreq]# cat scaling_available_frequencies
2997000 2664000 2331000 1998000
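
Because the interface reports kHz, it can help to convert the values when reading them. The following is a minimal sketch; the helper name khz_to_mhz is an assumption made for illustration and is not part of CPUfreq or any package mentioned here.

```shell
# khz_to_mhz FREQ_KHZ...
# Print each frequency given in kHz (as reported by the cpufreq files) in MHz.
# khz_to_mhz is a hypothetical helper name used only for this example.
khz_to_mhz() {
    for khz in "$@"; do
        # Integer division; exact for the frequencies shown in Listing 5
        echo "$((khz / 1000)) MHz"
    done
}
```

For example, khz_to_mhz $(cat scaling_available_frequencies) run in the cpufreq directory from Listing 5 would print 2997 MHz, 2664 MHz, 2331 MHz, and 1998 MHz, one per line.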

The scaling_available_governors file lists all available governors. If you do not see all five of the governors, check to make sure all the governors are enabled in your config file and that you have the governor's module loaded as I described in Part 1.

Listing 6. Checking available governors
[root@systemx cpufreq]# cat scaling_available_governors
ondemand powersave conservative userspace performance

The file scaling_driver tells you which cpufreq driver your system is running. Typical drivers include acpi-cpufreq, speedstep-smi, speedstep-centrino, powernow-k8, powernow-k7, and longhaul. To change the driver, you must unload the driver in use before loading another one. Also, be sure to check that the driver works with your processor before trying to use it.

Listing 7. Checking which cpufreq driver your system is running
[root@systemx cpufreq]# cat scaling_driver
centrino

The rest of the files in this directory are writable by root and give the user the ability to change some cpufreq settings. These files are the only settings the user is allowed to change for the powersave and performance governors. The other governors have additional settings available which we discuss in the next section.

First, the file scaling_governor shows the current governor enabled. To change the governor, simply echo the new governor's name into this file. Note that you must do this for every processor to obtain a uniform policy. For example:

Listing 8. Checking which governor is enabled and changing governors
[root@systemx ~]# cd /sys/devices/system/cpu/
[root@systemx cpu]# ls
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 sched_mc_power_savings
[root@systemx cpu]# cat cpu0/cpufreq/scaling_governor
performance
[root@systemx cpu]# echo conservative > cpu0/cpufreq/scaling_governor
[root@systemx cpu]# cat cpu0/cpufreq/scaling_governor
conservative
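
Because the governor has to be written for every processor, a small loop saves repetition. The sketch below is one possible way to do it, not a tool shipped with any distribution: the function name set_governor_all and its optional CPU_ROOT argument are assumptions, and CPU_ROOT simply defaults to the /sys path used in the listings. It must run as root on a real system.

```shell
# set_governor_all GOVERNOR [CPU_ROOT]
# Write GOVERNOR into every cpuN/cpufreq/scaling_governor under CPU_ROOT.
# The function name and the CPU_ROOT argument are illustrative assumptions.
set_governor_all() (
    governor=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    status=0
    for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        # Only write governors that this CPU lists as available
        avail=" $(cat "$dir/scaling_available_governors" 2>/dev/null) "
        case $avail in
            *" $governor "*)
                echo "$governor" > "$dir/scaling_governor" || status=1
                ;;
            *)
                echo "$dir: governor '$governor' is not available" >&2
                status=1
                ;;
        esac
    done
    # Non-zero if any CPU could not be switched
    exit "$status"
)
```

Calling set_governor_all conservative would then apply the change from Listing 8 to all processors instead of only cpu0.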

The scaling_max_freq and scaling_min_freq files show the max and min frequencies available to the governor. The user can change the range of frequencies available to the governor by echoing an available frequency into these files. Note that the frequency must be one of the frequencies listed in scaling_available_frequencies since these are all of the processor frequencies available to the system. Again, note that you must do this for every processor. For example:

Listing 9. Changing frequencies available to governor
[root@systemx ~]# cd /sys/devices/system/cpu/
[root@systemx cpu]# cat cpu0/cpufreq/scaling_available_frequencies
2997000 2664000 2331000 1998000
[root@systemx cpu]# cat cpu0/cpufreq/scaling_max_freq
2997000
[root@systemx cpu]# cat cpu0/cpufreq/scaling_min_freq
1998000
[root@systemx cpu]# echo 2331000 > cpu0/cpufreq/scaling_min_freq
[root@systemx cpu]# cat cpu0/cpufreq/scaling_min_freq
2331000
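
The same per-processor rule applies to the frequency limits, with the extra condition that the value must appear in scaling_available_frequencies. As a rough sketch under those two rules (set_min_freq_all and its CPU_ROOT argument are hypothetical names introduced here), the check and the write can be combined:

```shell
# set_min_freq_all FREQ_KHZ [CPU_ROOT]
# Set scaling_min_freq on every CPU, but only to a frequency that the CPU
# lists in scaling_available_frequencies.
# The function name and the CPU_ROOT argument are illustrative assumptions.
set_min_freq_all() (
    freq=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    status=0
    for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        avail=" $(cat "$dir/scaling_available_frequencies" 2>/dev/null) "
        case $avail in
            *" $freq "*)
                echo "$freq" > "$dir/scaling_min_freq" || status=1
                ;;
            *)
                echo "$dir: $freq kHz is not an available frequency" >&2
                status=1
                ;;
        esac
    done
    exit "$status"
)
```

With the frequencies from Listing 9, set_min_freq_all 2331000 would succeed, while a value such as 2500000 would be rejected before anything is written.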

Using the cpuspeed settings file

In addition to directly echoing in the values for the settings as mentioned previously, a user can also use the cpuspeed settings file to change the driver, governor, min and max speeds, utilization thresholds, and the ignore_nice_load setting. RHEL 5.2 comes with cpuspeed included, but other distributions of Linux may not contain this package. If cpuspeed is not included in your distribution, you can download the carlthompson.net version; directions for installation are provided in the README. To use the RHEL 5.2 version of cpuspeed, simply edit the /etc/sysconfig/cpuspeed file, assign a value to any of the setting variables in the file, and issue the following command:

/etc/init.d/cpuspeed restart

This command will put the new settings into effect. Remember, you must have the corresponding governor module loaded to start using that governor unless it was already built in.

Using cpufreq-utils

RHEL 5.2 also comes with the cpufreq-utils package, which provides another user interface to the CPUfreq subsystem; most other distributions should include this package as well. When you install the cpufreq-utils RPM, you get two utilities: cpufreq-info and cpufreq-set.

The cpufreq-info utility will list information about the processors and their CPUfreq settings such as the current frequency, the frequency limits, the CPUfreq driver, the current policy, the current governor, and the affected-cpus list.

The cpufreq-set utility allows the user to change each processor's range of available frequencies, the operating governor, and the current running frequency when the userspace governor is enabled. For more information, see the cpufreq-info and cpufreq-set man pages.
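
As an illustration of how cpufreq-set might be applied to every processor, here is a small sketch. It assumes the -c (CPU number) and -g (governor) options of cpufreq-set; confirm them against the man page for your installed version. The wrapper name cpufreq_set_governor_all and the CPUFREQ_SET override variable are hypothetical and exist only so the loop can be tried without touching the system.

```shell
# cpufreq_set_governor_all GOVERNOR [CPU_ROOT]
# Call "cpufreq-set -c N -g GOVERNOR" once for every cpuN directory.
# CPUFREQ_SET is a hypothetical override for the command to run, which
# makes a dry run possible; by default the real cpufreq-set is used.
cpufreq_set_governor_all() (
    governor=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    tool=${CPUFREQ_SET:-cpufreq-set}
    for dir in "$cpu_root"/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        cpu=${dir##*/cpu}            # ".../cpu3" -> "3"
        "$tool" -c "$cpu" -g "$governor" || exit 1
    done
)
```

Run as root, cpufreq_set_governor_all ondemand would switch every processor to the ondemand governor; cpufreq-info can then be used to confirm the result.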


Governor-specific settings

Now let's discuss settings the user can change in the in-kernel governors.

The powersave and performance governors

These governors statically set the processor frequency to the lowest and highest frequencies, respectively. The only settings the user can change are the settings I discussed in the previous section.

Userspace governor

Now we start the discussion on governor-specific settings. If you enable the userspace governor, you will also see a file called scaling_setspeed that is writable by root in the cpufreq directory. This governor allows the user or a program in userspace to interactively change the processor frequency. The user has the ability to echo in the desired frequency to this file or allow some userspace daemon to set this value. As we did with the files we discussed earlier, you must change the scaling_setspeed file for each processor.
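
A minimal sketch of setting the speed by hand on every processor follows. The function name set_speed_all and its CPU_ROOT argument are assumptions for illustration; the only files it relies on are scaling_governor and scaling_setspeed as described above.

```shell
# set_speed_all FREQ_KHZ [CPU_ROOT]
# Write FREQ_KHZ to scaling_setspeed on every CPU that is currently using
# the userspace governor; report CPUs that are using another governor.
# The function name and the CPU_ROOT argument are illustrative assumptions.
set_speed_all() (
    freq=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    status=0
    for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        if [ "$(cat "$dir/scaling_governor" 2>/dev/null)" != "userspace" ]; then
            echo "$dir: userspace governor is not enabled" >&2
            status=1
            continue
        fi
        echo "$freq" > "$dir/scaling_setspeed" || status=1
    done
    exit "$status"
)
```

For example, set_speed_all 2664000 would request 2664000 kHz on each processor that is running the userspace governor.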

Numerous daemons work with the userspace governor to adjust the processor frequency; here are a few examples:

  • cpudyn (CPU dynamic frequency control): This daemon bases frequency changes on the processor load and can also put disks in standby when there is no activity, to save even more power.
  • cpufreqd: This daemon can be configured to react to battery level, AC status, temperature, running programs, processor usage, and more.
  • cpuspeed: This daemon can change frequencies based on processor demand, power supply changes, temperature, and more.
  • powernowd: This daemon bases frequency changes on the processor load and has four different behavioral modes that users can choose.

Ondemand governor

If you load the ondemand governor, you will see a directory called ondemand in the cpufreq directory. Inside this directory are many tunable settings. All of the writable (by root) files can be changed by echoing in the new value as shown previously. Note that any changes to the ondemand settings will be applied systemwide so you will not need to change the setting for each processor.

Listing 10. Checking tunable settings for ondemand
[root@systemx ~]# cd /sys/devices/system/cpu/cpu0/cpufreq/ondemand/
[root@systemx ondemand]# ls -l
total 0
-rw-r--r-- 1 root root 4096 Nov 19 10:30 ignore_nice_load
-rw-r--r-- 1 root root 4096 Nov 19 10:30 powersave_bias
-rw-r--r-- 1 root root 4096 Nov 19 10:30 sampling_rate
-r--r--r-- 1 root root 4096 Nov 19 10:30 sampling_rate_max
-r--r--r-- 1 root root 4096 Nov 19 10:30 sampling_rate_min
-rw-r--r-- 1 root root 4096 Nov 19 10:30 up_threshold

The ignore_nice_load file can be set to 0 or 1 (with 0 being the default). When this parameter is set to 1, any processes with a "nice" value will not be counted toward the overall processor utilization. When it is set to 0, all processes are counted toward the utilization. This setting is useful when you are running something that requires a lot of processor but you don't care about the runtime. If you apply the "nice" setting to the process, you can prevent it from influencing the frequency decisions.

Next, the powersave_bias setting modifies the behavior of the ondemand governor to save more power when performance is less of a priority: it reduces the governor's target frequency by a specified percentage. It can be set to a value between 1 and 1000, where each unit is 0.1 percent, giving a reduction of 0.1 percent to 100 percent in the target frequency.
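
The percentage arithmetic can be worked through in a couple of lines. This sketch only reproduces the reduction described above; biased_target_khz is a hypothetical name, and how the governor maps the reduced target onto the processor's real frequencies is outside its scope.

```shell
# biased_target_khz TARGET_KHZ POWERSAVE_BIAS
# Each unit of powersave_bias is 0.1 percent, so the reduced target is
# TARGET_KHZ * (1000 - POWERSAVE_BIAS) / 1000 (integer arithmetic).
# biased_target_khz is a hypothetical helper name used for illustration.
biased_target_khz() {
    echo $(( $1 * (1000 - $2) / 1000 ))
}
```

For instance, biased_target_khz 2997000 100 applies a 10 percent reduction and prints 2697300.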

The sampling_rate, measured in microseconds, determines how often the governor will look at the processor utilization so that it can determine which frequency to set. This setting must be set to a value in between the values of sampling_rate_min and sampling_rate_max.
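
One way to respect that range is to read the two limits before writing. The following is a sketch only: set_sampling_rate and its GOVERNOR_DIR argument are assumed names, and the default directory is the ondemand path from Listing 10.

```shell
# set_sampling_rate RATE_US [GOVERNOR_DIR]
# Write RATE_US (microseconds) to sampling_rate only if it lies between
# sampling_rate_min and sampling_rate_max in the same directory.
# The function name and the GOVERNOR_DIR argument are illustrative assumptions.
set_sampling_rate() (
    rate=$1
    gov_dir=${2:-/sys/devices/system/cpu/cpu0/cpufreq/ondemand}
    min=$(cat "$gov_dir/sampling_rate_min") || exit 1
    max=$(cat "$gov_dir/sampling_rate_max") || exit 1
    if [ "$rate" -lt "$min" ] || [ "$rate" -gt "$max" ]; then
        echo "sampling_rate must be between $min and $max microseconds" >&2
        exit 1
    fi
    echo "$rate" > "$gov_dir/sampling_rate"
)
```

Passing the conservative directory as the second argument would apply the same check to that governor's sampling_rate.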

Lastly, the up_threshold setting allows the user to change the max processor utilization threshold that triggers a change in processor frequencies. By default the up_threshold value is 80. This means that every sampling_rate, the kernel will check the processor utilization and if it is above 80 percent utilized, the governor will increase the frequency to the maximum frequency available.

Conservative governor

If you load the conservative governor, you will see a directory called conservative in the cpufreq directory. Inside this directory are many tunable settings. All of the writable (by root) files can be changed by echoing in the new value as shown previously. Note that any changes to the conservative settings will be applied systemwide so you will not need to change the setting for each processor.

Listing 11. Checking tunable settings for conservative
[root@systemx ~]# cd /sys/devices/system/cpu/cpu0/cpufreq/conservative/
[root@systemx conservative]# ls -l
total 0
-rw-r--r-- 1 root root 4096 Nov 19 11:31 down_threshold
-rw-r--r-- 1 root root 4096 Nov 19 11:31 freq_step
-rw-r--r-- 1 root root 4096 Nov 19 11:31 ignore_nice_load
-rw-r--r-- 1 root root 4096 Nov 19 11:31 sampling_down_factor
-rw-r--r-- 1 root root 4096 Nov 19 11:31 sampling_rate
-r--r--r-- 1 root root 4096 Nov 19 11:31 sampling_rate_max
-r--r--r-- 1 root root 4096 Nov 19 11:31 sampling_rate_min
-rw-r--r-- 1 root root 4096 Nov 19 11:31 up_threshold

The ignore_nice_load, sampling_rate, sampling_rate_max, sampling_rate_min, and up_threshold settings are the same settings described earlier with the ondemand governor.

The conservative governor also allows the user to set the down_threshold, the counterpart of up_threshold. By default, down_threshold is set to 20. This means that at every sampling_rate interval, the kernel checks the processor utilization, and if it is below 20 percent, the governor decreases the frequency.

The freq_step setting changes the size of the frequency step the governor uses to change CPU frequency in either direction. By default this value is set to 5 which means the governor will change the frequency by 5 percent of the maximum or minimum frequency each time it makes a decision to change frequencies. If you set this value to 100, the governor will act exactly like the ondemand governor.
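
To make the interaction of up_threshold, down_threshold, and freq_step concrete, here is a simplified model of a single decision. It is a sketch of the behavior as described in this article, not the kernel's code: it assumes the step is a percentage of the maximum frequency, uses the default values mentioned above (80, 20, and 5), and ignores the fact that real processors offer only discrete frequencies. The name conservative_next_khz is hypothetical.

```shell
# conservative_next_khz UTIL CUR_KHZ MIN_KHZ MAX_KHZ [UP] [DOWN] [FREQ_STEP]
# Simplified model of one conservative-governor decision.
# Assumption: the step is FREQ_STEP percent of MAX_KHZ.
conservative_next_khz() (
    util=$1 cur=$2 min=$3 max=$4
    up=${5:-80} down=${6:-20} step_pct=${7:-5}
    step=$(( max * step_pct / 100 ))
    if [ "$util" -gt "$up" ]; then
        next=$(( cur + step ))      # busy: step the frequency up
    elif [ "$util" -lt "$down" ]; then
        next=$(( cur - step ))      # mostly idle: step the frequency down
    else
        next=$cur                   # between thresholds: no change
    fi
    # Keep the result inside the allowed range
    if [ "$next" -gt "$max" ]; then next=$max; fi
    if [ "$next" -lt "$min" ]; then next=$min; fi
    echo "$next"
)
```

With a range of 1998000 to 2997000 kHz, conservative_next_khz 90 1998000 1998000 2997000 prints 2147850, one 5 percent step above the minimum. Passing 100 as the last argument makes a single step span the whole range, which is why that setting resembles ondemand.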

Lastly, the sampling_down_factor works as a multiplier with the sampling_rate to lessen how often the processor utilization is sampled. For example, if the sampling_rate was set to 10,000 and the sampling_down_factor was set to 2, the kernel would sample the processor utilization every 20,000 microseconds.

Scheduler tunables

Now we'll examine the two scheduler tunables:

  • sched_mc_power_savings for scheduling processes on cores.
  • sched_smt_power_savings for scheduling processes on hyperthreads on a core.

sched_mc_power_savings

The sched_mc_power_savings is a scheduler tunable located in the /sys/devices/system/cpu/ directory. Don't forget to set the CONFIG_SCHED_MC config file option to y as discussed in the setup section (from Part 1) if you want to use this tunable.

Listing 12. Checking location of sched_mc_power_savings
[root@systemx ~]# cd /sys/devices/system/cpu/
[root@systemx cpu]# ls -l
total 0
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu0
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu1
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu2
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu3
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu4
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu5
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu6
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu7
-rwxrwxr-x 1 root root 4096 Nov 19 09:54 sched_mc_power_savings

The sched_mc_power_savings file can be set to 0 or 1; 0 is the default. When it is set to 1, the scheduler tries to schedule processes on as few cores as possible so that the others can go idle. In other words, if all the processors are a little busy, sched_mc_power_savings tries to consolidate the work onto the fewest number of processors possible. This in turn allows some processors to be idle for longer which saves power, especially if the processor supports some sort of deep-sleep state like C states where it draws very little power when idle. The actual power savings can vary depending on many factors, including how many processors are available and which CPUfreq governor is running. When sched_mc_power_savings is set to 0, no special scheduling is done.

sched_smt_power_savings

The sched_smt_power_savings tunable is another scheduler tunable also located in the /sys/devices/system/cpu/ directory; however, this tuning option is available only on systems that support hyperthreading. Don't forget to set the CONFIG_SCHED_SMT config file option to y as discussed in the setup section (in Part 1) if you want to use this tunable.

Listing 13. Checking location of sched_smt_power_savings
[root@systemx ~]# cd /sys/devices/system/cpu/
[root@systemx cpu]# ls -l
total 0
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu0
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu1
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu2
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu3
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu4
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu5
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu6
drwxr-xr-x 5 root root 0 Nov 12 17:45 cpu7
-rwxrwxr-x 1 root root 4096 Nov 19 09:54 sched_mc_power_savings
-rwxrwxr-x 1 root root 4096 Nov 19 09:54 sched_smt_power_savings

Similar to the sched_mc_power_savings setting, the sched_smt_power_savings file can be set to 0 or 1; 0 is the default. When it is set to 1, the scheduler tries to schedule processes to as few hyperthreads on a core as possible so that the others can go idle and in turn save power through idle C states.
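
Both tunables are set by writing 0 or 1 into the corresponding file. The sketch below writes the same value to whichever of the two files exists, since sched_smt_power_savings is not present on every system. The function name set_sched_power_savings and its CPU_ROOT argument are assumptions for illustration; run it as root.

```shell
# set_sched_power_savings VALUE [CPU_ROOT]
# Write VALUE (0 or 1) to sched_mc_power_savings and sched_smt_power_savings
# when they exist, and print the resulting setting of each.
# The function name and the CPU_ROOT argument are illustrative assumptions.
set_sched_power_savings() (
    value=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    case $value in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; exit 1 ;;
    esac
    for name in sched_mc_power_savings sched_smt_power_savings; do
        file=$cpu_root/$name
        if [ -e "$file" ]; then
            echo "$value" > "$file" || exit 1
            echo "$name=$(cat "$file")"
        else
            echo "$name is not available on this system" >&2
        fi
    done
)
```

For example, set_sched_power_savings 1 enables the power-saving scheduling on every tunable that is present, and set_sched_power_savings 0 returns to the default.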

Next time

In Part 3, I will discuss the effects each of the governors can have on different workloads using two popular configuration workloads.
