RHEL-RT AffinityHowto
Command Line and Graphical User Interfaces
This page describes how command-line tools can be used to adjust processor affinity and other values. There is also a graphical interface for adjusting the affinity and scheduler settings of processes and IRQ threads. See the Tuna GUI page for more information.
IRQ (interrupt) binding
Given that realtime environments need to minimize or eliminate latency in responding to various events, it is suggested that interrupts and user processes be isolated from one another on different dedicated cpus, if possible.
- In practice we have found that the optimal configuration is entirely application specific. For example, when tuning financial services applications that perform similar functions for different companies, the optimal tunings were completely different.
- For one firm, isolating 2 out of 4 CPUs for operating system functions and interrupt handling, while dedicating the remaining 2 CPUs purely to the application, was optimal.
- For another firm, binding the network-related application processes onto the CPU that was handling the network device driver interrupt yielded optimal determinism.
- Ultimately, tuning is often accomplished by iterations of trying a variety of settings.
By dedicating a CPU to an interrupt source we can be sure that when an interrupt arrives from this device it will not be delayed by other processing happening on this CPU. Also, the code and data structures needed to process this interrupt will have the highest possible likelihood of being in the processor data and instruction caches. This can be particularly important in cases such as high performance networking, where the speeds involved are at the limits of the available memory and peripheral bus bandwidth. Here, any wait for memory to be fetched into processor caches will have a noticeable impact on overall processing time and determinism. As more kernel subsystems become tuned for low latency, the benefits of IRQ binding will be greater.
On multi-core systems, this can be accomplished by:
- Disabling the irqbalance daemon. This daemon, which is enabled by default, periodically redistributes interrupts across CPUs in an even, fair manner. In realtime deployments, however, you typically bind applications and interrupts to specific CPUs, hence the recommendation to disable irqbalance. The following steps disable irqbalance:
- # service irqbalance stop
- # chkconfig irqbalance off
- An alternative approach is to keep irqbalance running and tell it to honor the CPUs passed through the kernel command line "isolcpus=" parameter. Edit the file /etc/sysconfig/irqbalance and enable FOLLOW_ISOLCPUS. irqbalance will then balance interrupts only across the remaining CPUs. This makes more sense on machines with more processors/cores, but works just fine on a two core/cpu machine.
- TBD: Need to add info here on IRQBALANCE_BANNED_CPUS env var enhancement in irqbalance.
- Manually assigning cpu affinity to each irq:
To see which IRQ each of your devices is on (see the example at the bottom of this page):
# cat /proc/interrupts
To isolate CPUs from device IRQ processing you need to use the following interface:
# ls -l /proc/irq/*/smp_affinity
-rw------- 1 root root 0 2007-11-09 10:57 /proc/irq/0/smp_affinity
-rw------- 1 root root 0 2007-11-09 10:57 /proc/irq/10/smp_affinity
-rw------- 1 root root 0 2007-11-09 10:57 /proc/irq/11/smp_affinity
The value is a hexadecimal bitmask of CPUs; ff means that all CPUs (on a system with up to 8 CPUs) can process this IRQ if possible. A change to the mask does not take effect immediately: it is applied only when the next interrupt happens, which may seem confusing at first.
Bind specific interrupt handlers by setting a mask that indicates which CPUs should service the IRQ. This can be done at runtime with:
# echo 1 > /proc/irq/1/smp_affinity
# echo 2 > /proc/irq/2/smp_affinity
# echo 4 > /proc/irq/3/smp_affinity
etc.
Note that which cpus handle which irqs (or sets of irqs) will vary based on hardware configuration and application workload.
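The mask values used above (1, 2, 4) can be derived rather than memorized. The following is a minimal sketch of that arithmetic; cpus_to_mask is a hypothetical helper name introduced here for illustration and is not part of any package. CPU numbers are assumed to be zero-based, as in /proc/interrupts.

```shell
#!/bin/sh
# cpus_to_mask (hypothetical helper): print the hexadecimal affinity mask
# that selects the zero-based CPU numbers given as arguments.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        # Set the bit whose position equals the CPU number.
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0          # prints 1 (CPU 0 only)
cpus_to_mask 2          # prints 4 (CPU 2 only)
cpus_to_mask 0 1 2 3    # prints f (all CPUs of a 4-CPU system)
```

As root, the result could then be written to an IRQ's smp_affinity file in the same way as the echo commands above.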
- Isolating processes to cpus with the taskset utility
# bind a task to the fourth CPU (remember that the argument is a CPU mask, not a CPU number)
taskset 8 /usr/local/bin/my_embedded_process
# or bind already running processes (by PID) to cpu0 and cpu1
taskset -p 1 PID#1
taskset -p 2 PID#2
etc.
Note that the above examples assume a 4 CPU system in which the first, second and third CPUs (masks 1, 2 and 4) handle IRQs 1, 2 and 3 respectively, while the fourth CPU (mask 8) runs a user space process. The objective of this example is to ensure that, whenever possible, a CPU does not handle both processes and interrupts. Isolating interrupts from processes reduces the likelihood of cache misses in the interrupt handling code.
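Because taskset takes a mask rather than a CPU number, it can help to decode a mask back into CPU numbers before applying it. The sketch below does that; mask_to_cpus is a hypothetical helper name, CPU numbers are zero-based, and the sketch assumes a plain hexadecimal mask without the comma-separated groups used on very large systems.

```shell
#!/bin/sh
# mask_to_cpus (hypothetical helper): print the zero-based CPU numbers
# selected by a hexadecimal affinity mask such as those given to taskset.
mask_to_cpus() {
    mask=$(( 0x$1 ))
    cpu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        # If the lowest bit is set, the current CPU is part of the mask.
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "$out"
}

mask_to_cpus 8    # prints 3 (the fourth CPU, counting from 0)
mask_to_cpus 3    # prints 0 1
```

The affinity currently applied to a running process can be read back with taskset -p followed by the PID.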
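You can also change the scheduling policy of a process to realtime, with a priority in the range of 1 to 99, when you move it. For a process that is already running, apply chrt and taskset to the PID in turn:
chrt -f -p 1 PID#1
taskset -p 8 PID#1
When starting a new program, the two can be combined in a single command:
taskset 8 chrt -f 1 /usr/local/bin/my_embedded_process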
The default_affinity kernel command line option can be used to specify on which CPUs the init process and its children will be allowed to run. It is set by the system, using the taskset utility, in early bootup. The mask specified will thus be used for the system processes, while the CPUs not in the mask are left reserved for special applications. These applications will start with their parent's affinity, and shortly afterwards change their affinity mask to use the reserved CPUs. Please refer to the bootloader documentation on how to specify kernel command line options for one session or permanently.
The isolcpus kernel command line option can be used to specify a set of CPUs to isolate from the general scheduler. The remaining CPUs will then be used by the init process and its children, which prevents unbound processes from unintentionally running on the dedicated CPUs. The CPUs specified are thus reserved for special applications. These applications will start with their parent's affinity, and shortly afterwards change their affinity mask to use the reserved CPUs. Please refer to the bootloader documentation on how to specify kernel command line options for one session or permanently.
The syntax for isolcpus is:
isolcpus=<CPU number>,...,<CPU number>
or
isolcpus=<CPU number>-<CPU number>
or a mixture:
isolcpus=<CPU number>,...,<CPU number>-<CPU number>
The first CPU is 0, the last is number of CPUs minus one.
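To illustrate how the three forms of the syntax are read, the sketch below expands an isolcpus-style list into individual CPU numbers. expand_cpulist is a hypothetical helper name used only for illustration; it assumes well-formed input and performs no validation.

```shell
#!/bin/sh
# expand_cpulist (hypothetical helper): expand an isolcpus-style list such
# as "1,3-5" into the individual zero-based CPU numbers it names.
expand_cpulist() {
    out=""
    # Split the list on commas, then handle each single CPU or range.
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;
            *)   first=$part; last=$part ;;
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            out="$out${out:+ }$cpu"
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%s\n' "$out"
}

expand_cpulist 2-3      # prints 2 3
expand_cpulist 1,3-5    # prints 1 3 4 5
```

After a reboot, the options the kernel was actually started with can be checked by reading /proc/cmdline.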
Important Note: In MRG Realtime much of the conventional interrupt processing is offloaded to separately schedulable kernel threads. This is done to prevent long-running interrupt service routines from becoming a source of non-determinism. If you bind interrupts to specific CPUs, it is also imperative that you bind the corresponding kernel threads that perform the so-called soft interrupt handling. For example, if you have bound network interrupts to CPU 2, then you should also bind the corresponding network send and receive threads. Refer to RHEL-RT SchedPrioHowto for details.
Related manual pages
chrt(1), taskset(1), nice(1), renice(1)
See sched_setscheduler(2) for a description of the Linux scheduling scheme.
Example
In this example we'll be using a source of interrupts that can be easily generated, namely the disk controller. We can generate lots of interrupts by reading files from the disks, which lets us show that the interrupts are isolated as commanded. The same principles apply to any other device that generates interrupts.
On a dual dual-core Xeon box (four cores in total):
/proc/interrupts has this format:
[root@mica ~]# head -2 /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        777          0          0          0    IO-APIC-edge  timer
[root@mica ~]#
Each line has the IRQ number, followed by the number of interrupts that happened in each CPU, followed by the IRQ type (IO-APIC-edge) and IRQ description (timer).
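Given that layout, the per-CPU counters for one IRQ can be pulled out with a short awk filter. This is a minimal sketch: irq_counts is a hypothetical helper name, and it assumes the first line of its input is the CPU header row, as in /proc/interrupts.

```shell
#!/bin/sh
# irq_counts (hypothetical helper): read /proc/interrupts-style text on
# stdin and print the per-CPU interrupt counts for the given IRQ number.
irq_counts() {
    awk -v irq="$1:" '
        NR == 1 { ncpu = NF; next }    # header row: one column per CPU
        $1 == irq {
            line = $2
            for (i = 3; i <= ncpu + 1; i++) line = line " " $i
            print line
        }'
}

# Sample input copied from the listing above.
sample='           CPU0       CPU1       CPU2       CPU3
  0:        777          0          0          0    IO-APIC-edge  timer'
printf '%s\n' "$sample" | irq_counts 0    # prints 777 0 0 0
```

On a live system the same function could be fed from the real file, for example irq_counts 142 < /proc/interrupts.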
Let's focus on IRQ 142:
[root@mica ~]# cat /proc/interrupts | grep 142
142: 4573 1339 4561 1346 IO-APIC-fasteoi megasas
That is the megasas PERC SCSI card. Its interrupts are being distributed across the 4 cores. Now let's limit this to the first core:
[root@mica ~]# echo 1 > /proc/irq/142/smp_affinity
Now generate some disk activity, then re-examine /proc/interrupts:
[root@mica ~]# cat /proc/interrupts | grep 142
142: 52009 1339 4561 1346 IO-APIC-fasteoi megasas
Notice that all the new interrupts were serviced by the first core, as requested: only the CPU0 counter has increased. This reduces contention for CPU time on the other CPUs, where high priority application tasks could have been bound.
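The comparison above can be made explicit by subtracting the two sets of counters. The sketch below does this for the numbers shown in the example; irq_delta is a hypothetical helper name, and each argument is assumed to be a space-separated list of per-CPU counts in the same CPU order.

```shell
#!/bin/sh
# irq_delta (hypothetical helper): given two space-separated lists of
# per-CPU interrupt counts (before and after), print the per-CPU increase.
irq_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { n = NF; for (i = 1; i <= n; i++) before[i] = $i }
        NR == 2 {
            for (i = 1; i <= n; i++) {
                d = $i - before[i]
                line = (i == 1) ? d "" : line " " d
            }
            print line
        }'
}

# Counters for IRQ 142 before and after restricting it to the first core.
irq_delta "4573 1339 4561 1346" "52009 1339 4561 1346"    # prints 47436 0 0 0
```

A non-zero value in any column other than the first would indicate that the new affinity mask had not (yet) taken effect.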
Editorial Comments
Add more info, such as:
- default_affinity and isolcpus are both described, but isolcpus is the preferred way of isolating CPUs, as it takes effect even before init starts. Remove the description of default_affinity? isolcpus has been present since 2.6.9-rc2, and it has been verified to be documented in the RHEL4 kernel too.
- There also is a script called exodus that allows you to do this on a running system (taskset all running userspace pids). The exodus script is distributed with the rest of the realtime packages.
- The idea would be to use exodus to test out possibilities, then use default_affinity on the command line plus taskset in your app's init script for production.
- Ingo mentioned there are additional boot options for cpu shielding - get more of this info. It's called isolcpus, and it's documented in Documentation/kernel-parameters.txt. It should be used where default_affinity was being used.
- Write an example using the perlscript in the schedstat documentation showing that the isolated CPUs are indeed not being used to schedule processes other than the ones reserved for this CPU. This will involve using the kernel-rt-debug package, where SCHEDSTATS is enabled.
- Here's some good summary info from Rod's slides - this might make a good intro at the top of the page, then detail it below....
- First, clear off some processors
- using the exodus script
- Then run your app on the clear CPUs
- sched_setaffinity(2) taskset(1)
- Build it into production
- default_affinity or isolcpus on cmdline
- Taskset your app in its initscript
