RHEL-RT SchedPrioHowto
Setting realtime scheduler priorities - MRG Realtime HOWTO
The realtime kernel allows fine-grained control of scheduler priorities. These capabilities even allow application-level programs to be scheduled at a higher priority than kernel threads. This is a double-edged sword: it is possible to cause system hangs and other unpredictable behavior if crucial kernel threads are prevented from running as needed. In short, it is easy to shoot yourself in the foot. The purpose of this note is to provide information that will help you tune your system correctly - and preserve all your toes. Ultimately, the correct settings are workload dependent.
Realtime priorities range from a low of 1 to a high of 99; priority 0 is used by the non-realtime policies such as SCHED_NORMAL. Among runnable realtime processes, the one with the highest priority runs first.
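The priority range can be confirmed on a running system. The following is a minimal Python sketch, assuming a Linux host and Python 3.3 or later. The helper name rt_priority_range is hypothetical; it only wraps the standard os.sched_get_priority_min and os.sched_get_priority_max calls.

```python
import os


def rt_priority_range(policy=os.SCHED_FIFO):
    """Return (lowest, highest) static priority the kernel accepts for a policy."""
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)


if __name__ == "__main__":
    low, high = rt_priority_range()
    print(f"SCHED_FIFO priorities: {low}..{high}")
    # Non-realtime policies accept only one static priority value.
    print("SCHED_OTHER priorities:", rt_priority_range(os.SCHED_OTHER))
```

Querying the kernel rather than hard-coding the limits keeps a tuning script honest if it is ever run on a system with different bounds.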
The following is an example command which displays the current priorities of the various kernel threads:
service rtctl status
2 TS - [kthreadd]
3 FF 99 [migration/0]
4 FF 99 [posix_cpu_timer]
5 FF 50 [softirq-high/0]
6 FF 50 [softirq-timer/0]
7 FF 90 [softirq-net-tx/]
8 FF 90 [softirq-net-rx/]
9 FF 50 [softirq-block/0]
10 FF 50 [softirq-tasklet]
11 FF 50 [softirq-sched/0]
12 FF 50 [softirq-hrtimer]
13 FF 50 [softirq-rcu/0]
14 FF 99 [watchdog/0]
15 TS - [desched/0]
16 FF 99 [migration/1]
17 FF 99 [posix_cpu_timer]
18 FF 50 [softirq-high/1]
19 FF 50 [softirq-timer/1]
In the output fragment above, PIDs 2 and 15 use the SCHED_NORMAL policy (shown as TS), while all the rest use SCHED_FIFO (shown as FF) with realtime priorities.
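When many threads are involved, it can help to turn this output into structured data. The following Python sketch parses lines in the "PID CLASS RTPRIO [name]" layout shown above. The function name parse_status_line and the dictionary keys are hypothetical, and the layout itself is assumed from the sample output rather than from any documented rtctl format.

```python
import re

# Scheduling class codes as printed in the listing, mapped to policy names.
POLICY_NAMES = {"TS": "SCHED_NORMAL", "FF": "SCHED_FIFO", "RR": "SCHED_RR"}

# PID, class code, realtime priority (or "-"), thread name in brackets.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+|-)\s+\[(.+)\]\s*$")


def parse_status_line(line):
    """Parse one status line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    pid, cls, prio, name = match.groups()
    return {
        "pid": int(pid),
        "policy": POLICY_NAMES.get(cls, cls),  # unknown codes pass through unchanged
        "rtprio": None if prio == "-" else int(prio),
        "name": name,
    }


if __name__ == "__main__":
    sample = ["2 TS - [kthreadd]", "3 FF 99 [migration/0]", "7 FF 90 [softirq-net-tx/]"]
    for entry in map(parse_status_line, sample):
        print(entry)
```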
There is a system startup script called rtctl which initializes the default priorities of kernel threads. The following is a description of these threads and the rationale for their default priorities:
- watchdog - 99
- There is a diagnostic utility, rt_watchdog, whose purpose is to take corrective action in the event of a system hang. It must run at the highest priority to be effective.
- migration - 99
- The Linux kernel supports SMP (Symmetric MultiProcessing) on modern systems. In order to facilitate this, it has support for process (task) migration from one CPU (or set thereof) to another. There are two ways in which a process (task) may get migrated - synchronously when the scheduler is starved of some process (task) to run on a given CPU, and asynchronously in order to achieve better overall balance and system throughput. The migration threads (one per CPU) support this latter approach. Periodically, they will rebalance the active scheduler runqueues (the queues of processes (tasks) that are running on a given CPU). Priority here should reflect the importance of balancing and migrating running processes (tasks). These threads should run often enough to prevent imbalance, but not to the detriment of a critical real-time task that should not be blocked by a migration operation.
- posix_cpu_timer - 99
- What is this?
- HardIRQs
- These are threads that represent the real hardware interrupts coming into the system from physical devices. In the RT kernel, these threads are used to provide a schedulable context for the hardware interrupts themselves. Priorities should reflect the relative importance of a given IRQ - bearing in mind that modern system architectures often have many devices sharing a single physical hardware IRQ "line". For example, in a critical networking system, it is important that the interrupt coming from the networking MAC (the network card) be handled with sufficient priority that other tasks relying upon it are not blocked from making progress. Reality is complicated somewhat by the nature of modern GigE network MACs/cards, in which the card switches to a "polling" mode at high interrupt counts, but it is still important to be aware of the relative importance of different devices, and of their corresponding interrupt threads, to overall system progress.
- SoftIRQs
- These are software representations of "interrupts" (bottom halves in older terminology). They are used as a software mechanism for scheduling some kind of processing that must happen at some point in the (near) future. Typically, softirqs are used by physical hardware device interrupt handler code in order to schedule some kind of post-interrupt processing, and similar operations. In the RT kernel, there is (again) a schedulable context for such softirqs - and some specialization to handle very high priority softirqs (e.g. splitting out some of the networking-related softirqs into other special threads). Priorities of softirqs will require some consideration of overall system progress, but generally should be set high enough that they won't block for a long time.
- softirq-net-tx - 90
- Network transmit handling (special case of softirq thread)
- Setting this to a high value such as 98 approximates the RHEL5 standard networking behavior
- softirq-net-rx - 90
- Network receive side handling (special case of softirq thread)
- Setting this to a high value such as 98 approximates the RHEL5 standard networking behavior
- softirq-hrtreal - 92
- What is this? (HRT timer related).
- softirq-hrtmono - 92
- What is this? (monotonic HRT related).
- Are there others which aren't realtime specific which should be taken into
consideration? Such as ext3 journaling, kswapd?
You may wonder why these kernel threads are placed so high in the priority range by default. The rationale is to have the default priorities integrate well with the requirements of realtime Java - RTSJ. The RTSJ requires a range of priorities from 10-89. For this reason the above mentioned kernel thread priorities are positioned at 90 and above - to avoid unpredictable behavior should a long-running Java application block essential system services from running.
For deployments where realtime Java is not in use, there is a wide range of scheduling priorities below 90 which are at the disposal of applications. It is usually dangerous for user level applications to run at priority 90 and above - despite the fact that the capability exists. Preventing essential system services from running can result in unpredictable behavior such as:
- blocking network traffic
- preventing virtual memory paging
- data corruption due to blocking filesystem journaling
The main point is that extreme caution should be used when scheduling any application thread above priority 89. Any application thread scheduled above priority 89 should run only a very short codepath at that priority. Failure to ensure this would undermine the low latency capabilities of the realtime kernel.
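The guidance above can be encoded as a small sanity check to run before an application requests a priority. This is only a sketch: the function name check_app_priority and the wording of the returned messages are hypothetical, and the thresholds are the ones stated in this note (RTSJ range 10-89, essential kernel threads at 90 and above).

```python
RTSJ_LOW, RTSJ_HIGH = 10, 89   # priority range required by the RTSJ, per this note
KERNEL_FLOOR = 90              # default kernel thread priorities start here, per this note
RT_MIN, RT_MAX = 1, 99         # static priorities Linux accepts for SCHED_FIFO/SCHED_RR


def check_app_priority(prio):
    """Return (ok, reason) for an application thread requesting realtime priority prio."""
    if not RT_MIN <= prio <= RT_MAX:
        return False, f"{prio} is outside the realtime range {RT_MIN}-{RT_MAX}"
    if prio >= KERNEL_FLOOR:
        return False, f"{prio} competes with kernel threads at {KERNEL_FLOOR} and above"
    if RTSJ_LOW <= prio <= RTSJ_HIGH:
        return True, f"{prio} is inside the RTSJ range {RTSJ_LOW}-{RTSJ_HIGH}"
    return True, f"{prio} is below the RTSJ range"


if __name__ == "__main__":
    for prio in (5, 50, 89, 90, 99):
        print(prio, check_app_priority(prio))
```

A check like this does not make priorities of 90 and above impossible to use; it only flags them so that the decision is deliberate.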
(Question: what if they use cpu pinning and interrupt binding - such that an individual cpu was only running application, not interrupts or general kernel functions. In this case is it ok to run the application on its dedicated cpu at priority higher than 89? In this case, is it ok if the application runs unpreemptably for long durations?)
(JCM-Answer: You still have some per-CPU threads, such as the migration stuff. But yes, other than what you have to have, you can basically tie up a single CPU with a single task. I've done this by doing effectively a chrt mask on the init task and forcing everything to leave one CPU alone, then running a critical real time task on that CPU with a very high priority. You can't get away from having a couple of kernel threads to contend with, though. Actually, I've never tried e.g. telling a migration thread for a CPU to use a new bitmask forcing it off that CPU. I assume that doesn't work...and bad things would happen anyway).
(Question: JCM notes: Oh, we also need a call on the kswapd situation - I mean, what's our advice to customers about swap. I'm assuming we recommend everything here is mlocked in place and give other advice appropriate to avoid the system needing to swap out anything remotely "real time" to disk.)
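The CPU pinning discussed in the notes above can be sketched from inside the application itself. The following Python example, assuming Linux and Python 3.3 or later, restricts the calling process to a single CPU and then attempts to switch it to SCHED_FIFO. The function names are hypothetical. It does not move other tasks or interrupts off that CPU, which would need separate steps, and per-CPU kernel threads remain in place as noted in the answer above.

```python
import os


def pin_to_single_cpu(cpu=None, pid=0):
    """Restrict pid (0 = calling process) to one CPU; return the previous CPU set."""
    previous = os.sched_getaffinity(pid)
    if cpu is None:
        cpu = max(previous)  # arbitrary choice: highest-numbered CPU currently allowed
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(pid, {cpu})
    return previous


def try_set_fifo(priority, pid=0):
    """Attempt to switch pid to SCHED_FIFO; return False if permission is denied."""
    try:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except PermissionError:
        return False


if __name__ == "__main__":
    old_cpus = pin_to_single_cpu()
    print("now restricted to CPU(s):", sorted(os.sched_getaffinity(0)))
    # Priority 80 is an illustrative value below the kernel threads at 90 and above.
    print("SCHED_FIFO granted:", try_set_fifo(80))
    os.sched_setaffinity(0, old_cpus)  # restore the original affinity mask
```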
Setting Real-time Priority for Non-root Users
You can enable real-time priority use by non-root users and groups by editing the /etc/security/limits.conf file. See the limits.conf man page for more information. You'll need to set both the hard and soft limits for the rtprio item. You can do this on one line by using - as the type.
Here is an example line that enables real-time priority use for members of the 'jack' group:
@jack - rtprio 89
Note that the changes will not affect current processes or their children. If you have modified the settings for your user or one of your groups, you will need to log out and back in before the changes will take effect.
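After logging back in, the effective limit can be checked from a program. The following Python sketch reads the rtprio limit through the standard resource module and tests a requested priority against the soft limit. The helper name rtprio_allowed is hypothetical, and the check models only unprivileged processes; root and other privileged processes are not bound by this limit.

```python
import resource


def rtprio_allowed(priority, soft_limit):
    """True if an unprivileged process with this soft rtprio limit may request priority."""
    if priority < 1:
        return False  # realtime priorities start at 1
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return priority <= soft_limit


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_RTPRIO)
    print(f"rtprio soft={soft} hard={hard}")
    print("priority 89 allowed:", rtprio_allowed(89, soft))
```

With the example limits.conf line above in effect, a member of the 'jack' group would be expected to see both limits reported as 89.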
Use Case Example - Signaltest
There is a testing utility called signaltest which is useful for demonstrating the realtime system behavior. The following link is to a whitepaper which illustrates the effects of assigned realtime priorities.
