RHEL-RT LatencyTracerHowto
From RHEL-RT
Latency Tracer Usage
The realtime kernel comes with instrumentation, referred to as the latency tracer which is used identify cases where a desired maximum latency is being exceeded. The latency tracer can be considered a peak detector as once enabled it will display information on the longest running non-preemptable kernel codepath. As this runs, when longer non-preemptable kernel codepaths are encountered it will continue to identify those codepaths.
This tool is particularly useful in isolating whether observed non-determinism is caused by the Linux kernel, vs glibc or user space applications. For example, suppose the observed response time for an overall transaction was 200ms. By using the latency tracer if you see that the longest kernel codepath is 20ms then its an application or library issue. Since the latency tracer is a kernel feature it is primarily useful in diagnosing sources of kernel non-determinism. It is not useful for distinguishing where specifically in libraries or application programs the bottlenecks are occurring in. To do that you use the standard user space profiling tools.
This howto will help you use the latency tracer capabilities to pinpoint with increasing level of details where such long latency codepaths occur.
The latency tracer feature is enabled in the trace variant of the realtime kernel. It is not enabled in the production version of the realtime kernel (to avoid the extra overhead). It is also not enabled in the debug variant of the realtime kernel due to the significative overhead the debug code brings and the negative effect that would have in latency measurement.
(editorial note: Describe the % overhead this instrumentation costs.)
Note: the latency tracer capability is not present in the standard RHEL5
kernel - as we introduced it upstream after the RHEL5 branch. Unfortunately, we are unable to add this feature in a minor update release of RHEL5 as that would necessitate kernel ABI incompatibility. Additionally, this is to large and invasive a change to introduce in a maintenance update.
The /proc/sys/kernel/preempt_max_latency knob can be used to specify the
maximum latency trigger, and it can be reset at any time, without the need
to reboot the system, it starts with a very high value, effectively making it
silent:
[acme@mica ~]$ cat /proc/sys/kernel/preempt_max_latency 4398046511
The tracing infrastructure keeps track of the longest latency codepath. Each time a new codepath of longer latency is encountered, the corresponding tracing information will be logged. Latencies of equal or lower duration will not be displayed. The main objective is to identify the highest latency codepath - not to display latencies of all codepaths.
For example, to start a new session, where we are interested in discovering the maximum latency in the system:
[root@mica ~]# echo 0 > /proc/sys/kernel/preempt_max_latency
This tells the tracing infrastructure that anything that is over the value set into /proc/sys/kernel/preempt_max_latency should be recorded and then this file will be modified to have the maximum latency experienced, i.e. only new "events" that create a latency larger than the previous maximum latency will be recorded. By designating a value of 0 in this example, that designates no minimum trigger and a series of latency events will be logged until a true maximum is encountered. This won't flood the system logs, usually it just emits about 10 entries until it gets to a maximum overall, i.e. the maximum latency experienced in the system.
In this context, an "event" is defined as any code sequence,
whether it originated from a system call, an interrupt, or anything that
disables preemption. (editorial note: can that be a user application process?)
If the the goal is to record latencies over 10us, just set
/proc/sys/kernel/preempt_max_latency to 10. Then when a latency event that
takes more than 10us is experienced it will be recorded in the system logs and
can be viewed in /proc/latency_trace, as will be demonstrated later in the
document.
The routines associated with the latency tracing logging are excluded
from being measured so that they do not appear in the output.
As a result of latency tracing events triggering, corresponding messages
will appear in the system log:
[root@mica ~]# dmesg | grep maximum-latency ( IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup. ( softirq-sched/2-36 |#2): new 3 us maximum-latency wakeup. ( softirq-timer/1-18 |#1): new 5 us maximum-latency wakeup. ( softirq-timer/1-18 |#1): new 6 us maximum-latency wakeup. ( softirq-sched/0-10 |#0): new 6 us maximum-latency wakeup. ( softirq-sched/0-10 |#0): new 6 us maximum-latency wakeup. ( softirq-sched/2-36 |#2): new 7 us maximum-latency wakeup. [root@mica ~]#
Here we see the names of the processes where the new maximum latency
triggered - as well as successive higher latency codepath triggers.
Further information describing the latency tracer fields appearing in the system log:
( IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup. IRQ-8406 Process Name 2489 Process ID #2 CPU
In the file /proc/latency_trace we can find much more information about the
maximum latency event. Note: this /proc/latency_trace capability introduces system overhead and is therefore not enabled by default in the standard realtime kernel. To utilize this capability, you must be running the kernel-rt-trace kernel.
[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
latency: 7 us, #2/2, CPU#2 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
-----------------
| task: softirq-sched/2-36 (uid:0 nice:0 policy:1 rt_prio:50)
-----------------
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
<...>-36 2D..1 7us : thread_return (thread_return)
[root@mica ~]#
Dissecting the header:
latency: 7 us - maximum latency event thusfar
#2 - Number of entries in this event (editorial: describe these
2 fields in more detail)
/2 - Number of entries + underruns and overruns
CPU#2 - CPU where this event took place
M:rt - Type of preemption enabled in this kernel build, can
be one of:
server(CONFIG_PREEMPT_NONE)
desktop(CONFIG_PREEMPT_VOLUNTARY)
preempt(CONFIG_PREEMPT_DESKTOP)
rt(CONFIG_PREEMPT_RT)
VP:0 - always 0
KP:0 - always 0
SP:1 - softirq preemption (CONFIG_PREEMPT_SOFTIRQS)
HP:1 - hardirq preemption (CONFIG_PREEMPT_HARDIRQS)
#P:4 - Number of online CPUs (CONFIG_SMP)
The configuration that comes with RHEL5-RT is:
M:rt - RT preemption enabled (CONFIG_PREEMPT_RT) SP:1 - softirq preemption (CONFIG_PREEMPT_SOFTIRQS) HP:1 - hardirq preemption (CONFIG_PREEMPT_HARDIRQS)
The tracing infrastructure records the trace entries in a buffer. When
preemption is disabled, starting an event, it records the position in the
trace buffer. When preemption is re-enabled, concluding this event, it checks
if the event duration exceeds the prior maximum latency (stored in
/proc/sys/kernel/preempt_max_latency). If so, this file will be updated with
the new event duration and /proc/latency_trace is updated with a copy of the
trace buffer containing the entries for the entire event.
The contents of the trace entries are governed by additional knobs, as
described in upcoming sections.
One such latency control knob is /proc/sys/kernel/trace_verbose:
[root@mica ~]# cat /proc/sys/kernel/trace_verbose 0
This parameter is a boolean (not a verbosity knob) which is disabled by
default. To enable:
[root@mica ~]# echo 1 > /proc/sys/kernel/trace_verbose # Reset the maximum latency [root@mica ~]# echo 0 > /proc/sys/kernel/preempt_max_latency [root@mica ~]# dmesg | grep maximum-latency ( IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup. ( IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup. ( softirq-sched/2-36 |#2): new 3 us maximum-latency wakeup. ( softirq-sched/2-36 |#2): new 4 us maximum-latency wakeup. ( softirq-sched/2-36 |#2): new 4 us maximum-latency wakeup. ( softirq-timer/1-18 |#1): new 4 us maximum-latency wakeup. ( softirq-sched/1-23 |#1): new 6 us maximum-latency wakeup. ( softirq-sched/1-23 |#1): new 7 us maximum-latency wakeup.
As we saw earlier we get the first few preemption disabled events and then
top out with a maximum latency of 7us.
Now that trace_verbose has been enabled, we get more detailed
information from /proc/latency_trace:
[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
latency: 7 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
-----------------
| task: softirq-sched/1-23 (uid:0 nice:0 policy:1 rt_prio:50)
-----------------
(T6/#0) [ => swapper ] 0.000ms (+0.007ms)
(T1/#1) <...> 23 1 32 00000001 00000001 [0000031c1e6c80ae] 0.007ms (+0.000ms): thread_return+0xe7/0x1e6 <ffffffff80264665> (thread_return+0xe7/0x1e6 <ffffffff80264665>)
[root@mica ~]#
Dissecting the first line:
(T6/#0) [ => swapper ] 0.000ms (+0.007ms) T6 Trace type (refer to detail below) #0 Trace Index [ => swapper ] process command line 0.000ms Absolute time in this event +0.007ms Relative time to next trace entry
Absolute time in this event means the time from when preemption was disabled
until this trace entry. (editorial note: what does a duration of 0 indicate?)
The trace types are:
(editorial note: describe these types in more detail)
1 = FN 2 = SPECIAL 3 = PID 4 = U64 5 = SYM 6 = CMDLINE 7 = SYSCALL 8 = SYSRET
Depending on the first column ("trace type") the subsequent columns after
the second can have different meanings. Dissecting the second entry in this
maximum latency event:
(T1/#1) <...> 23 1 32 00000001 00000001 [0000031c1e6c80ae] 0.007ms (+0.000ms): thread_return+0xe7/0x1e6 <ffffffff80264665> (thread_return+0xe7/0x1e6 <ffffffff80264665>) T1 Trace type: FN #1 Trace Index (editorial: huh?) 23 Process ID 1 CPU where this entry took place 32 Flags (editorial: huh) 00000001 Preempt count (editorial: huh?) 00000001 Trace Index (editorial: different from above Trace Index?) [0000031c1e6c80ae] Timestamp 0.007ms Absolute time +0.000ms Relative time to next entry (editorial: what does 0 indicate?) thread_return Function where this trace entry took place +0xe7/0x1e6 Offset in the function where this occurred <ffffffff80264665> IP (Instruction Pointer) thread_return Function that called the function where this trace entry happened +0xe7/0x1e6 Offset in this function (editorial: duplicate from above?) <ffffffff80264665> IP (Instruction Pointer) (editorial: duplicate from above?)
The other trace types will be discussed as we encounter them in the upcoming sections.
The tracing infrastructure will continuously keep records of the functions
being called. Snapshots are taken when a new maximum latency is encountered
whenever the /proc/sys/kernel/mcount_enabled and /proc/sys/kernel/trace_enabled knobs are enabled.
As with the other knobs they are initially disabled:
[root@mica ~]# cat /proc/sys/kernel/trace_enabled 0 [root@mica ~]# cat /proc/sys/kernel/mcount_enabled 0
Enabling is accomplished in the usual manner:
[root@mica ~]# echo 1 > /proc/sys/kernel/mcount_enabled
The realtime kernel is built using the gcc profiling mechanism. Here the
compiler inserts calls to a function called mcount. As implemented in the
realtime kernel, mcount examines the value of /proc/sys/kernel/mcount_enabled,
immediately returning if it is zero. Effectively reducing the overhead to
just one function call plus a branch when full fledged function traces are
not desired or deemed too costly in the troubleshooting process.
Alternatively, if /proc/sys/kernel/mcount_enabled is enabled, the mcount
function calls the tracing infrastructure.
[root@mica ~]# echo 1 > /proc/sys/kernel/trace_enabled
Now the tracing infrastructure will continuously keep record of the functions
being called, taking snapshots when a new maximum latency is experienced.
To start a new session with tracing enabled:
[root@mica ~]# echo 0 > /proc/sys/kernel/preempt_max_latency
After the workload has been given an opportunity to run, /proc/latency_trace
contains the backtrace of the maximum latency experienced thusfar:
[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
latency: 10 us, #13/13, CPU#2 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
-----------------
| task: softirq-sched/2-36 (uid:0 nice:0 policy:1 rt_prio:50)
-----------------
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
sshd-3338 2D.h1 0us : __trace_start_sched_wakeup <<...>-36> (49 2)
sshd-3338 2D.h1 0us : __trace_start_sched_wakeup (ffffffff80248930 0 0)
sshd-3338 2DNh1 1us : try_to_wake_up <<...>-36> (150 -4)
sshd-3338 2DNh. 1us : try_to_wake_up (ffffffff8028db3f 0 0)
sshd-3338 2DNh1 2us : __activate_task <<...>-31> (150 3)
sshd-3338 2DNh1 3us : try_to_wake_up <<...>-31> (150 -4)
sshd-3338 2DNh1 4us : enqueue_hrtimer (f5ebcbf5c0 ffff8100070c5000)
sshd-3338 2DN.1 6us+: __sched_text_start (ffffffff80264ea9 0 0)
<...>-36 2D..2 9us : thread_return <sshd-3338> (-4 150)
<...>-36 2D..1 9us : trace_stop_sched_switched <<...>-36> (49 2)
<...>-36 2D..1 10us : thread_return (thread_return)
(editorial note: what is need-resched, and preempt-depth?)
Another way of using the tracing infrastructure is by enabling
/proc/sys/kernel/trace_user_triggered. That will trace only processes that
explicitly used the prctl(0, 1) syscall just before the critical session
where tracing is desired and prctl(0, 0) after it. This provides a
mechanism to isolate tracing activity to a codepath of interest. During this
more isolated form of tracing, the remaining tracking control knobs have
their same behavioral effect.
When /proc/sys/kernel/trace_user_triggered is enabled, no preemption
events will be recorded in /proc/sys/kernel/preempt_max_latency nor
in /proc/latency_trace; until an application calls prctl(0, 1)
and will stop recording events as soon as the application calls prctl(0, 0).
The following is an illustrative example demonstrating the limited tracing
capability:
[root@mica rt]# cat usleep.c
int main(int argc, char *argv[])
{
int us = atoi(argv[1]);
prctl(0, 1);
usleep(us);
prctl(0, 0);
return 0;
}
Error checking has been omitted for brevity, but the prctl calls can fail if
the following step is not done prior to running the above program:
[root@mica rt]# echo 1 > /proc/sys/kernel/trace_user_triggered
Now lets demonstrate a user triggered tracing session:
[root@mica rt]# echo 0 > /proc/sys/kernel/preempt_max_latency
Here we can see a difference of user triggered tracing, as dmesg won't show any
new max latency before a program uses prctl(0, 1) to enabling tracing.
[root@mica rt]# ./usleep 50
This makes the critical session in the example program above take at least
50us. To examine the corresponding latency trace logged:
[root@mica rt]# dmesg | grep latency | tail -1 ( usleep-4656 |#3): new 65 us user-latency. [root@mica rt]#
As with the messages for the kernel latency tracer we can see the name of the
process (usleep), its PID (4656), the CPU where it was run (3), as well as the
user level latency detected, 65us.
[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
latency: 65 us, #20/20, CPU#3 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
-----------------
| task: usleep-4656 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------
=> started at: sys_prctl+0x20/0x1b7 <ffffffff8029d4fe>
=> ended at: thread_return+0xe7/0x1e6 <ffffffff80264665>
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
usleep-4656 3.... 1us+< (0)
usleep-4656 3.... 5us > sys_nanosleep (00007fff1e9b4310 0000000000000000 0000000000000000)
usleep-4656 3D..1 6us : enqueue_hrtimer (60c 2501f5f ffff8100c8255e98)
usleep-4656 3D..1 7us : __sched_text_start (ffffffff80264929 0 0)
usleep-4656 3D..2 8us+: deactivate_task <usleep-4656> (-3 1)
<idle>-0 3D..2 11us+: thread_return <usleep-4656> (-3 20)
<idle>-0 3D.h1 57us : hrtimer_interrupt (60c 2502312 0)
<idle>-0 3D.h2 58us : hrtimer_interrupt (60c 2501f5f ffff8100c8255e98)
<idle>-0 3D.h2 58us : try_to_wake_up (ffffffff8028db3f 0 0)
<idle>-0 3D.h3 59us : effective_prio <usleep-4656> (-3 -3)
<idle>-0 3D.h3 59us : __activate_task <usleep-4656> (-3 0)
<idle>-0 3DNh3 60us : try_to_wake_up <usleep-4656> (-3 20)
<idle>-0 3DN.1 61us : __sched_text_start (ffffffff8024af02 0 0)
<idle>-0 3DN.2 62us : effective_prio <usleep-4656> (-3 -3)
usleep-4656 3D..2 63us : thread_return <<idle>-0> (20 -3)
usleep-4656 3.... 63us < (0)
usleep-4656 3.... 64us > sys_prctl (0000000000000000 0000000000000000 0000000000000000)
MRG Realtime provides a tool called trace-cmd ( for further information, please check RHEL-RT TraceCmdHowto), available in the yum repository, that executes a given task with user-triggered latency tracing enabled. It turns latency tracer on, executes the requested application, turns latency tracer off and displays the latency tracer log.
[root@mica ~]# trace-cmd /usr/bin/signaltest -l 10000 -t 4
2. Troubleshooting
(editorial note: complete this section)
2.1. Timestamp drifting:
Use "idle=poll" to workaround rdtsc drifting
Reference: http://lkml.org/lkml/2006/6/24/177
