RHEL-RT LatencyTracerHowto

From RHEL-RT

Jump to: navigation, search

Latency Tracer Usage

The realtime kernel comes with instrumentation, referred to as the latency tracer which is used identify cases where a desired maximum latency is being exceeded. The latency tracer can be considered a peak detector as once enabled it will display information on the longest running non-preemptable kernel codepath. As this runs, when longer non-preemptable kernel codepaths are encountered it will continue to identify those codepaths.


This tool is particularly useful in isolating whether observed non-determinism is caused by the Linux kernel, vs glibc or user space applications. For example, suppose the observed response time for an overall transaction was 200ms. By using the latency tracer if you see that the longest kernel codepath is 20ms then its an application or library issue. Since the latency tracer is a kernel feature it is primarily useful in diagnosing sources of kernel non-determinism. It is not useful for distinguishing where specifically in libraries or application programs the bottlenecks are occurring in. To do that you use the standard user space profiling tools.


This howto will help you use the latency tracer capabilities to pinpoint with increasing level of details where such long latency codepaths occur.


The latency tracer feature is enabled in the trace variant of the realtime kernel. It is not enabled in the production version of the realtime kernel (to avoid the extra overhead). It is also not enabled in the debug variant of the realtime kernel due to the significative overhead the debug code brings and the negative effect that would have in latency measurement.


(editorial note: Describe the % overhead this instrumentation costs.)


Note: the latency tracer capability is not present in the standard RHEL5 kernel - as we introduced it upstream after the RHEL5 branch. Unfortunately, we are unable to add this feature in a minor update release of RHEL5 as that would necessitate kernel ABI incompatibility. Additionally, this is to large and invasive a change to introduce in a maintenance update.


The /proc/sys/kernel/preempt_max_latency knob can be used to specify the maximum latency trigger, and it can be reset at any time, without the need to reboot the system, it starts with a very high value, effectively making it silent:


[acme@mica ~]$ cat /proc/sys/kernel/preempt_max_latency
4398046511


The tracing infrastructure keeps track of the longest latency codepath. Each time a new codepath of longer latency is encountered, the corresponding tracing information will be logged. Latencies of equal or lower duration will not be displayed. The main objective is to identify the highest latency codepath - not to display latencies of all codepaths.


For example, to start a new session, where we are interested in discovering the maximum latency in the system:


[root@mica ~]# echo 0 > /proc/sys/kernel/preempt_max_latency


This tells the tracing infrastructure that anything that is over the value set into /proc/sys/kernel/preempt_max_latency should be recorded and then this file will be modified to have the maximum latency experienced, i.e. only new "events" that create a latency larger than the previous maximum latency will be recorded. By designating a value of 0 in this example, that designates no minimum trigger and a series of latency events will be logged until a true maximum is encountered. This won't flood the system logs, usually it just emits about 10 entries until it gets to a maximum overall, i.e. the maximum latency experienced in the system.


In this context, an "event" is defined as any code sequence, whether it originated from a system call, an interrupt, or anything that disables preemption. (editorial note: can that be a user application process?)


If the the goal is to record latencies over 10us, just set /proc/sys/kernel/preempt_max_latency to 10. Then when a latency event that takes more than 10us is experienced it will be recorded in the system logs and can be viewed in /proc/latency_trace, as will be demonstrated later in the document.


The routines associated with the latency tracing logging are excluded from being measured so that they do not appear in the output.


As a result of latency tracing events triggering, corresponding messages will appear in the system log:


[root@mica ~]# dmesg | grep maximum-latency
(        IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup.
( softirq-sched/2-36   |#2): new 3 us maximum-latency wakeup.
( softirq-timer/1-18   |#1): new 5 us maximum-latency wakeup.
( softirq-timer/1-18   |#1): new 6 us maximum-latency wakeup.
( softirq-sched/0-10   |#0): new 6 us maximum-latency wakeup.
( softirq-sched/0-10   |#0): new 6 us maximum-latency wakeup.
( softirq-sched/2-36   |#2): new 7 us maximum-latency wakeup.
[root@mica ~]#


Here we see the names of the processes where the new maximum latency triggered - as well as successive higher latency codepath triggers.


Further information describing the latency tracer fields appearing in the system log:


(        IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup.


IRQ-8406   Process Name
2489       Process ID
#2         CPU


In the file /proc/latency_trace we can find much more information about the maximum latency event. Note: this /proc/latency_trace capability introduces system overhead and is therefore not enabled by default in the standard realtime kernel. To utilize this capability, you must be running the kernel-rt-trace kernel.


[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
 latency: 7 us, #2/2, CPU#2 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
    -----------------
    | task: softirq-sched/2-36 (uid:0 nice:0 policy:1 rt_prio:50)
    -----------------


                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
   <...>-36    2D..1    7us : thread_return (thread_return)
[root@mica ~]#


Dissecting the header:


latency: 7 us - maximum latency event thusfar
#2            - Number of entries in this event (editorial: describe these
                2 fields in more detail)
/2            - Number of entries + underruns and overruns
CPU#2         - CPU where this event took place
M:rt          - Type of preemption enabled in this kernel build, can
                be one of:
                        server(CONFIG_PREEMPT_NONE)
                        desktop(CONFIG_PREEMPT_VOLUNTARY)
                        preempt(CONFIG_PREEMPT_DESKTOP)
                        rt(CONFIG_PREEMPT_RT)
VP:0          - always 0
KP:0          - always 0
SP:1          - softirq preemption (CONFIG_PREEMPT_SOFTIRQS)
HP:1          - hardirq preemption (CONFIG_PREEMPT_HARDIRQS)
#P:4          - Number of online CPUs (CONFIG_SMP)


The configuration that comes with RHEL5-RT is:


M:rt          - RT preemption enabled (CONFIG_PREEMPT_RT)
SP:1          - softirq preemption (CONFIG_PREEMPT_SOFTIRQS)
HP:1          - hardirq preemption (CONFIG_PREEMPT_HARDIRQS)


The tracing infrastructure records the trace entries in a buffer. When preemption is disabled, starting an event, it records the position in the trace buffer. When preemption is re-enabled, concluding this event, it checks if the event duration exceeds the prior maximum latency (stored in /proc/sys/kernel/preempt_max_latency). If so, this file will be updated with the new event duration and /proc/latency_trace is updated with a copy of the trace buffer containing the entries for the entire event.


The contents of the trace entries are governed by additional knobs, as described in upcoming sections.


One such latency control knob is /proc/sys/kernel/trace_verbose:


[root@mica ~]# cat /proc/sys/kernel/trace_verbose
0


This parameter is a boolean (not a verbosity knob) which is disabled by default. To enable:


[root@mica ~]# echo 1 > /proc/sys/kernel/trace_verbose
# Reset the maximum latency
[root@mica ~]# echo 0 > /proc/sys/kernel/preempt_max_latency
[root@mica ~]# dmesg | grep maximum-latency
(        IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup.
(        IRQ-8406-2489 |#2): new 1 us maximum-latency wakeup.
( softirq-sched/2-36   |#2): new 3 us maximum-latency wakeup.
( softirq-sched/2-36   |#2): new 4 us maximum-latency wakeup.
( softirq-sched/2-36   |#2): new 4 us maximum-latency wakeup.
( softirq-timer/1-18   |#1): new 4 us maximum-latency wakeup.
( softirq-sched/1-23   |#1): new 6 us maximum-latency wakeup.
( softirq-sched/1-23   |#1): new 7 us maximum-latency wakeup.


As we saw earlier we get the first few preemption disabled events and then top out with a maximum latency of 7us.


Now that trace_verbose has been enabled, we get more detailed information from /proc/latency_trace:


[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
 latency: 7 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
    -----------------
    | task: softirq-sched/1-23 (uid:0 nice:0 policy:1 rt_prio:50)
    -----------------


(T6/#0) [ =>          swapper ] 0.000ms (+0.007ms)
(T1/#1)            <...>    23 1 32 00000001 00000001 [0000031c1e6c80ae] 0.007ms (+0.000ms): thread_return+0xe7/0x1e6 <ffffffff80264665> (thread_return+0xe7/0x1e6 <ffffffff80264665>)
[root@mica ~]#


Dissecting the first line:


(T6/#0) [ =>          swapper ] 0.000ms (+0.007ms)


T6                         Trace type (refer to detail below)
#0                         Trace Index
[ =>          swapper ]    process command line
0.000ms                    Absolute time in this event
+0.007ms                   Relative time to next trace entry


Absolute time in this event means the time from when preemption was disabled until this trace entry. (editorial note: what does a duration of 0 indicate?)


The trace types are: (editorial note: describe these types in more detail)


1 = FN
2 = SPECIAL
3 = PID
4 = U64
5 = SYM
6 = CMDLINE
7 = SYSCALL
8 = SYSRET


Depending on the first column ("trace type") the subsequent columns after the second can have different meanings. Dissecting the second entry in this maximum latency event:


(T1/#1)            <...>    23 1 32 00000001 00000001 [0000031c1e6c80ae] 0.007ms (+0.000ms): thread_return+0xe7/0x1e6 <ffffffff80264665> (thread_return+0xe7/0x1e6 <ffffffff80264665>)


T1                  Trace type: FN
#1                  Trace Index (editorial: huh?)
23                  Process ID
1                   CPU where this entry took place
32                  Flags (editorial: huh)
00000001            Preempt count (editorial: huh?)
00000001            Trace Index (editorial: different from above Trace Index?)
[0000031c1e6c80ae]  Timestamp
0.007ms             Absolute time
+0.000ms            Relative time to next entry (editorial: what does 0
indicate?)
thread_return       Function where this trace entry took place
+0xe7/0x1e6         Offset in the function where this occurred
<ffffffff80264665>  IP (Instruction Pointer)
thread_return       Function that called the function where this trace entry happened
+0xe7/0x1e6         Offset in this function (editorial: duplicate from above?)
<ffffffff80264665>  IP (Instruction Pointer) (editorial: duplicate from
above?)


The other trace types will be discussed as we encounter them in the upcoming sections.


The tracing infrastructure will continuously keep records of the functions being called. Snapshots are taken when a new maximum latency is encountered whenever the /proc/sys/kernel/mcount_enabled and /proc/sys/kernel/trace_enabled knobs are enabled.


As with the other knobs they are initially disabled:


[root@mica ~]# cat /proc/sys/kernel/trace_enabled
0
[root@mica ~]# cat /proc/sys/kernel/mcount_enabled
0


Enabling is accomplished in the usual manner:


[root@mica ~]# echo 1 > /proc/sys/kernel/mcount_enabled


The realtime kernel is built using the gcc profiling mechanism. Here the compiler inserts calls to a function called mcount. As implemented in the realtime kernel, mcount examines the value of /proc/sys/kernel/mcount_enabled, immediately returning if it is zero. Effectively reducing the overhead to just one function call plus a branch when full fledged function traces are not desired or deemed too costly in the troubleshooting process. Alternatively, if /proc/sys/kernel/mcount_enabled is enabled, the mcount function calls the tracing infrastructure.


[root@mica ~]# echo 1 > /proc/sys/kernel/trace_enabled


Now the tracing infrastructure will continuously keep record of the functions being called, taking snapshots when a new maximum latency is experienced.


To start a new session with tracing enabled:


[root@mica ~]# echo 0 > /proc/sys/kernel/preempt_max_latency


After the workload has been given an opportunity to run, /proc/latency_trace contains the backtrace of the maximum latency experienced thusfar:


[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
 latency: 10 us, #13/13, CPU#2 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
    -----------------
    | task: softirq-sched/2-36 (uid:0 nice:0 policy:1 rt_prio:50)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
    sshd-3338  2D.h1    0us : __trace_start_sched_wakeup <<...>-36> (49 2)
    sshd-3338  2D.h1    0us : __trace_start_sched_wakeup (ffffffff80248930 0 0)
    sshd-3338  2DNh1    1us : try_to_wake_up <<...>-36> (150 -4)
    sshd-3338  2DNh.    1us : try_to_wake_up (ffffffff8028db3f 0 0)
    sshd-3338  2DNh1    2us : __activate_task <<...>-31> (150 3)
    sshd-3338  2DNh1    3us : try_to_wake_up <<...>-31> (150 -4)
    sshd-3338  2DNh1    4us : enqueue_hrtimer (f5ebcbf5c0 ffff8100070c5000)
    sshd-3338  2DN.1    6us+: __sched_text_start (ffffffff80264ea9 0 0)
   <...>-36    2D..2    9us : thread_return <sshd-3338> (-4 150)
   <...>-36    2D..1    9us : trace_stop_sched_switched <<...>-36> (49 2)
   <...>-36    2D..1   10us : thread_return (thread_return)


(editorial note: what is need-resched, and preempt-depth?)


Another way of using the tracing infrastructure is by enabling /proc/sys/kernel/trace_user_triggered. That will trace only processes that explicitly used the prctl(0, 1) syscall just before the critical session where tracing is desired and prctl(0, 0) after it. This provides a mechanism to isolate tracing activity to a codepath of interest. During this more isolated form of tracing, the remaining tracking control knobs have their same behavioral effect.


When /proc/sys/kernel/trace_user_triggered is enabled, no preemption events will be recorded in /proc/sys/kernel/preempt_max_latency nor in /proc/latency_trace; until an application calls prctl(0, 1) and will stop recording events as soon as the application calls prctl(0, 0).


The following is an illustrative example demonstrating the limited tracing capability:


[root@mica rt]# cat usleep.c
int main(int argc, char *argv[])
{
        int us = atoi(argv[1]);

        prctl(0, 1);
        usleep(us);
        prctl(0, 0);
        return 0;
}


Error checking has been omitted for brevity, but the prctl calls can fail if the following step is not done prior to running the above program:


[root@mica rt]# echo 1 > /proc/sys/kernel/trace_user_triggered


Now lets demonstrate a user triggered tracing session:


[root@mica rt]# echo 0 > /proc/sys/kernel/preempt_max_latency


Here we can see a difference of user triggered tracing, as dmesg won't show any new max latency before a program uses prctl(0, 1) to enabling tracing.


[root@mica rt]# ./usleep 50


This makes the critical session in the example program above take at least 50us. To examine the corresponding latency trace logged:


[root@mica rt]# dmesg | grep latency | tail -1
(          usleep-4656 |#3): new 65 us user-latency.
[root@mica rt]#


As with the messages for the kernel latency tracer we can see the name of the process (usleep), its PID (4656), the CPU where it was run (3), as well as the user level latency detected, 65us.


[root@mica ~]# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.21-rc3-rt0
--------------------------------------------------------------------
 latency: 65 us, #20/20, CPU#3 | (M:rt VP:0, KP:0, SP:1 HP:1 #P:4)
    -----------------
    | task: usleep-4656 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: sys_prctl+0x20/0x1b7 <ffffffff8029d4fe>
 => ended at:   thread_return+0xe7/0x1e6 <ffffffff80264665>


                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  usleep-4656  3....    1us+< (0)
  usleep-4656  3....    5us > sys_nanosleep (00007fff1e9b4310 0000000000000000 0000000000000000)
  usleep-4656  3D..1    6us : enqueue_hrtimer (60c 2501f5f ffff8100c8255e98)
  usleep-4656  3D..1    7us : __sched_text_start (ffffffff80264929 0 0)
  usleep-4656  3D..2    8us+: deactivate_task <usleep-4656> (-3 1)
  <idle>-0     3D..2   11us+: thread_return <usleep-4656> (-3 20)
  <idle>-0     3D.h1   57us : hrtimer_interrupt (60c 2502312 0)
  <idle>-0     3D.h2   58us : hrtimer_interrupt (60c 2501f5f ffff8100c8255e98)
  <idle>-0     3D.h2   58us : try_to_wake_up (ffffffff8028db3f 0 0)
  <idle>-0     3D.h3   59us : effective_prio <usleep-4656> (-3 -3)
  <idle>-0     3D.h3   59us : __activate_task <usleep-4656> (-3 0)
  <idle>-0     3DNh3   60us : try_to_wake_up <usleep-4656> (-3 20)
  <idle>-0     3DN.1   61us : __sched_text_start (ffffffff8024af02 0 0)
  <idle>-0     3DN.2   62us : effective_prio <usleep-4656> (-3 -3)
  usleep-4656  3D..2   63us : thread_return <<idle>-0> (20 -3)
  usleep-4656  3....   63us < (0)
  usleep-4656  3....   64us > sys_prctl (0000000000000000 0000000000000000 0000000000000000)


MRG Realtime provides a tool called trace-cmd ( for further information, please check RHEL-RT TraceCmdHowto), available in the yum repository, that executes a given task with user-triggered latency tracing enabled. It turns latency tracer on, executes the requested application, turns latency tracer off and displays the latency tracer log.


[root@mica ~]# trace-cmd /usr/bin/signaltest -l 10000 -t 4


2. Troubleshooting (editorial note: complete this section)


2.1. Timestamp drifting:


Use "idle=poll" to workaround rdtsc drifting


Reference: http://lkml.org/lkml/2006/6/24/177

Personal tools