RHEL-RT kdump/kexec
Setting Up kdump and kexec in the MRG Realtime kernel
The boot kernel and the kdump kernel
This is the short version of how kdump works: When the normal kernel (the 'boot' kernel) is booted, it reserves a part of system RAM it will normally never touch. This reservation is controlled by the crashkernel kernel commandline option.
During system startup an init script loads another kernel (the 'kdump' kernel) into the reserved space. When a panic happens, control is transferred from the boot kernel to the kdump kernel. The kdump kernel boots up, using only the RAM reserved for it. It writes out a crash dump of the rest of RAM (the boot kernel's address space). Then it performs a reboot, which resets the machine and brings the boot kernel back up.
The main kernel that will be booted and run (until a crash happens, or a new kernel is explicitly booted through kexec) is called the boot kernel. The only thing special about the boot kernel is that it needs a commandline option to prepare it for kdump/kexec use.
The kernel that gets booted when a crash happens, which performs the crash dump, is called the kdump kernel. This kernel needs to be built with support for being a kdump kernel, but that does not preclude its being used as a normal boot kernel.
Preparing the boot kernel for kdump and kexec
To set up a system to use kdump or kexec, you add a command line option to the boot kernel:
crashkernel=128M@16M
The first value, before the '@', is the amount of RAM to reserve. "128M" means 128 megabytes. The second value is the location. "16M" means the space should be reserved at 16 megabytes, or the address 0x1000000.
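The arithmetic behind those two values can be checked with a short shell sketch. The helper name decode_crashkernel is made up for this illustration, and it only understands the SIZE"M"@OFFSET"M" form used in the example above, not every syntax the kernel accepts.

```shell
# Hypothetical helper: decode a crashkernel=SIZE@OFFSET value given in megabytes.
# Runs in a subshell (parentheses) so its variables do not leak into the caller.
decode_crashkernel() (
    value="$1"
    size_mb="${value%%M@*}"        # text before "M@"   -> 128
    offset_mb="${value##*@}"       # text after "@"     -> 16M
    offset_mb="${offset_mb%M}"     # drop trailing "M"  -> 16
    size_bytes=$(( size_mb * 1024 * 1024 ))
    offset_bytes=$(( offset_mb * 1024 * 1024 ))
    # Print the reservation size plus its start and end addresses in hex.
    printf 'size=%d bytes start=0x%x end=0x%x\n' \
        "$size_bytes" "$offset_bytes" "$(( offset_bytes + size_bytes ))"
)

decode_crashkernel "128M@16M"
# prints: size=134217728 bytes start=0x1000000 end=0x9000000
```

In other words, the example reserves the physical range from 16 MB up to 144 MB for the kdump kernel.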
We also provide a script called rt-setup-kdump, shipped with newer versions of the rt-setup package, which creates a basic RHEL-RT (MRG) compliant kdump setup. It uses the available RHEL5.1 or RHEL5.2 kernel as the kdump kernel.
kdump and the RHEL5 kernel
In RHEL5, there is no separate kdump kernel (except for PPC, and that isn't a concern here). The main kernel shipped for RHEL5 can be used as a kdump kernel.
The RT kernel cannot be used as a kdump kernel, but it supports booting into another kdump kernel. We suggest using the RHEL5 kernel as your kdump kernel.
Configuring the RT kernel to use the RHEL5 kdump kernel
The kdump initscripts try by default to use the booted kernel as a kdump kernel. This will not work for the RT kernel.
To override the default action, specify the kernel version you want used as the kdump kernel in /etc/sysconfig/kdump:
KDUMP_KERNELVER="2.6.18-8.1.4.el5"
Note that this is neither the filename nor the full N-V-R (name-version-release) of the desired kernel, only the version and release.
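As a rough sketch of how to find the right string, the version-release can be taken from the kernel image name, assuming the usual /boot/vmlinuz-VERSION-RELEASE naming. The helper name kver_from_image is hypothetical and the path is only an example.

```shell
# Hypothetical helper: turn a kernel image path into a KDUMP_KERNELVER value.
# Assumes the image is named vmlinuz-<version>-<release>.
kver_from_image() {
    basename "$1" | sed 's/^vmlinuz-//'
}

kver_from_image /boot/vmlinuz-2.6.18-8.1.4.el5
# prints: 2.6.18-8.1.4.el5

# On an installed system, rpm can report the same string for each kernel package:
#   rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' kernel
```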
Once you have this set up, you need to reboot with the crashkernel commandline option. If you have already done that, just restart the kdump service.
If you don't have the kdump service configured to auto-start, you will need to start it after boot:
service kdump start
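Before starting the service it can help to confirm that the running kernel was actually booted with the crashkernel option. The following is a minimal sketch; the function name has_crashkernel is invented for this example, and the chkconfig line is left commented out because it changes system configuration.

```shell
# Hypothetical helper: succeed if a kernel command line contains crashkernel=.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system the command line is read from /proc/cmdline.
if [ -r /proc/cmdline ] && has_crashkernel "$(cat /proc/cmdline)"; then
    echo "crashkernel option present on the kernel command line"
else
    echo "no crashkernel= option found; reboot with it before starting kdump"
fi

# To have the service start automatically at boot (RHEL5 SysV init):
#   chkconfig kdump on
```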
If the service starts successfully, you can test it. NOTE: This will reboot your system!
echo c > /proc/sysrq-trigger
You should see the system panic, boot a new kernel, write out a crash dump, and then reboot. All of this can take a while. After the reboot into your normal kernel, your crash dump is in a subdirectory of /var/crash.
kdump problems with some hardware
- Some devices need to be reset during the startup of the kdump kernel. The RHEL5.1 kernel supports a command-line option that these devices recognize during kdump startup. If you have any problems getting the kdump kernel to work, edit the /etc/sysconfig/kdump file and add reset_devices=1 to the KDUMP_COMMANDLINE_APPEND variable. This setting will be the default in RHEL5.2.
- Some machines (e.g. IBM's LS21) may show warning messages like the ones presented below when booting the kdump kernel, as reported in BZ#448715. Some systems just show the messages and keep on booting. Others may freeze after displaying them. A known workaround is to add "acpi=noirq" as a boot parameter to the kdump kernel. Use this parameter only when the messages below have been seen, because it may cause boot problems on systems not affected by the original issue. The offending message is:
irq 9: nobody cared (try booting with the "irqpoll" option)
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)
turning off IO-APIC fast mode.
...
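For reference, the settings discussed on this page could look roughly like this in /etc/sysconfig/kdump. This is only a sketch: the "irqpoll maxcpus=1" part is an assumed pre-existing value (keep whatever your file already contains and append to it), and passing acpi=noirq through the same variable is an assumption about how to hand that parameter to the kdump kernel.

```shell
# /etc/sysconfig/kdump (fragment)

# Kernel to use as the kdump kernel: version and release only.
KDUMP_KERNELVER="2.6.18-8.1.4.el5"

# Extra options for the kdump kernel command line.
# "irqpoll maxcpus=1" is assumed to be the existing value; reset_devices=1 is appended.
KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 reset_devices=1"

# Only if the "irq 9: nobody cared" messages have been seen on this machine:
# KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 reset_devices=1 acpi=noirq"
```

After editing the file, restart the kdump service so the kdump kernel is reloaded with the new options.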
