35 Kernel arguments for low latency and high performance
Configuring the appropriate kernel arguments is essential for optimizing performance, achieving low latency, and ensuring successful cluster deployment for telco workloads. While some parameters are designed specifically to enable the real-time kernel to function optimally, this section applies to both real-time (RT) and default kernel configurations. Additionally, certain arguments are mandatory for the Directed Network Provisioning method to successfully deploy downstream cluster nodes.
Remove kthread_cpus when using the SUSE real-time kernel. This parameter is meant to control on which CPUs kernel threads are created, and which CPUs are allowed for PID 1 and for loading kernel modules (the kmod user-space helper). However, it is not recognized by this kernel and does not have any effect.
Isolate the CPU cores using isolcpus, nohz_full, rcu_nocbs, and irqaffinity. For a comprehensive list of CPU pinning techniques, refer to Chapter 36, CPU Pinning on Host.
Add the domain, nohz, and managed_irq flags to the isolcpus kernel argument. Without any flags, isolcpus is equivalent to specifying only the domain flag, which isolates the specified CPUs from scheduling, including kernel tasks. The nohz flag stops the scheduler tick on the specified CPUs (if only one task is runnable on a CPU), and the managed_irq flag avoids routing managed external (device) interrupts to the specified CPUs. Note that the IRQ lines of NVMe devices are fully managed by the kernel and are routed to the non-isolated (housekeeping) cores as a consequence. For example, the command line provided at the end of this section results in only four queues (plus an admin/control queue) allocated on the system:
for I in $(grep nvme0 /proc/interrupts | cut -d ':' -f1); do cat /proc/irq/${I}/effective_affinity_list; done | column
39 0 19 20 39
This behavior prevents disk I/O from disrupting time-sensitive applications running on the isolated cores, but might require attention and careful design for storage-focused workloads.
Tune the ticks (the kernel’s periodic timer interrupts):
skew_tick=1: Ticks can sometimes happen simultaneously. Instead of all CPUs receiving their timer tick at the exact same moment, skew_tick=1 makes them occur at slightly offset times. This helps reduce system jitter, resulting in more consistent and lower interrupt response times (an essential requirement for latency-sensitive applications).
nohz=on: Stops the periodic timer tick on idle CPUs.
nohz_full=<cpu-cores>: Stops the periodic timer tick on the specified CPUs that are dedicated to real-time applications.
Disable Machine Check Exception (MCE) handling by specifying mce=off. MCEs are hardware errors detected by the processor, and disabling them can avoid noisy logs.
Add nowatchdog to disable the soft-lockup watchdog, which is implemented as a timer running in the timer hard-interrupt context. When it expires (that is, a soft lockup is detected), it prints a warning in the hard-interrupt context, ruining any latency targets. Even if it never expires, it goes onto the timer list, slightly increasing the overhead of every timer interrupt. This option also disables the NMI watchdog, so NMIs cannot interfere.
nmi_watchdog=0 disables the NMI (Non-Maskable Interrupt) watchdog. This can be omitted when nowatchdog is used.
RCU (Read-Copy-Update) is a kernel mechanism that enables concurrent, lock-free access for many readers to shared data. An RCU callback is a function triggered after a 'grace period', which ensures that all previous readers have finished so that old data can be safely reclaimed. We fine-tune RCU, particularly for sensitive workloads, to offload these callbacks from dedicated (pinned) CPUs, preventing kernel operations from interfering with critical, time-sensitive tasks.
Specify the pinned CPUs in rcu_nocbs so that RCU callbacks do not run on them. This helps reduce jitter and latency for real-time workloads.
rcu_nocb_poll makes the no-callback CPUs regularly 'poll' to see if callback handling is required. This can reduce the interrupt overhead.
rcupdate.rcu_cpu_stall_suppress=1 suppresses RCU CPU stall warnings, which can sometimes be false positives in heavily loaded real-time systems.
rcupdate.rcu_expedited=1 speeds up the grace period for RCU operations, making read-side critical sections more responsive.
rcupdate.rcu_normal_after_boot=1, when used with rcu_expedited, allows RCU to revert to normal (non-expedited) operation after the system boots.
rcupdate.rcu_task_stall_timeout=0 disables the RCU task stall detector, preventing potential warnings or system halts from long-running RCU tasks.
rcutree.kthread_prio=99 sets the priority of the RCU callback kernel thread to the highest possible value (99), ensuring that it gets scheduled and handles RCU callbacks promptly when needed.
Add ignition.platform.id=openstack for Metal3 and Cluster API to successfully provision and deprovision the cluster. This is used by the Metal3 Python agent, which originated from OpenStack Ironic.
Enable Predictable Network Interface Naming via net.ifnames=1. From SUSE Linux Micro 6.2 onwards, this is enabled by default and explicit configuration is not required. For versions prior to 6.2, this must be explicitly set as a kernel argument. This aligns with the predictableNicNames configuration in the Management Cluster’s Metal3 Helm chart, which is required for Directed Network Provisioning to function correctly. Consistent interface naming is also critical when SR-IOV is used.
Remove intel_pstate=passive. This option configures intel_pstate to work with generic cpufreq governors, but to make this work, it disables hardware-managed P-states (HWP) as a side effect. To reduce hardware latency, this option is not recommended for real-time workloads.
Replace intel_idle.max_cstate=0 processor.max_cstate=1 with idle=poll. The idle=poll option disables C-state transitions and keeps the CPU in the highest-performance C-state (C0). The intel_idle.max_cstate=0 option disables intel_idle, so acpi_idle is used, and processor.max_cstate=1 then sets the maximum C-state for acpi_idle. On AMD64/Intel 64 architectures, the first ACPI C-state is always POLL, but it uses a poll_idle() function, which may introduce some tiny latency by reading the clock periodically and restarting the main loop in do_idle() after a timeout (this also involves clearing and setting the TIF_POLL task flag). In contrast, idle=poll runs in a tight loop, busy-waiting for a task to be rescheduled. This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread.
Disable C1E in the BIOS to prevent the CPU from entering the C1E state when idle. C1E is a low-power state that can introduce latency when the CPU wakes from idle.
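The recommendations in the list above can be checked mechanically against a kernel command line. The following is a minimal sketch in POSIX shell, not an official tool: the function names has_arg and audit_cmdline are hypothetical, and the choice of which arguments to treat as required (for example, leaving out nmi_watchdog=0, net.ifnames=1, and ignition.platform.id=openstack because they are conditional) is an assumption of this sketch.

```shell
# Sketch only: has_arg and audit_cmdline are illustrative names, not part of any SUSE tool.

# Return 0 if argument $2 appears in command line $1,
# either as an exact token or in the form "$2=<value>".
has_arg() (
    set -f                      # do not glob-expand tokens of the command line
    for tok in $1; do
        case "$tok" in
            "$2"|"$2"=*) exit 0 ;;
        esac
    done
    exit 1
)

# Print one line per deviation from the recommendations and
# return non-zero if any deviation is found.
audit_cmdline() (
    cmdline=$1
    status=0
    # Arguments this section recommends removing or replacing.
    for arg in kthread_cpus intel_pstate=passive \
               intel_idle.max_cstate=0 processor.max_cstate=1; do
        if has_arg "$cmdline" "$arg"; then
            echo "REMOVE  $arg"
            status=1
        fi
    done
    # Arguments this section recommends adding (assumed selection).
    for arg in skew_tick=1 nohz=on nohz_full isolcpus rcu_nocbs \
               irqaffinity mce=off nowatchdog idle=poll; do
        if ! has_arg "$cmdline" "$arg"; then
            echo "MISSING $arg"
            status=1
        fi
    done
    exit "$status"
)

# Usage on a running node:
#   audit_cmdline "$(cat /proc/cmdline)"
```

The functions take the command line as a string instead of reading /proc/cmdline directly, so the same check can be applied to a planned configuration before a node is rebooted.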
The rest of this documentation covers additional parameters, including huge pages and IOMMU.
The following example shows the kernel arguments for a 32-core Intel server, including the aforementioned adjustments:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 skew_tick=1 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 ignition.platform.id=openstack net.ifnames=1 intel_iommu=on iommu=pt irqaffinity=0,31,32,63 isolcpus=domain,nohz,managed_irq,1-30,33-62 nohz_full=1-30,33-62 nohz=on mce=off nosoftlockup nowatchdog nmi_watchdog=0 quiet rcu_nocb_poll rcu_nocbs=1-30,33-62 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1 idle=poll
Here is another configuration example for a 64-core AMD server. Of the 128 logical processors (0-127), the first 8 (0-7) are designated for housekeeping, while the remaining 120 (8-127) are pinned for the applications:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=575291cf-74e8-42cf-8f2c-408a20dc00b8 skew_tick=1 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 ignition.platform.id=openstack net.ifnames=1 amd_iommu=on iommu=pt irqaffinity=0-7 isolcpus=domain,nohz,managed_irq,8-127 nohz_full=8-127 rcu_nocbs=8-127 mce=off nohz=on nowatchdog nmi_watchdog=0 nosoftlockup quiet rcu_nocb_poll rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1 idle=poll
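In both examples, isolcpus, nohz_full, and rcu_nocbs name the same set of CPUs, and irqaffinity names only the remaining housekeeping CPUs. The following POSIX shell sketch checks that property for a given command line. It is an illustration under stated assumptions: the function names (expand_cpus, arg_value, isolcpus_list, check_isolation) are hypothetical, and only plain CPU numbers and N-M ranges are handled, not the stride syntax that the kernel also accepts.

```shell
# Sketch only: all function names here are illustrative.

# Expand a CPU list such as "1-30,33-62" into one CPU number per line, sorted.
expand_cpus() (
    IFS=,
    set -f
    for part in $1; do
        case "$part" in
            *-*) lo=${part%-*}; hi=${part#*-}
                 while [ "$lo" -le "$hi" ]; do echo "$lo"; lo=$((lo + 1)); done ;;
            *)   echo "$part" ;;
        esac
    done | sort -n -u
)

# Print the value of argument $2 in command line $1 (nothing if absent).
arg_value() (
    set -f
    for tok in $1; do
        case "$tok" in
            "$2"=*) echo "${tok#*=}"; exit 0 ;;
        esac
    done
)

# Drop the leading flags (domain, nohz, managed_irq) from an isolcpus value.
isolcpus_list() {
    echo "$1" | tr ',' '\n' | grep -E '^[0-9]' | paste -sd, -
}

# Verify that the isolation arguments agree and do not overlap irqaffinity.
check_isolation() (
    cmdline=$1
    iso=$(expand_cpus "$(isolcpus_list "$(arg_value "$cmdline" isolcpus)")")
    nohz=$(expand_cpus "$(arg_value "$cmdline" nohz_full)")
    rcu=$(expand_cpus "$(arg_value "$cmdline" rcu_nocbs)")
    hk=$(expand_cpus "$(arg_value "$cmdline" irqaffinity)")
    [ "$iso" = "$nohz" ] || { echo "isolcpus and nohz_full differ"; exit 1; }
    [ "$iso" = "$rcu" ]  || { echo "isolcpus and rcu_nocbs differ"; exit 1; }
    # A CPU listed in both sets would show up twice in the merged list.
    overlap=$(printf '%s\n%s\n' "$iso" "$hk" | sort -n | uniq -d)
    [ -z "$overlap" ] || { echo "irqaffinity overlaps isolated CPUs"; exit 1; }
    echo "isolated=$(echo "$iso" | wc -l | tr -d ' ') housekeeping=$(echo "$hk" | wc -l | tr -d ' ')"
)

# Usage on a running node:
#   check_isolation "$(cat /proc/cmdline)"
```

Applied to the two example command lines above, this sketch reports isolated=60 housekeeping=4 for the Intel server and isolated=120 housekeeping=8 for the AMD server.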