This is a draft document that was built and uploaded automatically. It may document beta software and be incomplete or even incorrect. Use this document at your own risk.


35 Kernel arguments for low latency and high performance

Configuring the appropriate kernel arguments is essential for optimizing performance, achieving low latency, and ensuring successful cluster deployment for telco workloads. While some parameters are designed specifically to enable the real-time kernel to function optimally, this section applies to both RT and default kernel configurations. Additionally, certain arguments are mandatory for the Directed Network Provisioning method to successfully deploy downstream cluster nodes.

  • Remove kthread_cpus when using the SUSE real-time kernel. Where it is supported, this parameter controls on which CPUs kernel threads are created, and which CPUs are allowed for PID 1 and for loading kernel modules (the kmod user-space helper). The SUSE real-time kernel does not recognize this parameter, so it has no effect.

  • Isolate the CPU cores using isolcpus, nohz_full, rcu_nocbs, and irqaffinity. For a comprehensive list of CPU pinning techniques, refer to Chapter 36, CPU Pinning on Host.

  • Add the domain,nohz,managed_irq flags to the isolcpus kernel argument. Without any flags, isolcpus is equivalent to specifying only the domain flag, which isolates the specified CPUs from scheduling, including kernel tasks. The nohz flag stops the scheduler tick on the specified CPUs (if only one task is runnable on a CPU), and the managed_irq flag avoids routing managed external (device) interrupts to the specified CPUs. Note that the IRQ lines of NVMe devices are fully managed by the kernel and, as a consequence, are routed to the non-isolated (housekeeping) cores. For example, the command line provided at the end of this section results in only four queues (plus an admin/control queue) being allocated on the system:

    for I in $(grep nvme0 /proc/interrupts | cut -d ':' -f1); do cat /proc/irq/${I}/effective_affinity_list; done | column
    39      0       19      20      39

    This behavior prevents disk I/O from disrupting any time-sensitive application running on the isolated cores, but might require attention and careful design for storage-focused workloads.

  • Tune the ticks (kernel’s periodic timer interrupts):

    • skew_tick=1: ticks can sometimes happen simultaneously. Instead of all CPUs receiving their timer tick at the exact same moment, skew_tick=1 makes them occur at slightly offset times. This helps reduce system jitter, resulting in more consistent and lower interrupt response times (an essential requirement for latency-sensitive applications).

    • nohz=on: stops the periodic timer tick on idle CPUs.

    • nohz_full=<cpu-cores>: Stops the periodic timer tick on specified CPUs that are dedicated for real-time applications.

  • Disable Machine Check Exception (MCE) handling by specifying mce=off. MCEs are hardware errors detected by the processor and disabling them can avoid noisy logs.

  • Add nowatchdog to disable the soft-lockup watchdog, which is implemented as a timer running in the timer hard-interrupt context. When it expires (that is, when a soft lockup is detected), it prints a warning in the hard-interrupt context, ruining any latency targets. Even if it never expires, it goes onto the timer list, slightly increasing the overhead of every timer interrupt. This option also disables the NMI watchdog, so NMIs cannot interfere.

  • nmi_watchdog=0 disables the NMI (Non-Maskable Interrupt) watchdog. This can be omitted when nowatchdog is used.

  • RCU (Read-Copy-Update) is a kernel mechanism that enables concurrent, lock-free access for many readers to shared data. An RCU callback, a function triggered after a 'grace period', ensures all previous readers have finished so old data can be safely reclaimed. We fine-tune RCU, particularly for sensitive workloads, to offload these callbacks from dedicated (pinned) CPUs, preventing kernel operations from interfering with critical, time-sensitive tasks.

    • Specify the pinned CPUs in rcu_nocbs so that RCU callbacks do not run on them. This helps reduce jitter and latency for the real-time workloads.

    • rcu_nocb_poll makes the RCU offload kernel threads poll for pending callbacks instead of being woken up by the no-callback CPUs. This relieves the no-callback CPUs of the wake-up overhead.

    • rcupdate.rcu_cpu_stall_suppress=1 suppresses RCU CPU stall warnings, which can sometimes be false positives in heavily loaded real-time systems.

    • rcupdate.rcu_expedited=1 uses expedited RCU grace periods, so that updates waiting for a grace period complete sooner.

    • rcupdate.rcu_normal_after_boot=1, when used with rcu_expedited, allows RCU to revert to normal (non-expedited) operation after the system has booted.

    • rcupdate.rcu_task_stall_timeout=0 disables the RCU task stall detector, preventing potential warnings or system halts from long-running RCU tasks.

    • rcutree.kthread_prio=99 sets the priority of the RCU callback kernel thread to the highest possible (99), ensuring it gets scheduled and handles RCU callbacks promptly, when needed.

  • Add ignition.platform.id=openstack for Metal3 and Cluster API to successfully provision/deprovision the cluster. This is used by the Metal3 Python agent, which originated from OpenStack Ironic.

  • Enable Predictable Network Interface Naming via net.ifnames=1. From SUSE Linux Micro 6.2 onwards, this is enabled by default and explicit configuration is not required. For versions prior to 6.2, this must be explicitly set as a kernel argument. This aligns with the predictableNicNames configuration in the Management Cluster’s Metal3 Helm chart, which is required for Directed Network Provisioning to function correctly. Consistent interface naming is also critical when SR-IOV is utilized.

  • Remove intel_pstate=passive. This option configures intel_pstate to work with generic cpufreq governors, but to make this work, it disables hardware-managed P-states (HWP) as a side effect. To reduce the hardware latency, this option is not recommended for real-time workloads.

  • Replace intel_idle.max_cstate=0 processor.max_cstate=1 with idle=poll. The idle=poll option avoids C-state transitions altogether by keeping the CPU in C0, the fully active state. The intel_idle.max_cstate=0 option disables intel_idle, so acpi_idle is used, and processor.max_cstate=1 then sets the maximum C-state for acpi_idle. On AMD64/Intel 64 architectures, the first ACPI C-state is always POLL, but it uses a poll_idle() function, which may introduce a small latency by reading the clock periodically and restarting the main loop in do_idle() after a timeout (this also involves clearing and setting the TIF_POLL task flag). In contrast, idle=poll runs in a tight loop, busy-waiting for a task to be rescheduled. This minimizes the latency of exiting the idle state, at the cost of keeping the CPU running at full speed in the idle thread.

  • Disable C1E in the BIOS. C1E is a low-power idle state, and entering and leaving it adds latency. Disabling it in the BIOS prevents the CPU from entering this state when idle.
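
The removals and replacements recommended in the list above can be checked mechanically. The following is a minimal sketch, not part of any SUSE tooling: the function name find_discouraged_kargs is hypothetical, and it assumes the kernel command line is passed in as a single string (for example, the contents of /proc/cmdline).

```shell
#!/bin/sh
# Hypothetical helper (not a SUSE tool): print every argument in a kernel
# command line that this section recommends removing or replacing.
find_discouraged_kargs() {
    # $1: kernel command line as one string, e.g. "$(cat /proc/cmdline)"
    # $1 is left unquoted on purpose so the shell splits it on whitespace.
    for arg in $1; do
        case "$arg" in
            kthread_cpus=*|intel_pstate=passive|intel_idle.max_cstate=*|processor.max_cstate=*)
                printf '%s\n' "$arg"
                ;;
        esac
    done
}
```

On a running node it could be called as find_discouraged_kargs "$(cat /proc/cmdline)". An empty result means none of the discouraged arguments is present. It says nothing about whether the recommended arguments are set.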

The rest of this documentation covers additional parameters, including huge pages and IOMMU.

The following is an example of kernel arguments for a 32-core Intel server, including the aforementioned adjustments:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 skew_tick=1 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 ignition.platform.id=openstack net.ifnames=1 intel_iommu=on iommu=pt irqaffinity=0,31,32,63 isolcpus=domain,nohz,managed_irq,1-30,33-62 nohz_full=1-30,33-62 nohz=on mce=off nosoftlockup nowatchdog nmi_watchdog=0 quiet rcu_nocb_poll rcu_nocbs=1-30,33-62 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1 idle=poll
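
In the examples in this section, isolcpus (after its flags), nohz_full and rcu_nocbs all name the same CPU list. The sketch below checks that convention on a command-line string. The function names get_karg, isolcpus_cpulist and cpu_isolation_consistent are hypothetical, and the comparison is a plain string match, so it assumes the three lists are written identically rather than merely covering the same CPUs.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one kernel argument.
# $1: argument name, $2: kernel command line as one string.
# An argument that appears several times (such as console=) prints one line each.
get_karg() {
    for arg in $2; do
        case "$arg" in
            "$1"=*) printf '%s\n' "${arg#*=}" ;;
        esac
    done
}

# Hypothetical helper: strip the leading isolcpus flags (domain, nohz,
# managed_irq) from an isolcpus value, leaving only the CPU list.
isolcpus_cpulist() {
    printf '%s\n' "$1" | sed -E 's/^((domain|nohz|managed_irq),)+//'
}

# Succeeds only if isolcpus, nohz_full and rcu_nocbs carry the same,
# non-empty CPU list string.
cpu_isolation_consistent() {
    iso="$(isolcpus_cpulist "$(get_karg isolcpus "$1")")"
    [ -n "$iso" ] &&
        [ "$iso" = "$(get_karg nohz_full "$1")" ] &&
        [ "$iso" = "$(get_karg rcu_nocbs "$1")" ]
}
```

For example, cpu_isolation_consistent "$(cat /proc/cmdline)" && echo consistent reports whether the running system follows the same pattern as the examples.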

Here is another configuration example for a 64-core AMD server. Among the 128 logical processors (0-127), the first 8 (0-7) are designated for housekeeping, while the remaining 120 (8-127) are pinned for the applications:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=575291cf-74e8-42cf-8f2c-408a20dc00b8 skew_tick=1 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 ignition.platform.id=openstack net.ifnames=1 amd_iommu=on iommu=pt irqaffinity=0-7 isolcpus=domain,nohz,managed_irq,8-127 nohz_full=8-127 rcu_nocbs=8-127 mce=off nohz=on nowatchdog nmi_watchdog=0 nosoftlockup quiet rcu_nocb_poll rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1 idle=poll
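
The housekeeping/isolated split described for this server can be sanity-checked by counting the CPUs in each list. The following sketch uses a hypothetical helper named count_cpus. It assumes the list contains only single CPU numbers and simple ranges, and it does not handle the stride syntax that the kernel also accepts.

```shell
#!/bin/sh
# Hypothetical helper: count the CPUs in a kernel-style CPU list such as
# "0,31,32,63" or "8-127". Only single numbers and simple ranges are handled.
# The body runs in a subshell, so the IFS change does not leak to the caller.
count_cpus() (
    IFS=,
    total=0
    for part in $1; do
        case "$part" in
            # Range "first-last": add (last - first + 1) CPUs.
            *-*) total=$((total + ${part#*-} - ${part%-*} + 1)) ;;
            # Single CPU number.
            *)   total=$((total + 1)) ;;
        esac
    done
    printf '%s\n' "$total"
)
```

With the values from the AMD example, count_cpus 0-7 prints 8 and count_cpus 8-127 prints 120, which matches the 8 housekeeping and 120 pinned logical processors.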