33 Telco features configuration #
This section documents and explains the configuration of Telco-specific features on ATIP-deployed clusters.
The directed network provisioning deployment method is used, as described in the ATIP Automated Provision (Chapter 34, Fully automated directed network provisioning) section.
The following topics are covered in this section:
Kernel image for real time (Section 33.1, “Kernel image for real time”): Kernel image to be used by the real-time kernel.
Kernel arguments for low latency and high performance (Section 33.2, “Kernel arguments for low latency and high performance”): Kernel arguments to be used by the real-time kernel for maximum performance and low latency running telco workloads.
CPU tuned configuration (Section 33.3, “CPU tuned configuration”): Tuned configuration to be used by the real-time kernel.
CNI configuration (Section 33.4, “CNI Configuration”): CNI configuration to be used by the Kubernetes cluster.
SR-IOV configuration (Section 33.5, “SR-IOV”): SR-IOV configuration to be used by the Kubernetes workloads.
DPDK configuration (Section 33.6, “DPDK”): DPDK configuration to be used by the system.
vRAN acceleration card (Section 33.7, “vRAN acceleration (Intel ACC100/ACC200)”): Acceleration card configuration to be used by the Kubernetes workloads.
Huge pages (Section 33.8, “Huge pages”): Huge pages configuration to be used by the Kubernetes workloads.
CPU pinning configuration (Section 33.9, “CPU pinning configuration”): CPU pinning configuration to be used by the Kubernetes workloads.
NUMA-aware scheduling configuration (Section 33.10, “NUMA-aware scheduling”): NUMA-aware scheduling configuration to be used by the Kubernetes workloads.
Metal LB configuration (Section 33.11, “Metal LB”): Metal LB configuration to be used by the Kubernetes workloads.
Private registry configuration (Section 33.12, “Private registry configuration”): Private registry configuration to be used by the Kubernetes workloads.
33.1 Kernel image for real time #
The real-time kernel image is not necessarily better than a standard kernel; it is a different kernel tuned to a specific use case. The real-time kernel is tuned for lower latency at the cost of throughput. It is not recommended for general-purpose use, but it is the recommended kernel for telco workloads where latency is a key factor.
The real-time kernel provides four key features:
Deterministic execution:
Get greater predictability: ensure critical business processes complete in time, every time, and deliver high-quality service even under heavy system loads. By shielding key system resources for high-priority processes, you can ensure greater predictability for time-sensitive applications.
Low jitter:
Low jitter, built upon the highly deterministic technology, helps keep applications synchronized with the real world. This benefits services that need ongoing and repeated calculation.
Priority inheritance:
Priority inheritance refers to the ability of a lower priority process to assume a higher priority when there is a higher priority process that requires the lower priority process to finish before it can accomplish its task. SUSE Linux Enterprise Real Time solves these priority inversion problems for mission-critical processes.
Thread interrupts:
Processes running in interrupt mode in a general-purpose operating system are not preemptible. With SUSE Linux Enterprise Real Time, these interrupts have been encapsulated by kernel threads, which are interruptible, and allow the hard and soft interrupts to be preempted by user-defined higher priority processes.
In our case, if you have installed a real-time image like SLE Micro RT, the real-time kernel is already installed. You can also download the real-time kernel image from the SUSE Customer Center.
Note: For more information about the real-time kernel, visit SUSE Real Time.
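You can verify which kernel is currently running; on a real-time image the version string carries an -rt suffix (the exact version shown below is only an example and depends on your installation):
$ uname -r
6.4.0-9-rt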
33.2 Kernel arguments for low latency and high performance #
The kernel arguments must be configured so that the real-time kernel works properly, delivering the best performance and low latency when running telco workloads. There are some important concepts to keep in mind when configuring the kernel arguments for this use case:
Remove kthread_cpus when using the SUSE real-time kernel. This parameter controls which CPUs kernel threads are created on. It also controls which CPUs are allowed for PID 1 and for loading kernel modules (the kmod user-space helper). The real-time kernel does not recognize this parameter, so it has no effect.
Add the domain,nohz,managed_irq flags to the isolcpus kernel argument. Without any flags, isolcpus is equivalent to specifying only the domain flag. This isolates the specified CPUs from scheduling, including kernel tasks. The nohz flag stops the scheduler tick on the specified CPUs (if only one task is runnable on a CPU), and the managed_irq flag avoids routing managed external (device) interrupts to the specified CPUs.
Remove intel_pstate=passive. This option configures intel_pstate to work with generic cpufreq governors, but to make this work, it disables hardware-managed P-states (HWP) as a side effect. To reduce the hardware latency, this option is not recommended for real-time workloads.
Replace intel_idle.max_cstate=0 processor.max_cstate=1 with idle=poll. The idle=poll option disables C-state transitions and keeps the CPU from entering deeper C-states. The intel_idle.max_cstate=0 option disables intel_idle, so acpi_idle is used, and acpi_idle.max_cstate=1 then sets the maximum C-state for acpi_idle. On x86_64 architectures, the first ACPI C-state is always POLL, but it uses a poll_idle() function, which may introduce some tiny latency by reading the clock periodically and restarting the main loop in do_idle() after a timeout (this also involves clearing and setting the TIF_POLL task flag). In contrast, idle=poll runs in a tight loop, busy-waiting for a task to be rescheduled. This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread.
Disable C1E in the BIOS. This is important to prevent the CPU from entering the C1E state when idle. The C1E state is a low-power state that can introduce latency when the CPU wakes up.
Add nowatchdog to disable the soft-lockup watchdog, which is implemented as a timer running in the timer hard-interrupt context. When it expires (that is, when a soft lockup is detected), it prints a warning (in the hard-interrupt context), ruining any latency targets. Even if it never expires, it goes onto the timer list, slightly increasing the overhead of every timer interrupt. This option also disables the NMI watchdog, so NMIs cannot interfere.
Add nmi_watchdog=0. This option disables only the NMI watchdog.
This is an example of the kernel argument list including the aforementioned adjustments:
GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"
33.3 CPU tuned configuration #
The CPU tuned configuration makes it possible to isolate CPU cores for the real-time workloads. It is important to prevent the OS from using those cores, because the OS could otherwise use them and increase the latency of the real-time workloads.
To enable and configure this feature, the first thing is to create a profile for the CPU cores we want to isolate. In this case, we are isolating the cores 1-18 and 21-38.
$ echo "export tuned_params" >> /etc/grub.d/00_tuned $ echo "isolated_cores=1-18,21-38" >> /etc/tuned/cpu-partitioning-variables.conf $ tuned-adm profile cpu-partitioning Tuned (re)started, changes applied.
Then we need to modify the GRUB options to isolate the CPU cores and set other important parameters for CPU usage. The following parameters must be customized to match your current hardware specifications:
parameter | value | description |
---|---|---|
isolcpus | domain,nohz,managed_irq,1-18,21-38 | Isolates the cores 1-18 and 21-38 from the scheduler. |
skew_tick | 1 | This option allows the kernel to skew the timer interrupts across the isolated CPUs. |
nohz | on | This option stops the periodic timer tick on a CPU when it is idle. |
nohz_full | 1-18,21-38 | This kernel boot parameter is the current main interface to configure full dynticks along with CPU isolation. |
rcu_nocbs | 1-18,21-38 | This option offloads RCU callback processing from the listed CPUs. |
irqaffinity | 0,19,20,39 | This option sets the default IRQ affinity so interrupts are handled by the housekeeping CPUs. |
idle | poll | This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread. |
nmi_watchdog | 0 | This option disables only the NMI watchdog. |
nowatchdog | | This option disables the soft-lockup watchdog, which is implemented as a timer running in the timer hard-interrupt context. |
With the values shown above, we are isolating 36 cores (1-18 and 21-38), and we are using four cores (0, 19, 20 and 39) for the OS.
The following commands modify the GRUB configuration and apply the changes mentioned above to be present on the next boot:
Edit the /etc/default/grub file and add the parameters mentioned above:
GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"
Update the GRUB configuration:
$ transactional-update grub.cfg
$ reboot
To validate that the parameters are applied after the reboot, the following command can be used to check the kernel command line:
$ cat /proc/cmdline
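Where the kernel exposes them, the isolated and full-dynticks CPU sets can also be cross-checked through sysfs; the values below assume the example CPU layout used throughout this section:
$ cat /sys/devices/system/cpu/isolated
1-18,21-38
$ cat /sys/devices/system/cpu/nohz_full
1-18,21-38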
There is another script that can be used to tune the CPU configuration. It basically performs the following steps:
Set the CPU governor to performance.
Unset the timer migration to the isolated CPUs.
Migrate the kdaemon threads to the housekeeping CPUs.
Set the isolated CPUs latency to the lowest possible value.
Delay the vmstat updates to 300 seconds.
The script is available at SUSE ATIP Github repository - performance-settings.sh.
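The following is a minimal sketch of those steps, assuming the CPU layout used in this section; the performance-settings.sh script in the repository is the reference implementation and may differ in details:
#!/bin/bash
# Sketch only: apply the CPU tuning steps described above
HOUSEKEEPING_CPUS="0,19,20,39"

# Set the CPU governor to performance on all CPUs
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$gov"
done

# Unset timer migration so timers are not migrated to the isolated CPUs
echo 0 > /proc/sys/kernel/timer_migration

# Migrate kernel daemon threads to the housekeeping CPUs (where the kernel allows it)
for pid in $(ps -eo pid=,args= | awk '$2 ~ /^\[/ {print $1}'); do
  taskset -pc "$HOUSEKEEPING_CPUS" "$pid" 2>/dev/null || true
done

# Request the lowest possible CPU latency (effective only while this descriptor stays open)
exec 3> /dev/cpu_dma_latency
echo -n 0 >&3

# Delay the vmstat updates to 300 seconds
sysctl -w vm.stat_interval=300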
33.4 CNI Configuration #
33.4.1 Cilium #
Cilium is the default CNI plug-in for ATIP.
To enable Cilium on the RKE2 cluster as the default plug-in, the following configuration is required in the /etc/rancher/rke2/config.yaml file:
cni:
- cilium
This can also be specified with command-line arguments, that is, --cni=cilium in the server line of the /etc/systemd/system/rke2-server file.
To use the SR-IOV network operator described in the next section (Section 33.5, “SR-IOV”), use Multus with another CNI plug-in, like Cilium or Calico, as a secondary plug-in:
cni:
- multus
- cilium
For more information about CNI plug-ins, visit Network Options.
33.5 SR-IOV #
SR-IOV allows a device, such as a network adapter, to separate access to its resources among various PCIe hardware functions.
There are different ways to deploy SR-IOV, and here, we show two different options:
Option 1: using the SR-IOV CNI device plug-ins and a config map to configure them properly.
Option 2 (recommended): using the SR-IOV Helm chart from Rancher Prime to make this deployment easy.
Option 1 - Installation of SR-IOV CNI device plug-ins and a config map to configure it properly
Prepare the config map for the device plug-in
Get the information to fill the config map from the lspci command:
$ lspci | grep -i acc
8a:00.0 Processing accelerators: Intel Corporation Device 0d5c
$ lspci | grep -i net
19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.2 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.3 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
51:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:01.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.1 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.2 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.3 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.1 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.2 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.3 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
The config map consists of a JSON file that describes devices, using filters to discover them, and creates groups for the interfaces.
The key is understanding filters and groups: the filters are used to discover the devices, and the groups are used to create the interfaces.
For example, the following filters could be set:
vendorID: 8086 (Intel)
deviceID: 0d5c (Accelerator card)
driver: vfio-pci (driver)
pfNames: p2p1 (physical interface name)
Filters can also be set to match a more complex interface syntax, for example:
pfNames: ["eth1#1,2,3,4,5,6"] or [eth1#1-6] (physical interface name)
Related to the groups, we could create a group for the FEC card and another group for the Intel card, even creating a prefix depending on our use case, for example:
resourceName: pci_sriov_net_bh_dpdk
resourcePrefix: Rancher.io
There are many possible combinations of filters and groups to discover and create the resource group that allocates some VFs to the pods.
For more information about the filters and groups, visit sr-iov network device plug-in.
After setting the filters and groups to match the interfaces, depending on the hardware and the use case, the following config map shows an example that can be used:
apiVersion: v1
kind: ConfigMap
metadata:
name: sriovdp-config
namespace: kube-system
data:
config.json: |
{
"resourceList": [
{
"resourceName": "intel_fec_5g",
"devicetype": "accelerator",
"selectors": {
"vendors": ["8086"],
"devices": ["0d5d"]
}
},
{
"resourceName": "intel_sriov_odu",
"selectors": {
"vendors": ["8086"],
"devices": ["1889"],
"drivers": ["vfio-pci"],
"pfNames": ["p2p1"]
}
},
{
"resourceName": "intel_sriov_oru",
"selectors": {
"vendors": ["8086"],
"devices": ["1889"],
"drivers": ["vfio-pci"],
"pfNames": ["p2p2"]
}
}
]
}
Prepare the daemonset file to deploy the device plug-in.
The device plug-in supports several architectures (arm, amd, ppc64le), so the same file can be used for different architectures, deploying several daemonsets, one for each architecture.
apiVersion: v1
kind: ServiceAccount
metadata:
name: sriov-device-plugin
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: kube-sriov-device-plugin-amd64
namespace: kube-system
labels:
tier: node
app: sriovdp
spec:
selector:
matchLabels:
name: sriov-device-plugin
template:
metadata:
labels:
name: sriov-device-plugin
tier: node
app: sriovdp
spec:
hostNetwork: true
nodeSelector:
kubernetes.io/arch: amd64
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
serviceAccountName: sriov-device-plugin
containers:
- name: kube-sriovdp
image: rancher/hardened-sriov-network-device-plugin:v3.7.0-build20240816
imagePullPolicy: IfNotPresent
args:
- --log-dir=sriovdp
- --log-level=10
securityContext:
privileged: true
resources:
requests:
cpu: "250m"
memory: "40Mi"
limits:
cpu: 1
memory: "200Mi"
volumeMounts:
- name: devicesock
mountPath: /var/lib/kubelet/
readOnly: false
- name: log
mountPath: /var/log
- name: config-volume
mountPath: /etc/pcidp
- name: device-info
mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
volumes:
- name: devicesock
hostPath:
path: /var/lib/kubelet/
- name: log
hostPath:
path: /var/log
- name: device-info
hostPath:
path: /var/run/k8s.cni.cncf.io/devinfo/dp
type: DirectoryOrCreate
- name: config-volume
configMap:
name: sriovdp-config
items:
- key: config.json
path: config.json
After applying the config map and the daemonset, the device plug-in will be deployed and the interfaces will be discovered and available for the pods.
$ kubectl get pods -n kube-system | grep sriov
kube-system kube-sriov-device-plugin-amd64-twjfl 1/1 Running 0 2m
Check the interfaces discovered and available in the nodes to be used by the pods:
$ kubectl get $(kubectl get nodes -oname) -o jsonpath='{.status.allocatable}' | jq
{
  "cpu": "64",
  "ephemeral-storage": "256196109726",
  "hugepages-1Gi": "40Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_5g": "1",
  "intel.com/intel_sriov_odu": "4",
  "intel.com/intel_sriov_oru": "4",
  "memory": "221396384Ki",
  "pods": "110"
}
The FEC is intel.com/intel_fec_5g and the value is 1.
The VF is intel.com/intel_sriov_odu or intel.com/intel_sriov_oru if you deploy it with a device plug-in and the config map, without Helm charts.
If there are no interfaces here, it makes little sense to continue because the interface will not be available for pods. Review the config map and filters to solve the issue first.
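For illustration only (the resource names follow the example config map above), a workload would request one of these VFs in its pod spec like this:
...
resources:
  requests:
    intel.com/intel_sriov_odu: '1'
  limits:
    intel.com/intel_sriov_odu: '1'
...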
Option 2 (recommended) - Installation using the Rancher Helm chart for SR-IOV CNI and device plug-ins
Get Helm if not present:
$ curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Install SR-IOV.
This part could be done in two ways, using the CLI or using the Rancher UI.
- Install Operator from CLI
helm install sriov-crd oci://registry.suse.com/edge/3.1/sriov-crd-chart -n sriov-network-operator
helm install sriov-network-operator oci://registry.suse.com/edge/3.1/sriov-network-operator-chart -n sriov-network-operator
- Install Operator from Rancher UI
Once your cluster is installed and you have access to the Rancher UI, you can install the SR-IOV Operator from the Apps tab.
Make sure you select the right namespace to install the operator, for example, sriov-network-operator.
Check the deployed resources crd and pods:
$ kubectl get crd
$ kubectl -n sriov-network-operator get pods
Check the label in the nodes.
With all resources running, the label appears automatically in your node:
$ kubectl get nodes -oyaml | grep feature.node.kubernetes.io/network-sriov.capable
feature.node.kubernetes.io/network-sriov.capable: "true"
Review the daemonset to see the new sriov-network-config-daemon and sriov-rancher-nfd-worker as active and ready:
$ kubectl get daemonset -A
NAMESPACE                NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                           AGE
calico-system            calico-node                     1         1         1       1            1           kubernetes.io/os=linux                                  15h
sriov-network-operator   sriov-network-config-daemon     1         1         1       1            1           feature.node.kubernetes.io/network-sriov.capable=true   45m
sriov-network-operator   sriov-rancher-nfd-worker        1         1         1       1            1           <none>                                                  45m
kube-system              rke2-ingress-nginx-controller   1         1         1       1            1           kubernetes.io/os=linux                                  15h
kube-system              rke2-multus-ds                  1         1         1       1            1           kubernetes.io/arch=amd64,kubernetes.io/os=linux         15h
In a few minutes (it can take up to 10 minutes to be updated), the nodes are detected and configured with the SR-IOV capabilities:
$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -A
NAMESPACE                NAME     AGE
sriov-network-operator   xr11-2   83s
Check the interfaces detected.
The interfaces discovered should show the PCI address of the network device. Check this information with the lspci command on the host.
$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n kube-system -oyaml
apiVersion: v1
items:
- apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodeState
  metadata:
    creationTimestamp: "2023-06-07T09:52:37Z"
    generation: 1
    name: xr11-2
    namespace: sriov-network-operator
    ownerReferences:
    - apiVersion: sriovnetwork.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: SriovNetworkNodePolicy
      name: default
      uid: 80b72499-e26b-4072-a75c-f9a6218ec357
    resourceVersion: "356603"
    uid: e1f1654b-92b3-44d9-9f87-2571792cc1ad
  spec:
    dpConfigVersion: "356507"
  status:
    interfaces:
    - deviceID: "1592"
      driver: ice
      eSwitchMode: legacy
      linkType: ETH
      mac: 40:a6:b7:9b:35:f0
      mtu: 1500
      name: p2p1
      pciAddress: "0000:51:00.0"
      totalvfs: 128
      vendor: "8086"
    - deviceID: "1592"
      driver: ice
      eSwitchMode: legacy
      linkType: ETH
      mac: 40:a6:b7:9b:35:f1
      mtu: 1500
      name: p2p2
      pciAddress: "0000:51:00.1"
      totalvfs: 128
      vendor: "8086"
    syncStatus: Succeeded
kind: List
metadata:
  resourceVersion: ""
If your interface is not detected here, ensure that it is present in the next config map:
$ kubectl get cm supported-nic-ids -oyaml -n sriov-network-operator
If your device is not there, edit the config map, adding the right values to be discovered (a restart of the sriov-network-config-daemon daemonset may be necessary).
Create the NetworkNode Policy to configure the VFs.
Some VFs (numVfs) from the device (rootDevices) will be created, and they will be configured with the driver deviceType and the MTU:
The resourceName field must not contain any special characters and must be unique across the cluster.
The example uses deviceType: vfio-pci because dpdk will be used in combination with sr-iov. If you don’t use dpdk, the deviceType should be deviceType: netdevice (default value).
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: policy-dpdk
namespace: sriov-network-operator
spec:
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
resourceName: intelnicsDpdk
deviceType: vfio-pci
numVfs: 8
mtu: 1500
nicSelector:
deviceID: "1592"
vendor: "8086"
rootDevices:
- 0000:51:00.0
Validate configurations:
$ kubectl get $(kubectl get nodes -oname) -o jsonpath='{.status.allocatable}' | jq
{
  "cpu": "64",
  "ephemeral-storage": "256196109726",
  "hugepages-1Gi": "60Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_5g": "1",
  "memory": "200424836Ki",
  "pods": "110",
  "rancher.io/intelnicsDpdk": "8"
}
Create the sr-iov network (optional, just in case a different network is needed):
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: network-dpdk
namespace: sriov-network-operator
spec:
ipam: |
{
"type": "host-local",
"subnet": "192.168.0.0/24",
"rangeStart": "192.168.0.20",
"rangeEnd": "192.168.0.60",
"routes": [{
"dst": "0.0.0.0/0"
}],
"gateway": "192.168.0.1"
}
vlan: 500
resourceName: intelnicsDpdk
Check the network created:
$ kubectl get network-attachment-definitions.k8s.cni.cncf.io -A -oyaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: rancher.io/intelnicsDpdk
    creationTimestamp: "2023-06-08T11:22:27Z"
    generation: 1
    name: network-dpdk
    namespace: sriov-network-operator
    resourceVersion: "13124"
    uid: df7c89f5-177c-4f30-ae72-7aef3294fb15
  spec:
    config: '{ "cniVersion":"0.4.0", "name":"network-dpdk","type":"sriov","vlan":500,"vlanQoS":0,"ipam":{"type":"host-local","subnet":"192.168.0.0/24","rangeStart":"192.168.0.10","rangeEnd":"192.168.0.60","routes":[{"dst":"0.0.0.0/0"}],"gateway":"192.168.0.1"} }'
kind: List
metadata:
  resourceVersion: ""
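As an illustrative sketch (the pod name and container image are hypothetical), a pod can attach to this network through the Multus annotation and request one of the DPDK VFs at the same time:
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network-operator/network-dpdk
spec:
  containers:
  - name: app
    image: registry.example.com/dpdk-app:latest
    resources:
      requests:
        rancher.io/intelnicsDpdk: '1'
      limits:
        rancher.io/intelnicsDpdk: '1'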
33.6 DPDK #
DPDK (Data Plane Development Kit) is a set of libraries and drivers for fast packet processing. It is used to accelerate packet-processing workloads running on a wide variety of CPU architectures.
The DPDK includes data plane libraries and optimized network interface controller (NIC) drivers for the following:
A queue manager implements lockless queues.
A buffer manager pre-allocates fixed-size buffers.
A memory manager allocates pools of objects in memory and uses a ring to store free objects; it ensures that objects are spread equally across all DRAM channels.
Poll mode drivers (PMD) are designed to work without asynchronous notifications, reducing overhead.
A packet framework is a set of libraries that are helpers to develop packet processing.
The following steps show how to enable DPDK and how to create VFs from the NICs to be used by the DPDK interfaces:
Install the DPDK package:
$ transactional-update pkg install dpdk dpdk-tools libdpdk-23
$ reboot
Kernel parameters:
To use DPDK, certain kernel parameters must be enabled so that the drivers used by DPDK work properly:
parameter | value | description |
---|---|---|
iommu | pt | This option enables the use of vfio for the DPDK interfaces. |
intel_iommu | on | This option enables the use of vfio for VFs. |
To enable the parameters, add them to the /etc/default/grub
file:
GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"
Update the GRUB configuration and reboot the system to apply the changes:
$ transactional-update grub.cfg
$ reboot
Load the vfio-pci kernel module and enable SR-IOV on the NICs:
$ modprobe vfio-pci enable_sriov=1 disable_idle_d3=1
Create some virtual functions (VFs) from the NICs.
To create four VFs, for example, on each of two different NICs, the following commands are required:
$ echo 4 > /sys/bus/pci/devices/0000:51:00.0/sriov_numvfs
$ echo 4 > /sys/bus/pci/devices/0000:51:00.1/sriov_numvfs
Bind the new VFs with the vfio-pci driver:
$ dpdk-devbind.py -b vfio-pci 0000:51:01.0 0000:51:01.1 0000:51:01.2 0000:51:01.3 \
  0000:51:11.0 0000:51:11.1 0000:51:11.2 0000:51:11.3
Review that the configuration is correctly applied:
$ dpdk-devbind.py -s

Network devices using DPDK-compatible driver
============================================
0000:51:01.0 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.1 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.2 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.3 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.0 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.1 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.2 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.3 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio

Network devices using kernel driver
===================================
0000:19:00.0 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em1 drv=bnxt_en unused=igb_uio,vfio-pci *Active*
0000:19:00.1 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em2 drv=bnxt_en unused=igb_uio,vfio-pci
0000:19:00.2 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em3 drv=bnxt_en unused=igb_uio,vfio-pci
0000:19:00.3 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em4 drv=bnxt_en unused=igb_uio,vfio-pci
0000:51:00.0 'Ethernet Controller E810-C for QSFP 1592' if=eth13 drv=ice unused=igb_uio,vfio-pci
0000:51:00.1 'Ethernet Controller E810-C for QSFP 1592' if=rename8 drv=ice unused=igb_uio,vfio-pci
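As a quick, hypothetical smoke test (assuming the dpdk-tools package provides the dpdk-testpmd binary and huge pages are already configured), two of the VFs bound above can be exercised from isolated cores:
$ dpdk-testpmd -l 1-3 -n 4 -a 0000:51:01.0 -a 0000:51:01.1 -- -i
testpmd> start
testpmd> show port stats all
testpmd> quit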
33.7 vRAN acceleration (Intel ACC100/ACC200) #
As communications service providers move from 4G to 5G networks, many are adopting virtualized radio access network (vRAN) architectures for higher channel capacity and easier deployment of edge-based services and applications. vRAN solutions are ideally located to deliver low-latency services with the flexibility to increase or decrease capacity based on the volume of real-time traffic and demand on the network.
One of the most compute-intensive 4G and 5G workloads is RAN layer 1 (L1) FEC, which resolves data-transmission errors over unreliable or noisy communication channels. FEC technology detects and corrects a limited number of errors in 4G or 5G data, eliminating the need for retransmission. Since the FEC acceleration transaction does not contain cell state information, it can be easily virtualized, enabling pooling benefits and easy cell migration.
Kernel parameters
To enable the vRAN acceleration, we need to enable the following kernel parameters (if not present yet):
parameter | value | description |
---|---|---|
iommu | pt | This option enables the use of vfio for the DPDK interfaces. |
intel_iommu | on | This option enables the use of vfio for VFs. |
Modify the GRUB file /etc/default/grub to add them to the kernel command line:
GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"
Update the GRUB configuration and reboot the system to apply the changes:
$ transactional-update grub.cfg
$ reboot
To verify that the parameters are applied after the reboot, check the command line:
$ cat /proc/cmdline
Load the vfio-pci kernel module to enable the vRAN acceleration:
$ modprobe vfio-pci enable_sriov=1 disable_idle_d3=1
Get the interface information for the ACC100 card:
$ lspci | grep -i acc
8a:00.0 Processing accelerators: Intel Corporation Device 0d5c
Bind the physical interface (PF) with the vfio-pci driver:
$ dpdk-devbind.py -b vfio-pci 0000:8a:00.0
Create the virtual functions (VFs) from the physical interface (PF).
Create 2 VFs from the PF and bind them with vfio-pci following the next steps:
$ echo 2 > /sys/bus/pci/devices/0000:8a:00.0/sriov_numvfs
$ dpdk-devbind.py -b vfio-pci 0000:8b:00.0
Configure the ACC100 with the proposed configuration file:
$ pf_bb_config ACC100 -c /opt/pf-bb-config/acc100_config_vf_5g.cfg
Tue Jun 6 10:49:20 2023:INFO:Queue Groups: 2 5GUL, 2 5GDL, 2 4GUL, 2 4GDL
Tue Jun 6 10:49:20 2023:INFO:Configuration in VF mode
Tue Jun 6 10:49:21 2023:INFO: ROM version MM 99AD92
Tue Jun 6 10:49:21 2023:WARN:* Note: Not on DDR PRQ version 1302020 != 10092020
Tue Jun 6 10:49:21 2023:INFO:PF ACC100 configuration complete
Tue Jun 6 10:49:21 2023:INFO:ACC100 PF [0000:8a:00.0] configuration complete!
Check the new VFs created from the FEC PF:
$ dpdk-devbind.py -s

Baseband devices using DPDK-compatible driver
=============================================
0000:8a:00.0 'Device 0d5c' drv=vfio-pci unused=
0000:8b:00.0 'Device 0d5d' drv=vfio-pci unused=

Other Baseband devices
======================
0000:8b:00.1 'Device 0d5d' unused=
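For illustration only (assuming the intel_fec_5g resource exposed by the device plug-in configuration shown earlier), a workload would request the FEC VF in its pod spec like this:
...
resources:
  requests:
    intel.com/intel_fec_5g: '1'
  limits:
    intel.com/intel_fec_5g: '1'
...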
33.8 Huge pages #
When a process uses RAM, the CPU marks it as used by that process. For efficiency, the CPU allocates RAM in chunks; 4 KB is the default chunk size on many platforms. Those chunks are named pages. Pages can be swapped to disk, etc.
Since the process address space is virtual, the CPU and the operating system need to remember which pages belong to which process, and where each page is stored. The greater the number of pages, the longer the search for memory mappings. When a process uses 1 GB of memory, that is 262144 entries to look up (1 GB / 4 KB). If a page table entry consumes 8 bytes, that is 2 MB (262144 * 8) to look up.
Most current CPU architectures support larger-than-default pages, which give the CPU/OS fewer entries to look up.
Kernel parameters
To enable huge pages, we should add the following kernel parameters:
parameter | value | description |
---|---|---|
hugepagesz | 1G | This option sets the size of huge pages to 1 GB |
hugepages | 40 | This is the number of huge pages of the size defined above |
default_hugepagesz | 1G | This is the default size used when allocating huge pages |
Modify the GRUB file /etc/default/grub to add them to the kernel command line:
GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"
Update the GRUB configuration and reboot the system to apply the changes:
$ transactional-update grub.cfg
$ reboot
To validate that the parameters are applied after the reboot, you can check the command line:
$ cat /proc/cmdline
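The huge page pool can also be checked in /proc/meminfo; the values below assume the 40 x 1 GB configuration used in this example:
$ grep Huge /proc/meminfo
HugePages_Total:      40
HugePages_Free:       40
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB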
Using huge pages
To use the huge pages, we need to mount them:
$ mkdir -p /hugepages
$ mount -t hugetlbfs nodev /hugepages
Deploy a Kubernetes workload, creating the resources and the volumes:
...
resources:
requests:
memory: "24Gi"
hugepages-1Gi: 16Gi
intel.com/intel_sriov_oru: '4'
limits:
memory: "24Gi"
hugepages-1Gi: 16Gi
intel.com/intel_sriov_oru: '4'
...
...
volumeMounts:
- name: hugepage
mountPath: /hugepages
...
volumes:
- name: hugepage
emptyDir:
medium: HugePages
...
33.9 CPU pinning configuration #
Requirements
Must have the CPU tuned to the performance profile covered in this section (Section 33.3, “CPU tuned configuration”).
Must have the RKE2 cluster kubelet configured with the CPU management arguments, adding the following block (as an example) to the /etc/rancher/rke2/config.yaml file:
kubelet-arg:
- "cpu-manager=true"
- "cpu-manager-policy=static"
- "cpu-manager-policy-options=full-pcpus-only=true"
- "cpu-manager-reconcile-period=0s"
- "kubelet-reserved=cpu=1"
- "system-reserved=cpu=1"
Using CPU pinning on Kubernetes
There are three ways to use this feature with the Static Policy defined in kubelet, depending on the requests and limits you define on your workload:
BestEffort QoS Class: If you do not define any request or limit for CPU, the pod is scheduled on the first CPU available on the system.
An example of using the BestEffort QoS Class could be:
spec:
  containers:
  - name: nginx
    image: nginx
Burstable QoS Class: If you define a request for CPU which is not equal to the limits, or there is no CPU request.
Examples of using the Burstable QoS Class could be:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
or
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
Guaranteed QoS Class: If you define a request for CPU which is equal to the limits.
An example of using the Guaranteed QoS Class could be:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
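To verify which CPUs the kubelet has assigned exclusively, you can inspect the CPU manager state file on the node and the QoS class of the pod (the pod name is just an example, and the state file path assumes the default kubelet root directory):
$ cat /var/lib/kubelet/cpu_manager_state
$ kubectl get pod nginx -o jsonpath='{.status.qosClass}'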
33.10 NUMA-aware scheduling #
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a physical memory design used in SMP (multiprocessor) architectures, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
33.10.1 Identifying NUMA nodes #
To identify the NUMA nodes on your system, use the following command:
$ lscpu | grep NUMA
NUMA node(s):        1
NUMA node0 CPU(s):   0-63
For this example, we have only one NUMA node showing 64 CPUs.
NUMA needs to be enabled in the BIOS. If dmesg does not have records of NUMA initialization during the bootup, then NUMA-related messages in the kernel ring buffer might have been overwritten.
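For example, the NUMA topology can be cross-checked with the following commands (the numactl package may need to be installed separately):
$ dmesg | grep -i numa
$ numactl --hardware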
33.11 Metal LB #
MetalLB is a load-balancer implementation for bare-metal Kubernetes clusters, using standard routing protocols like L2 and BGP as advertisement protocols. It is a network load balancer that can be used to expose services in a Kubernetes cluster to the outside world, since bare-metal clusters need a way to use Kubernetes Services of type LoadBalancer.
To enable MetalLB in the RKE2 cluster, the following steps are required:
Install MetalLB using the following command:
$ kubectl apply -f - <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: metallb
  namespace: kube-system
spec:
  chart: oci://registry.suse.com/edge/3.1/metallb-chart
  targetNamespace: metallb-system
  version: 0.14.9
  createNamespace: true
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: endpoint-copier-operator
  namespace: kube-system
spec:
  chart: oci://registry.suse.com/edge/3.1/endpoint-copier-operator-chart
  targetNamespace: endpoint-copier-operator
  version: 0.2.1
  createNamespace: true
EOF
Create the IPAddressPool and the L2Advertisement configuration:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: kubernetes-vip-ip-pool
namespace: metallb-system
spec:
addresses:
- 10.168.200.98/32
serviceAllocation:
priority: 100
namespaces:
- default
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
name: ip-pool-l2-adv
namespace: metallb-system
spec:
ipAddressPools:
- kubernetes-vip-ip-pool
Create the endpoint service to expose the VIP:
apiVersion: v1
kind: Service
metadata:
name: kubernetes-vip
namespace: default
spec:
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: rke2-api
port: 9345
protocol: TCP
targetPort: 9345
- name: k8s-api
port: 6443
protocol: TCP
targetPort: 6443
sessionAffinity: None
type: LoadBalancer
Check the VIP is created and the MetalLB pods are running:
$ kubectl get svc -n default
$ kubectl get pods -n default
33.12 Private registry configuration #
Containerd can be configured to connect to private registries and use them to pull private images on each node.
Upon startup, RKE2 checks if a registries.yaml file exists at /etc/rancher/rke2/ and instructs containerd to use any registries defined in the file. If you wish to use a private registry, create this file as root on each node that will use the registry.
To add the private registry, create the file /etc/rancher/rke2/registries.yaml with the following content:
mirrors:
docker.io:
endpoint:
- "https://registry.example.com:5000"
configs:
"registry.example.com:5000":
auth:
username: xxxxxx # this is the registry username
password: xxxxxx # this is the registry password
tls:
cert_file: # path to the cert file used to authenticate to the registry
key_file: # path to the key file for the certificate used to authenticate to the registry
ca_file: # path to the ca file used to verify the registry's certificate
insecure_skip_verify: # may be set to true to skip verifying the registry's certificate
or without authentication:
mirrors:
docker.io:
endpoint:
- "https://registry.example.com:5000"
configs:
"registry.example.com:5000":
tls:
cert_file: # path to the cert file used to authenticate to the registry
key_file: # path to the key file for the certificate used to authenticate to the registry
ca_file: # path to the ca file used to verify the registry's certificate
insecure_skip_verify: # may be set to true to skip verifying the registry's certificate
For the registry changes to take effect, you need to either configure this file before starting RKE2 on the node, or restart RKE2 on each configured node.
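For example, depending on the node role, restarting RKE2 would look like this:
$ systemctl restart rke2-server   # on server (control-plane) nodes
$ systemctl restart rke2-agent    # on agent (worker) nodes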
For more information, see the RKE2 containerd registry configuration documentation.