54 Telco features (DPDK, SR-IOV, CPU isolation, huge pages, NUMA, etc.)
The directed network provisioning workflow makes it possible to automate the configuration of the Telco features used in the downstream clusters, so that Telco workloads can run on top of those servers.
Requirements
The image generated using EIB, as described in the previous section (Chapter 49, Prepare downstream cluster image for connected scenarios), has to be located in the management cluster exactly on the path you configured in this section (Note).
The image generated using EIB has to include the specific Telco packages, following this section (Section 49.2.5, “Additional configuration for Telco workloads”).
The management server has to be created and available for use in the following sections. For more information, refer to the Management Cluster section: Part V, “Setting up the management cluster”.
Configuration
Use the following two sections as the base to enroll and provision the hosts:
Downstream cluster provisioning with Directed network provisioning (single-node) (Chapter 51, Downstream cluster provisioning with Directed network provisioning (single-node))
Downstream cluster provisioning with Directed network provisioning (multi-node) (Chapter 52, Downstream cluster provisioning with Directed network provisioning (multi-node))
The Telco features covered in this section are the following:
DPDK and VFs creation
SR-IOV and VFs allocation to be used by the workloads
CPU isolation and performance tuning
Huge pages configuration
Kernel parameters tuning
The changes required to enable the Telco features shown above are all inside the RKE2ControlPlane block in the provision file capi-provisioning-example.yaml. The rest of the information inside the file capi-provisioning-example.yaml is the same as the information provided in the provisioning section (Chapter 51, Downstream cluster provisioning with Directed network provisioning (single-node)).
To make the process clear, the changes required on that block (RKE2ControlPlane) to enable the Telco features are the following:
The ignition file /var/lib/rancher/rke2/server/manifests/configmap-sriov-custom-auto.yaml, used to define the interfaces, drivers and the number of VFs to be created and exposed to the workloads.
The values inside the config map sriov-custom-auto-config are the only values to be replaced with real values:
${RESOURCE_NAME1}: The resource name to be used for the first PF interface (for example, sriov-resource-du1). It is appended to the prefix rancher.io to form the label used by the workloads (for example, rancher.io/sriov-resource-du1).
${SRIOV-NIC-NAME1}: The name of the first PF interface to be used (for example, eth0).
${PF_NAME1}: The name of the first physical function (PF) to be used. It can also be used to express more complex filters (for example, eth0#2-5).
${DRIVER_NAME1}: The driver name to be used for the first VF interface (for example, vfio-pci).
${NUM_VFS1}: The number of VFs to be created for the first PF interface (for example, 8).
The script /opt/sriov/sriov-auto-filler.sh, used as a translator between the high-level config map sriov-custom-auto-config and the sriovnetworknodepolicy, which contains the low-level hardware information. This script spares the user from having to know the hardware information in advance. No changes are required in this file, but it must be present to enable SR-IOV and create VFs.
The kernel arguments used to enable the Telco features are the following:
Parameter | Value | Description |
isolcpus | domain,nohz,managed_irq,1-30,33-62 | Isolates the cores 1-30 and 33-62. |
skew_tick | 1 | Allows the kernel to skew the timer interrupts across the isolated CPUs. |
nohz | on | Enables dynamic ticks, so the kernel stops the periodic timer tick on idle CPUs. |
nohz_full | 1-30,33-62 | Enables full dynticks on the isolated cores. This kernel boot parameter is the main interface to configure full dynticks along with CPU isolation. |
rcu_nocbs | 1-30,33-62 | Offloads the processing of RCU callbacks from the isolated cores. |
irqaffinity | 0,31,32,63 | Sets the default IRQ affinity so that interrupts are handled on the non-isolated cores. |
idle | poll | Minimizes the latency of exiting the idle state. |
iommu | pt | Enables IOMMU pass-through mode, which allows using vfio for the DPDK interfaces. |
intel_iommu | on | Enables the Intel IOMMU, which is required to use vfio for VFs. |
hugepagesz | 1G | Sets the size of huge pages to 1 G. |
hugepages | 40 | Number of huge pages of the size defined before. |
default_hugepagesz | 1G | Default huge page size. |
nowatchdog | | Disables the watchdog. |
nmi_watchdog | 0 | Disables the NMI watchdog. |
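The isolated cores (isolcpus, nohz_full, rcu_nocbs) and the housekeeping cores (irqaffinity) in the table above are complements of each other. The following is a minimal sketch of how the second list could be derived from the first. It assumes a host with 64 logical CPUs (inferred from the highest core number, 63, in the example values), and the function name non_isolated_cpus is hypothetical, not part of the product.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the product): print the CPUs in
# 0..(total-1) that are NOT part of the isolated list, comma-separated.
#   $1: total number of logical CPUs (assumed to be known, e.g. from nproc)
#   $2: isolated CPU list in kernel syntax, e.g. "1-30,33-62"
non_isolated_cpus() {
  awk -v total="$1" -v list="$2" 'BEGIN {
    n = split(list, parts, ",")
    for (i = 1; i <= n; i++) {
      # Each part is either a single CPU ("5") or a range ("1-30").
      m = split(parts[i], r, "-")
      lo = r[1] + 0
      hi = (m > 1 ? r[2] : r[1]) + 0
      for (c = lo; c <= hi; c++) isolated[c] = 1
    }
    out = ""
    for (c = 0; c < total; c++)
      if (!(c in isolated)) out = out (out == "" ? "" : ",") c
    print out
  }'
}

# Example values from the table above, assuming 64 logical CPUs.
non_isolated_cpus 64 "1-30,33-62"   # prints 0,31,32,63
```

The input list corresponds to the ${ISOLATED_CPU_CORES} placeholder and the output to the ${NON-ISOLATED_CPU_CORES} placeholder used in the kernel arguments of the provisioning file.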
The following systemd services are used:
rke2-preinstall.service to automatically replace the BAREMETALHOST_UUID and node-name during the provisioning process using the Ironic information.
cpu-partitioning.service to enable the isolation of the CPU cores (for example, 1-30,33-62).
performance-settings.service to enable the CPU performance tuning.
sriov-custom-auto-vfs.service to install the sriov Helm chart, wait until the custom resources are created and run /opt/sriov/sriov-auto-filler.sh to replace the values in the config map sriov-custom-auto-config and create the sriovnetworknodepolicy to be used by the workloads.
${RKE2_VERSION} is the version of RKE2 to be used. Replace this value with a real version (for example, v1.35.3+rke2r3).
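The placeholders described above can be filled in by hand or with a small script. Some of them (for example, ${SRIOV-NIC-NAME1}) contain hyphens and are therefore not valid shell variable names, so this sketch replaces them as literal text with sed rather than through shell variable expansion. The helper name fill_placeholder is hypothetical, and the sketch assumes the values contain neither "|" nor "&", which sed would treat specially.

```shell
#!/bin/sh
# Hypothetical helper: replace every literal ${NAME} in the text read from
# stdin with VALUE and write the result to stdout.
#   $1: placeholder name without "${" and "}", e.g. SRIOV-NIC-NAME1
#   $2: replacement value (assumed to contain neither "|" nor "&")
fill_placeholder() {
  name=$1
  value=$2
  # Build the sed pattern \${NAME}; the backslash keeps "$" literal.
  pattern=$(printf '\\${%s}' "$name")
  sed "s|$pattern|$value|g"
}

# Example: fill two placeholders of a template fragment.
cat <<'EOF' | fill_placeholder RESOURCE_NAME1 sriov-resource-du1 | fill_placeholder NUM_VFS1 8
"resourceName": "${RESOURCE_NAME1}",
"numVFsToCreate": ${NUM_VFS1}
EOF
```

The calls can be chained in the same way for the remaining placeholders, reading capi-provisioning-example.yaml as input and writing the result to a new file.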
With all these changes mentioned, the RKE2ControlPlane block in the capi-provisioning-example.yaml will look like the following:
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: RKE2ControlPlane
metadata:
  name: single-node-cluster
  namespace: default
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: Metal3MachineTemplate
    name: single-node-cluster-controlplane
  replicas: 1
  version: ${RKE2_VERSION}
  rolloutStrategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxSurge: 0
  serverConfig:
    cni: calico
    cniMultusEnable: true
  agentConfig:
    format: ignition
    additionalUserData:
      config: |
        variant: fcos
        version: 1.4.0
        storage:
          files:
            - path: /var/lib/rancher/rke2/server/manifests/configmap-sriov-custom-auto.yaml
              overwrite: true
              contents:
                inline: |
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: sriov-custom-auto-config
                    namespace: kube-system
                  data:
                    config.json: |
                      [
                         {
                           "resourceName": "${RESOURCE_NAME1}",
                           "interface": "${SRIOV-NIC-NAME1}",
                           "pfname": "${PF_NAME1}",
                           "driver": "${DRIVER_NAME1}",
                           "numVFsToCreate": ${NUM_VFS1}
                         },
                         {
                           "resourceName": "${RESOURCE_NAME2}",
                           "interface": "${SRIOV-NIC-NAME2}",
                           "pfname": "${PF_NAME2}",
                           "driver": "${DRIVER_NAME2}",
                           "numVFsToCreate": ${NUM_VFS2}
                         }
                      ]
              mode: 0644
              user:
                name: root
              group:
                name: root
            - path: /var/lib/rancher/rke2/server/manifests/sriov-crd.yaml
              overwrite: true
              contents:
                inline: |
                  apiVersion: helm.cattle.io/v1
                  kind: HelmChart
                  metadata:
                    name: sriov-crd
                    namespace: kube-system
                  spec:
                    chart: oci://registry.suse.com/edge/charts/sriov-crd
                    targetNamespace: sriov-network-operator
                    version: 306.0.4+up1.6.0
                    createNamespace: true
            - path: /var/lib/rancher/rke2/server/manifests/sriov-network-operator.yaml
              overwrite: true
              contents:
                inline: |
                  apiVersion: helm.cattle.io/v1
                  kind: HelmChart
                  metadata:
                    name: sriov-network-operator
                    namespace: kube-system
                  spec:
                    chart: oci://registry.suse.com/edge/charts/sriov-network-operator
                    targetNamespace: sriov-network-operator
                    version: 306.0.4+up1.6.0
                    createNamespace: true
        kernel_arguments:
          should_exist:
            - intel_iommu=on
            - iommu=pt
            - idle=poll
            - mce=off
            - hugepagesz=1G hugepages=40
            - hugepagesz=2M hugepages=0
            - default_hugepagesz=1G
            - irqaffinity=${NON-ISOLATED_CPU_CORES}
            - isolcpus=domain,nohz,managed_irq,${ISOLATED_CPU_CORES}
            - nohz_full=${ISOLATED_CPU_CORES}
            - rcu_nocbs=${ISOLATED_CPU_CORES}
            - rcu_nocb_poll
            - nosoftlockup
            - nowatchdog
            - nohz=on
            - nmi_watchdog=0
            - skew_tick=1
            - quiet
        systemd:
          units:
            - name: rke2-preinstall.service
              enabled: true
              contents: |
                [Unit]
                Description=rke2-preinstall
                Wants=network-online.target
                Before=rke2-install.service
                ConditionPathExists=!/run/cluster-api/bootstrap-success.complete
                [Service]
                Type=oneshot
                User=root
                ExecStartPre=/bin/sh -c "mount -L config-2 /mnt"
                ExecStart=/bin/sh -c "sed -i \"s/BAREMETALHOST_UUID/$(jq -r .uuid /mnt/openstack/latest/meta_data.json)/\" /etc/rancher/rke2/config.yaml"
                ExecStart=/bin/sh -c "echo \"node-name: $(jq -r .name /mnt/openstack/latest/meta_data.json)\" >> /etc/rancher/rke2/config.yaml"
                ExecStartPost=/bin/sh -c "umount /mnt"
                [Install]
                WantedBy=multi-user.target
            # rke2-traefik-deployment.service unit to be removed once "traefik" being the default ingress controller (starting with RKE2 v1.36)
            - name: rke2-traefik-deployment.service
              enabled: true
              contents: |
                [Unit]
                Description=rke2-traefik-deployment
                Wants=rke2-preinstall.service
                Before=rke2-install.service
                ConditionPathExists=!/run/cluster-api/bootstrap-success.complete
                [Service]
                Type=oneshot
                User=root
                ExecStart=/bin/sh -c "echo \"ingress-controller: traefik\" >> /etc/rancher/rke2/config.yaml"
                [Install]
                WantedBy=multi-user.target
            - name: cpu-partitioning.service
              enabled: true
              contents: |
                [Unit]
                Description=cpu-partitioning
                Wants=network-online.target
                After=network.target network-online.target
                [Service]
                Type=oneshot
                User=root
                ExecStart=/bin/sh -c "echo isolated_cores=${ISOLATED_CPU_CORES} > /etc/tuned/cpu-partitioning-variables.conf"
                ExecStartPost=/bin/sh -c "tuned-adm profile cpu-partitioning"
                ExecStartPost=/bin/sh -c "systemctl enable tuned.service"
                [Install]
                WantedBy=multi-user.target
            - name: performance-settings.service
              enabled: true
              contents: |
                [Unit]
                Description=performance-settings
                Wants=network-online.target
                After=network.target network-online.target cpu-partitioning.service
                [Service]
                Type=oneshot
                User=root
                ExecStart=/bin/sh -c "/opt/performance-settings/performance-settings.sh"
                [Install]
                WantedBy=multi-user.target
            - name: sriov-custom-auto-vfs.service
              enabled: true
              contents: |
                [Unit]
                Description=SRIOV Custom Auto VF Creation
                Wants=network-online.target  rke2-server.target
                After=network.target network-online.target rke2-server.target
                [Service]
                User=root
                Type=forking
                TimeoutStartSec=900
                ExecStart=/bin/sh -c "while ! /var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml wait --for condition=ready nodes --all ; do sleep 2 ; done"
                ExecStartPost=/bin/sh -c "while [ $(/var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml get sriovnetworknodestates.sriovnetwork.openshift.io --ignore-not-found --no-headers -A | wc -l) -eq 0 ]; do sleep 1; done"
                ExecStartPost=/bin/sh -c "/opt/sriov/sriov-auto-filler.sh"
                RemainAfterExit=yes
                KillMode=process
                [Install]
                WantedBy=multi-user.target
    kubelet:
      extraArgs:
        - provider-id=metal3://BAREMETALHOST_UUID
    nodeName: "localhost.localdomain"
Once the file is created by joining the previous blocks, the following command must be executed in the management cluster to start provisioning the new downstream cluster using the Telco features:
$ kubectl apply -f capi-provisioning-example.yaml
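After the downstream node has been provisioned and rebooted, it can be useful to confirm that the kernel arguments requested in the provisioning file are actually active. The following is a minimal sketch of such a check, meant to be run on the downstream node. The function name missing_kernel_args is hypothetical, and the list of arguments passed to it is only a subset chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print each expected kernel argument that is absent
# from the given kernel command line; print nothing if all are present.
#   $1: kernel command line (for example, the content of /proc/cmdline)
#   $2...: expected arguments, matched as whole space-separated words
missing_kernel_args() {
  cmdline=" $1 "
  shift
  for arg in "$@"; do
    case "$cmdline" in
      *" $arg "*) ;;          # argument found as a whole word
      *) echo "$arg" ;;       # argument missing
    esac
  done
}

# Example on a provisioned node (subset of the arguments shown earlier).
if [ -r /proc/cmdline ]; then
  missing_kernel_args "$(cat /proc/cmdline)" \
    intel_iommu=on iommu=pt default_hugepagesz=1G nohz=on nowatchdog
fi
```

An empty output means that every listed argument is present. Matching whole words avoids false positives such as iommu=pt being found inside a longer argument.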