49 Troubleshooting Directed-network provisioning
Directed-network provisioning scenarios use Metal3 and CAPI components to provision the downstream cluster, plus EIB to create the OS image. Issues can occur when the host boots for the first time, or during the inspection or provisioning processes.
Old firmware: Verify that all the firmware on the physical hosts being used is up to date. This includes the BMC firmware, as Metal3 sometimes requires specific or updated versions.
Provisioning failed with SSL errors: If the web server serving the images uses HTTPS, Metal3 needs to be configured to inject and trust the certificate in the IPA image. See Kubernetes folder (Section 40.3.4, “Kubernetes folder”) on how to include a ca-additional.crt file in the Metal3 chart.
Certificate issues when booting the hosts with IPA: Some server vendors verify the SSL connection when attaching virtual-media ISO images to the BMC, which can cause a problem because the certificates generated for the Metal3 deployment are self-signed. The host may boot but then drop to a UEFI shell. See Disabling TLS for virtualmedia ISO attachment (Section 1.6.2, “Disabling TLS for virtualmedia ISO attachment”) on how to fix it.
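In both cases it can help to inspect which certificate an endpoint actually presents. A quick check with openssl, where <host> and <port> are placeholders for the image web server or the Metal3 endpoint:
openssl s_client -connect <host>:<port> -showcerts </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates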
Wrong name or label reference: If the cluster references a node by the wrong name or label, the cluster shows as deployed but the BMH remains in the "Available" state. Double-check the references in the objects involved with the BMHs.
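For example, to compare the labels on the BMHs with the hostSelector used by the Metal3MachineTemplate (names are placeholders):
kubectl get bmh -n <namespace> --show-labels
kubectl get metal3machinetemplate -n <namespace> <template_name> -o jsonpath='{.spec.template.spec.hostSelector}'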
BMC communication issues: Ensure the Metal3 pods running on the management cluster can reach the BMCs of the hosts being provisioned (the BMC network is usually very restricted).
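A basic reachability check from inside one of the Metal3 pods, assuming curl is available in the container image and <bmc_address> is a placeholder:
kubectl exec -n metal3-system -it $(kubectl get pod -n metal3-system -l app.kubernetes.io/component=ironic -o name | head -n1) -c ironic -- curl -sk https://<bmc_address>/redfish/v1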
Incorrect bare metal host state: The BMH object moves through different states (inspecting, preparing, provisioned, etc.) during its lifetime (see the Metal3 state machine documentation). If an incorrect state is detected, check the status field of the BMH object, as it contains more information:
kubectl get bmh <name> -o jsonpath='{.status}' | jq .
Host not being deprovisioned: If deprovisioning a host fails, removal can be attempted after adding the "detached" annotation to the BMH object:
kubectl annotate bmh/<BMH> baremetalhost.metal3.io/detached=""
Image errors: Verify that the image built with EIB for the downstream cluster is available, has a correct checksum, and is neither too large to decompress nor too large for the disk.
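For example, to check availability, checksum and uncompressed size (URLs and file names are placeholders; the checksum must match the one referenced in the Metal3MachineTemplate):
curl -I http://<webserver>/<image>.raw.xz
sha256sum <image>.raw.xz
xz --list <image>.raw.xz   # shows the uncompressed size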
Disk size mismatch: By default, the root filesystem does not expand to fill the whole disk. As explained in the Growfs script (Section 1.3.4.1.2, “Growfs script”) section, a growfs script needs to be included in the image built with EIB for the downstream cluster hosts.
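A minimal sketch of such a script, assuming a single-disk layout with the root filesystem on the last partition and growpart available in the image (refer to the Growfs script section for the actual version):
#!/bin/bash
# Sketch: grow the partition backing / and resize the filesystem to fill it.
set -euo pipefail
root_dev="$(findmnt --noheadings --output SOURCE /)"
root_dev="${root_dev%%\[*}"                                      # strip btrfs subvolume suffix, e.g. [/@/.snapshots/...]
disk="/dev/$(lsblk --noheadings --output PKNAME "${root_dev}")"  # parent disk, e.g. /dev/sda
part_num="$(cat "/sys/class/block/$(basename "${root_dev}")/partition")"
growpart "${disk}" "${part_num}" || true                         # no-op if already grown
btrfs filesystem resize max / 2>/dev/null || resize2fs "${root_dev}"  # grow btrfs or ext4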
Cleaning process stuck: The cleaning process is retried several times. If cleaning is no longer possible due to a problem with the host, disable it first by setting the automatedCleanMode field to disabled on the BMH object, as shown below.
Warning: It is not recommended to manually remove the finalizer when the cleaning process is taking longer than desired or is failing. Doing so removes the host record from Kubernetes but leaves it in Ironic. The currently running action continues in the background, and an attempt to add the host again may fail because of the conflict.
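For example, with kubectl patch (name and namespace are placeholders):
kubectl patch bmh <name> -n <namespace> --type merge -p '{"spec":{"automatedCleanMode":"disabled"}}'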
Metal3/Rancher Turtles/CAPI pods issues: The deployment flow for all the required components is:
The Rancher Turtles controller deploys the CAPI operator controller.
The CAPI operator controller then deploys the provider controllers (CAPI core, CAPM3 and RKE2 controlplane/bootstrap).
Verify that all the pods are running correctly; otherwise, check their logs.
Metal3 logs: Check the logs of the different pods:
kubectl logs -n metal3-system -l app.kubernetes.io/component=baremetal-operator
kubectl logs -n metal3-system -l app.kubernetes.io/component=ironic
Note: The metal3-ironic pod contains at least four different containers (ironic-httpd, ironic-log-watch, ironic and the ironic-ipa-downloader init container) in the same pod. Use the -c flag with kubectl logs to see the logs of each container.
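For example, to view the ironic-log-watch container:
kubectl logs -n metal3-system -l app.kubernetes.io/component=ironic -c ironic-log-watch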
Note: The ironic-log-watch container exposes console logs from the hosts after inspection/provisioning, provided network connectivity allows sending these logs back to the management cluster. This can be useful when there are provisioning errors but you do not have direct access to the BMC console logs.
Rancher Turtles logs: Check the logs of the different pods:
kubectl logs -n rancher-turtles-system -l control-plane=controller-manager
kubectl logs -n rancher-turtles-system -l app.kubernetes.io/name=cluster-api-operator
kubectl logs -n rke2-bootstrap-system -l cluster.x-k8s.io/provider=bootstrap-rke2
kubectl logs -n rke2-control-plane-system -l cluster.x-k8s.io/provider=control-plane-rke2
kubectl logs -n capi-system -l cluster.x-k8s.io/provider=cluster-api
kubectl logs -n capm3-system -l cluster.x-k8s.io/provider=infrastructure-metal3
BMC logs: BMCs usually provide a UI where most of the interaction can be done, including a "logs" section that can be checked for potential issues (the image not being reachable, hardware failures, etc.).
Console logs: Connect to the BMC console (via the BMC web UI, serial, etc.) and check the logs being written for errors.
Check BareMetalHost status:
kubectl get bmh -A
shows the current state. Look for provisioning, ready, error and registering.
kubectl describe bmh -n <namespace> <bmh_name>
provides detailed events and conditions explaining why a BMH might be stuck.
Test Redfish connectivity:
Use curl from the Metal3 control plane to test connectivity to the BMCs via Redfish.
Ensure the correct BMC credentials are provided in the BareMetalHost Secret definition.
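A minimal sketch, assuming the BMC exposes the standard Redfish service root over HTTPS (<bmc_address>, <user> and <password> are placeholders):
curl -sk https://<bmc_address>/redfish/v1
curl -sk -u <user>:<password> https://<bmc_address>/redfish/v1/Systems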
Verify Turtles/CAPI/Metal3 pod status: Ensure the containers on the management cluster are up and running:
kubectl get pods -n metal3-system
kubectl get pods -n rancher-turtles-system
(also check the capi-system, capm3-system, rke2-bootstrap-system and rke2-control-plane-system namespaces).
Verify the Ironic endpoint is reachable from the host being provisioned: The host being provisioned needs to be able to reach the Ironic endpoint to report back to Metal3. Check the IP with:
kubectl get svc -n metal3-system metal3-metal3-ironic
and try to reach it via curl or nc.
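For example, from a host on the provisioning network (the Ironic API usually listens on port 6385; adjust if your deployment differs):
nc -zv <ironic_ip> 6385
curl -sk https://<ironic_ip>:6385/v1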
Verify the IPA image is reachable from the BMC: IPA is served by the Ironic endpoint and needs to be reachable from the BMC, as it is attached as a virtual CD.
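A quick check from a machine on the BMC network, where <iso_url> is the virtual-media ISO URL shown in the BMC's attached-media view or in the BMH events:
curl -kI <iso_url>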
Verify the OS image is reachable from the host being provisioned: The image used to provision the host needs to be reachable from the host itself (while it runs IPA), as it is downloaded temporarily and written to disk.
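For example, once logged into the IPA ramdisk (see the autologin/SSH options below), a reachability check with a placeholder URL:
curl -I http://<webserver>/<image>.raw.xz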
Examine Metal3 component logs: See above.
Retrigger BMH inspection: If an inspection failed or the hardware of an available host changed, a new inspection process can be triggered by annotating the BMH object with the inspect.metal3.io annotation.
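For example (name and namespace are placeholders):
kubectl annotate bmh/<name> -n <namespace> inspect.metal3.io=""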
See the Metal3 Controlling inspection guide for more information.
Bare metal IPA console: To troubleshoot IPA issues, a couple of alternatives exist:
Enable "autologin". This automatically logs in the root user when connecting to the IPA console.
Warning: This is only for debugging purposes, as it gives full access to the host.
To enable autologin, the Metal3 Helm global.ironicKernelParams value should look like:
console=ttyS0 suse.autologin=ttyS0
(depending on the console, ttyS0 can be changed). Then a redeployment of the Metal3 chart should be performed, as sketched below. Note that ttyS0 is only an example and should match the actual terminal; on bare metal it may often be tty1. This can be verified by looking at the console output from the IPA ramdisk on boot, where /etc/issue prints the console name.
Another way is to change the IRONIC_KERNEL_PARAMS parameter in the ironic-bmo ConfigMap in the metal3-system namespace. This can be easier as it can be done via kubectl edit, but it will be overwritten when the chart is updated. Then the Metal3 pod needs to be restarted with:
kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic
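A redeployment sketch for the Helm-values alternative above, assuming the chart was installed as release metal3 from a repository named suse-edge (release, repository and namespace are placeholders for your environment):
helm upgrade metal3 suse-edge/metal3 -n metal3-system --reuse-values \
  --set global.ironicKernelParams="console=ttyS0 suse.autologin=ttyS0"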
Inject an SSH key for the root user in the IPA image.
Warning: This is only for debugging purposes, as it gives full access to the host.
To inject the SSH key for the root user, the Metal3 Helm debug.ironicRamdiskSshKey value should be used. Then a redeployment of the Metal3 chart should be performed, as sketched below.
Another way is to change the IRONIC_RAMDISK_SSH_KEY parameter in the ironic-bmo ConfigMap in the metal3-system namespace. This can be easier as it can be done via kubectl edit, but it will be overwritten when the chart is updated. Then the Metal3 pod needs to be restarted with:
kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic
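Analogous to the autologin example, a hedged Helm sketch (release and repository names are placeholders; the key shown is an example):
helm upgrade metal3 suse-edge/metal3 -n metal3-system --reuse-values \
  --set debug.ironicRamdiskSshKey="ssh-ed25519 AAAA... user@example.com"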