Debugging Failed OpenShift-on-OpenStack Deployments

I deploy OpenShift-on-OpenStack quite regularly these days. Sometimes these deployments fail, and the most common failure I see is a timeout during bootstrapping.

$ openshift-install --log-level debug create cluster
DEBUG OpenShift Installer 4.15.10
DEBUG Built from commit 24a827900e76d8f9c79122307415b47a4921bbd7
DEBUG Fetching Metadata...
...
DEBUG Reusing previously-fetched Install Config
INFO Skipping VM console logs gather: no gather methods registered for "openstack"
INFO Pulling debug logs from the bootstrap machine
DEBUG Using SSH_AUTH_SOCK /run/user/1000/keyring/ssh to connect to an existing agent
ERROR Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 10.0.212.9:22: connect: connection timed out
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.stephenfin.shiftstack-demo.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 10.0.214.50:6443: i/o timeout
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

There are a couple of tools you can use to work out what went wrong. The first is the bootstrap server's console, which will highlight the more egregious issues with your deployment. You can get a URL for it with:

$ openstack console url show stephenfin-5ps6d-bootstrap  # replace with your own bootstrap server's name
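If you would rather read the console output in your terminal than open a URL, the console log can also be dumped as text. The following is a minimal sketch: the `console_tail` helper name and the default of 100 lines are my own choices for illustration, and it assumes your cloud has console logging enabled.

```shell
# Print the tail of a server's console log.
# usage: console_tail SERVER [LINES]
# (console_tail is a hypothetical helper name; 100 is an arbitrary default)
console_tail() {
    local server="$1"
    local lines="${2:-100}"
    openstack console log show --lines "$lines" "$server"
}

# Example (replace with your own bootstrap server's name):
# console_tail stephenfin-5ps6d-bootstrap 50
```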

If this doesn’t show anything weird then the next step is to log in to the server and check the status of the bootkube service. As is custom with OpenStack, to SSH into a machine you need (a) a floating IP and (b) a security group (or, more accurately, a security group rule) that allows SSH access. The Installer automatically assigns a floating IP to the bootstrap machine, so (a) is taken care of. That leaves (b). You likely already have an “allow SSH” security group lying around and, if so, you can use that now:

$ openstack server add security group stephenfin-5ps6d-bootstrap allow_ssh  # replace with your own server, SG names
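If you don't have such a security group yet, creating one takes two commands: one for the group and one for the rule. This is a rough sketch rather than a recommendation. The `create_ssh_sg` helper name is hypothetical, and the `0.0.0.0/0` default allows SSH from anywhere, which is an assumption you will probably want to tighten.

```shell
# Create a security group that allows inbound SSH (TCP port 22).
# usage: create_ssh_sg [NAME] [REMOTE_CIDR]
# (create_ssh_sg is a hypothetical helper; both defaults are illustrative)
create_ssh_sg() {
    local name="${1:-allow_ssh}"
    local cidr="${2:-0.0.0.0/0}"
    openstack security group create --description "Allow inbound SSH" "$name" &&
    openstack security group rule create --ingress --protocol tcp \
        --dst-port 22 --remote-ip "$cidr" "$name"
}

# Example: restrict SSH to a single (made-up) source network.
# create_ssh_sg allow_ssh 192.0.2.0/24
```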

Once you’ve allowed SSH traffic you can SSH into the machine.

$ openstack server ssh stephenfin-5ps6d-bootstrap -- -l core
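With SSH traffic allowed, the log gathering that failed in the installer output at the top of this post should also be able to connect. As a sketch, run from the workstation where you started the install rather than from the bootstrap machine: the `gather_bootstrap_logs` wrapper is a hypothetical name, and defaulting the install directory to the current directory is my assumption.

```shell
# Re-run bootstrap log gathering once the bootstrap machine accepts SSH.
# usage: gather_bootstrap_logs [INSTALL_DIR]
# (hypothetical wrapper; INSTALL_DIR is wherever your install state lives)
gather_bootstrap_logs() {
    local install_dir="${1:-.}"
    openshift-install --log-level debug gather bootstrap --dir "$install_dir"
}

# Example:
# gather_bootstrap_logs ~/clusters/stephenfin
```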

Once logged in to the bootstrap machine, you can follow the directions given in the MOTD and check the bootkube service first:

$ journalctl -b -f -u release-image.service -u bootkube.service

In my case, the issue appeared to be a lack of access to the master nodes, as neither the API nor the API-Int URL was reachable:

s the base image from which all OpenShift Container Platform images inherit.)
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Check if API and API-Int URLs are reachable during bootstrap
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Checking if api.stephenfin.shiftstack-demo.com of type API_URL reachable
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Unable to reach API_URL's https endpoint at https://api.stephenfin.shiftstack-demo.com:6443/version
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Unable to validate. https://api.stephenfin.shiftstack-demo.com:6443/version is currently unreachable.
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Checking if api-int.stephenfin.shiftstack-demo.com of type API_INT_URL reachable
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Unable to reach API_INT_URL's https endpoint at https://api-int.stephenfin.shiftstack-demo.com:6443/version
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: Unable to validate. https://api-int.stephenfin.shiftstack-demo.com:6443/version is currently unreachable.
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap bootkube.sh[2449]: bootkube.service complete
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap systemd[1]: bootkube.service: Deactivated successfully.
Apr 22 14:01:09 stephenfin-5ps6d-bootstrap systemd[1]: bootkube.service: Consumed 1min 2.337s CPU time.
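To confirm this by hand, you can probe the same URLs with curl while still logged in to the bootstrap machine. This is a minimal sketch and not necessarily how bootkube itself performs the check. The `check_endpoint` helper name and the 5 second timeout are my own choices, and TLS verification is skipped on the assumption that the cluster's CA isn't trusted on the machine you run it from.

```shell
# Report whether a URL accepts connections at all.
# Any HTTP response (even an error status) counts as reachable, because
# curl only exits non-zero here on connection, DNS or TLS failures.
# (check_endpoint is a hypothetical helper name)
check_endpoint() {
    local url="$1"
    if curl --insecure --silent --output /dev/null --connect-timeout 5 "$url"; then
        echo "reachable: $url"
    else
        echo "unreachable: $url"
    fi
}

# Example, using the URLs from the log above:
# check_endpoint https://api.stephenfin.shiftstack-demo.com:6443/version
# check_endpoint https://api-int.stephenfin.shiftstack-demo.com:6443/version
```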

The same steps apply for debugging issues with master or worker nodes: add a floating IP, allow SSH access, then SSH into the machine.

$ openstack server add floating ip stephenfin-5ps6d-master-0 10.0.214.101
$ openstack server add security group stephenfin-5ps6d-master-0 allow_ssh
$ openstack server ssh stephenfin-5ps6d-master-0 -- -l core
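Since this access is only needed while debugging, it's worth removing it again afterwards. Here is a small sketch of a matching pair of helpers. The `debug_access_add` and `debug_access_remove` names are hypothetical, the security group defaults to the `allow_ssh` name used above, and the floating IP is assumed to already exist in your project.

```shell
# Grant temporary SSH access to a node: attach a floating IP and an
# SSH security group.
# usage: debug_access_add SERVER FLOATING_IP [SECURITY_GROUP]
debug_access_add() {
    local server="$1"
    local fip="$2"
    local sg="${3:-allow_ssh}"
    openstack server add floating ip "$server" "$fip" &&
    openstack server add security group "$server" "$sg"
}

# Undo debug_access_add once you are done debugging.
# usage: debug_access_remove SERVER FLOATING_IP [SECURITY_GROUP]
debug_access_remove() {
    local server="$1"
    local fip="$2"
    local sg="${3:-allow_ssh}"
    openstack server remove security group "$server" "$sg"
    openstack server remove floating ip "$server" "$fip"
}

# Example:
# debug_access_add stephenfin-5ps6d-master-0 10.0.214.101
# openstack server ssh stephenfin-5ps6d-master-0 -- -l core
# debug_access_remove stephenfin-5ps6d-master-0 10.0.214.101
```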