Availability Zones in OpenStack and OpenShift (Part 2)

After seeing a few too many availability zone-related issues popping up in OpenShift clusters of late, I’ve decided it might make sense to document the situation with OpenStack AZs on OpenShift (and, by extension, Kubernetes). This is the second part of two. The first part provided some background on what AZs are and how you can configure them, while this part will examine how AZs affect OpenShift and Kubernetes components such as the OpenStack Machine API Provider, the OpenStack Cluster API Provider, and the Cinder and Manila CSI drivers.

The line-up

There are a couple of OpenStack-specific components we need to be aware of in a typical OpenShift-on-OpenStack (a.k.a. ShiftStack) deployment. My former colleague Michał Dulko provided a good overview of many of these on his blog, but to (re-)summarise, you’ve got:

  • Cloud Provider OpenStack (CPO)
  • Machine API Provider OpenStack (MAPO)
  • Cluster API Provider OpenStack (CAPO)

In addition to these three components, there are two others to consider:

  • Cinder CSI Driver
  • Manila CSI Driver

Today we’re going to take a look at three of these five components - CPO, MAPO, and the Cinder CSI Driver - and explore how availability zones (both Compute and Block Storage) impact them.

Cloud Provider OpenStack

In contrast to a favoured programming language of mine 🐍, Kubernetes operates on very much a "batteries not included" model. It does not provide out-of-the-box support for such important things as block storage, networking, or ingress. To resolve this, you normally run Kubernetes on top of another platform - be that AWS, vSphere, GCE, or in our case OpenStack - and add additional components that provide integration between your Kubernetes cluster and said platform and its APIs. The cloud-provider interface provides one part of the integration puzzle here, managing the lifecycle of Nodes (including their removal from the cluster if the underlying instance is deleted), Services of type LoadBalancer, and routes.
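
For example, when you create a Service of type LoadBalancer, it’s the OpenStack cloud provider that provisions a load balancer (Octavia, in modern deployments) behind the scenes. A minimal sketch, with placeholder names:

apiVersion: v1
kind: Service
metadata:
  name: demo-lb  # hypothetical name, for illustration only
spec:
  type: LoadBalancer  # handled by the OpenStack cloud provider
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080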

Cloud Provider OpenStack ships the OpenStack Cloud Controller Manager (CCM), which uses Compute AZ information for the underlying instance to label the corresponding Node. It sets two labels, the topology.kubernetes.io/zone label and the legacy failure-domain.beta.kubernetes.io/zone label. You can see this if you retrieve the labels for a Node:

❯ oc get Node -o jsonpath='{.items[*].metadata.labels}' | jq
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "ci.m1.xlarge",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "regionOne",
  "failure-domain.beta.kubernetes.io/zone": "nova",  # <--- !!! here !!!
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "stephenfin-5ps6d-master-0",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "",
  "node-role.kubernetes.io/master": "",
  "node.kubernetes.io/instance-type": "ci.m1.xlarge",
  "node.openshift.io/os_id": "rhcos",
  "topology.cinder.csi.openstack.org/zone": "nova",
  "topology.kubernetes.io/region": "regionOne",
  "topology.kubernetes.io/zone": "nova"    # <--- !!! and here !!!
}

To fetch the AZ information, the OpenStack CCM queries the Nova API - you can see that happening here - and publishes it via the GetZoneByProviderID and GetZoneByName functions, which form part of the cloud-provider API.

Because the labels are defined on Nodes, they are useful for controlling the scheduling of pods, allowing users to spread pods across multiple AZs. This is discussed in more detail in the Kubernetes docs, in Assigning Pods to Nodes and Pod Topology Spread Constraints for example. They’re also used for other topology-related features, such as Topology Aware Routing.
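
To make that concrete, here’s a minimal sketch of a Deployment that uses the zone label to spread its replicas evenly across AZs (the names and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: az-spread-demo  # hypothetical name, for illustration only
spec:
  replicas: 6
  selector:
    matchLabels:
      app: az-spread-demo
  template:
    metadata:
      labels:
        app: az-spread-demo
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone  # the label set by the OpenStack CCM
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: az-spread-demo
      containers:
      - name: web
        image: registry.access.redhat.com/ubi9/ubi-minimal  # any image will do
        command: ["sleep", "infinity"]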

With regards to configuring these labels, there’s currently nothing to stop you modifying them in place. You shouldn’t do this, however, since it doesn’t change the AZ of the underlying instance and negates whatever advantage the topology feature provides. If you want to change the AZ of a Node then you’ll need to either migrate the underlying instance (an operation that comes with its own issues) or recreate it.
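
If the Node is backed by a Machine owned by a MachineSet (more on those in the next section), recreating it is usually just a matter of deleting the Machine and letting the MachineSet provision a replacement. A rough sketch, with a placeholder Machine name:

❯ oc adm drain stephenfin-5ps6d-worker-0-abcde --ignore-daemonsets --delete-emptydir-data  # evacuate workloads first
❯ oc delete machine stephenfin-5ps6d-worker-0-abcde -n openshift-machine-api  # the MachineSet creates a replacement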

Machine API Provider OpenStack

The Machine API Provider OpenStack, or MAPO, is a Machine API provider for the OpenStack platform. The Machine API allows you to scale your cluster up or down based on workload policies or other preferences and functions quite similarly to the Cluster API, albeit with a different API. You can create Machines manually, but it’s more common to instead create or modify MachineSets (for workers) or ControlPlaneMachineSets (for masters). You can find more information about the Machine API in the OpenShift documentation.
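
For example, scaling an existing MachineSet up or down is a one-liner (the MachineSet name here is a placeholder):

❯ oc get machinesets -n openshift-machine-api
❯ oc scale machineset stephenfin-5ps6d-worker-az0 -n openshift-machine-api --replicas=3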

Like OpenStack CCM, MAPO uses AZ information to label resources - this time setting the machine.openshift.io/zone label on Machine resources:

❯ oc get -A Machine -o jsonpath='{.items[*].metadata.labels}' | jq
{
  "machine.openshift.io/cluster-api-cluster": "stephenfin-5ps6d",
  "machine.openshift.io/cluster-api-machine-role": "master",
  "machine.openshift.io/cluster-api-machine-type": "master",
  "machine.openshift.io/instance-type": "ci.m1.xlarge",
  "machine.openshift.io/region": "regionOne",
  "machine.openshift.io/zone": "nova"  # <-- !!! here !!
}

Also like the OpenStack CCM, MAPO sources this AZ information from the Nova API, as you can see here (instanceStatus is a thin wrapper around a ServerExt resource used by Gophercloud). The labels are also editable but, again, editing them won’t actually change the AZ of the underlying instance and you’ll need to make changes elsewhere to do that. However, unlike with Nodes, you can configure the AZ of a new or existing Machine or MachineSet / ControlPlaneMachineSet as part of the object definition, and this change will be reflected in the labels, both of the Machine and of the Node. For example, to define an AZ when creating a new MachineSet:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <infrastructure_id>-<role>
  namespace: openshift-machine-api
  # ...
spec:
  # ...
  template:
    spec:
      providerSpec:
        value:
          # ...
          availabilityZone: nova-az0
          rootVolume:
            # ...
            availabilityZone: cinder-az0
  # ...

Alternatively, to define one for a ControlPlaneMachineSet:

apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  # ...
  template:
    # ...
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: OpenStack
        openstack:
        - availabilityZone: nova-az0
          rootVolume:
            availabilityZone: cinder-az0
        - availabilityZone: nova-az1
          rootVolume:
            availabilityZone: cinder-az1
        - availabilityZone: nova-az2
          rootVolume:
            availabilityZone: cinder-az2
  # ...
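
Once the Machines are up, a quick way to sanity check the result is to print the relevant zone labels as extra columns:

❯ oc get machines -n openshift-machine-api -L machine.openshift.io/zone
❯ oc get nodes -L topology.kubernetes.io/zone -L topology.cinder.csi.openstack.org/zone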

Cinder CSI Driver

The last component we’re going to look at here is the Cinder CSI Driver. The Container Storage Interface (CSI) defines a standardised way to expose arbitrary block and file storage systems to Kubernetes workloads, allowing us to plug in storage backends for various cloud platforms or networked storage solutions like NFS or SMB. As you might suspect, the Cinder CSI Driver allows us to plug in storage from the OpenStack Block Storage service, Cinder, and to create PersistentVolumes or PersistentVolumeClaims that correspond to Cinder volumes.

Once again, the Cinder CSI driver uses AZ information to label resources, and once again it’s the Nodes that get the resulting label. The Cinder CSI driver sets a single label on a node, the topology.cinder.csi.openstack.org/zone label:

❯ oc get Node -o jsonpath='{.items[*].metadata.labels}' | jq
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "ci.m1.xlarge",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "regionOne",
  "failure-domain.beta.kubernetes.io/zone": "nova",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "stephenfin-5ps6d-master-0",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "",
  "node-role.kubernetes.io/master": "",
  "node.kubernetes.io/instance-type": "ci.m1.xlarge",
  "node.openshift.io/os_id": "rhcos",
  "topology.cinder.csi.openstack.org/zone": "nova",    # <--- !!! here !!!
  "topology.kubernetes.io/region": "regionOne",
  "topology.kubernetes.io/zone": "nova"
}

Unlike the other two components we’ve talked about, though, the Cinder CSI Driver doesn’t fetch AZ information from the Nova API. Instead, it fetches it from the Metadata service, as you can see here. That NodeGetInfo function forms part of the CSI spec and is used by the kubelet, as detailed in the Kubernetes documentation.
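
If you want to see what the driver sees, you can query the metadata service from one of the nodes yourself. A rough sketch, using the node name from earlier as a placeholder and with the output trimmed down to the interesting field:

❯ oc debug node/stephenfin-5ps6d-master-0 -- chroot /host \
    curl -s http://169.254.169.254/openstack/latest/meta_data.json
{"availability_zone": "nova", ...}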

Now this use of the Metadata service is somewhat of an unusual choice: the Metadata service is provided by Nova and the AZ information it exposes is the AZ of the instance (derived from the host the instance is scheduled to). It’s therefore a Compute AZ, so why is it being used for a storage-related component? The answer is that we’re using it because there’s nothing else to use: as described in part 1, OpenStack doesn’t provide any mechanism to associate compute hosts with storage AZs beyond using the same naming scheme across both compute and block storage AZs. In my experience, this loose coupling is a very frequent source of bugs.

To use the topology feature (which we’ll go into more detail on shortly), you really need to have a common set of AZs for both the Compute and Block Storage services. Until OCP 4.15, the OpenStack Cinder CSI Driver Operator (i.e. the operator that deploys and manages the lifecycle of the Cinder CSI Driver in an OpenShift deployment) assumed this to be the case and always enabled the topology feature. This has since changed, but if you’re running an older release then you’re likely to encounter this issue if e.g. using multiple Nova AZs and a single Cinder AZ.

Making things even more complicated, migration of the Nova instance corresponding to a Node can result in the instance moving between AZs, assuming the instance in question was not created in a specific AZ initially. The CSI driver will detect this change and will attempt to update the labels on the Node, resulting in the following rather nasty error, which requires manual intervention to resolve:

Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: error updating Node object with CSI driver node info: error updating node: timed out waiting for the condition; caused by: detected topology value collision: driver reported "topology.cinder.csi.openstack.org/zone":"nova" but existing label is "topology.cinder.csi.openstack.org/zone":"nova-az3",}
Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: error updating Node object with CSI driver node info: error updating node: timed out waiting for the condition; caused by: detected topology value collision: driver reported "topology.cinder.csi.openstack.org/zone":"nova" but existing label is "topology.cinder.csi.openstack.org/zone":"nova-az3", restarting registration container.

However, assuming we know about these issues and work to avoid them, how does one actually use the topology features provided by the Cinder CSI driver? There are two ways. The first is to configure a topology-aware StorageClass and use it for a PersistentVolumeClaim, as seen in the examples for the Cinder CSI Driver. For example:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-standard
provisioner: cinder.csi.openstack.org
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.cinder.csi.openstack.org/zone
    values:
    - az1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc-with-az
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: topology-aware-standard

By using this storage class, we ensure that the Cinder volume created for the PVC will request an availability zone of az1. This is the standard mechanism promoted by Kubernetes and is also supported by other non-OpenStack CSI drivers that provide topology support.

The other mechanism is to specify a Cinder CSI driver-specific parameter, availability, when creating the storage class. This is effectively a legacy option that pre-dates topology support in CSI but it’s still available:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: legacy-az
provisioner: cinder.csi.openstack.org
volumeBindingMode: WaitForFirstConsumer
parameters:
  availability: az1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc-with-az
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: legacy-az
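
Both StorageClasses above use volumeBindingMode: WaitForFirstConsumer, which means the Cinder volume isn’t actually created until a pod consumes the claim. A minimal sketch of such a pod (the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod-with-az  # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal  # any image will do
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-pvc-with-az  # the PVC defined above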

As with the labels the OpenStack CCM sets on Nodes, there’s currently nothing to stop you modifying these labels in place, but you don’t really want to do this since the Cinder CSI driver effectively owns them. The only time you may wish to do so is if you’ve migrated your Nova instance and hit the issue described above. In this case, you can opt to manually relabel the Node, taking care to drain any workload from it first to avoid topology mismatches.
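
A rough sketch of that manual relabelling, assuming the driver now reports nova (as in the log above) and using the node name from earlier purely as a placeholder:

❯ oc adm cordon stephenfin-5ps6d-master-0
❯ oc adm drain stephenfin-5ps6d-master-0 --ignore-daemonsets --delete-emptydir-data
❯ oc label node stephenfin-5ps6d-master-0 \
    topology.cinder.csi.openstack.org/zone=nova --overwrite  # match the AZ reported by the driver
❯ oc adm uncordon stephenfin-5ps6d-master-0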

Wrap up

And so concludes this two-part series looking at Availability Zones in OpenStack and OpenShift. As you’ve hopefully ascertained, they can be a very useful feature, particularly in larger deployments, but there are more than a few potential banana skins to be aware of when you start using them for storage. By way of recommendations, I would suggest either using a single common AZ for your deployment (you can stick to the default of nova) or a common set of AZs across both the compute and block storage hosts (e.g. in a two-AZ deployment, you could use az0 and az1 for both compute and block storage AZs, rather than nova-az0 and nova-az1 for compute AZs and cinder-az0 and cinder-az1 for block storage AZs). If you insist on sticking with divergent sets of AZs, you should disable the topology feature and rely on the legacy availability parameter of the StorageClass instead.

I hope this has been useful to someone. If you spot any mistakes or identify things that I should have covered but didn’t, feel free to send me an email and I’ll try to get things sorted.
