After seeing a few too many availability zone-related issues popping up in OpenShift clusters of late, I’ve decided it might make sense to document the situation with OpenStack AZs on OpenShift (and, by extension, Kubernetes). This is the second part of two. The first part provided some background on what AZs are and how you can configure them, while this part will examine how AZs affect OpenShift and Kubernetes components such as the OpenStack Machine API Provider, the OpenStack Cluster API Provider, and the Cinder and Manila CSI drivers.
The line-up
There are a couple of OpenStack-specific components we need to be aware of in a typical OpenShift-on-OpenStack (a.k.a. ShiftStack) deployment. My former colleague Michał Dulko provided a good overview of many of these on his blog, but to (re-)summarise, you’ve got:
- Cloud Provider OpenStack (CPO)
- Machine API Provider OpenStack (MAPO)
- Cluster API Provider OpenStack (CAPO)
In addition to these three components, there are two others to consider:
- Cinder CSI Driver
- Manila CSI Driver
Today we’re going to take a look at three of these five components - CPO, MAPO, and the Cinder CSI Driver - and explore how availability zones - both Compute and Block Storage - impact them.
Cloud Provider OpenStack
In contrast to a favoured programming language of mine 🐍, Kubernetes operates on very much a batteries-not-included model. It does not provide out-of-the-box support for such important things as block storage, networking, or ingress. To resolve this, you normally run Kubernetes on top of another platform - be that AWS, vSphere, GCE, or in our case OpenStack - and add additional components that provide integration between your Kubernetes cluster and said platform and its APIs. The cloud-provider interface provides one part of the integration puzzle here, managing the lifecycle of Nodes (including their removal from the cluster if the underlying instance is deleted), Services of type LoadBalancer, and routes.
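For instance, when you create a Service like the one below, it's this integration that talks to the OpenStack APIs (typically Octavia) to provision the load balancer that backs it. This is just an illustrative sketch; the names here are made up.

apiVersion: v1
kind: Service
metadata:
  name: demo-lb          # illustrative name
spec:
  type: LoadBalancer     # handled by the cloud provider integration
  selector:
    app: demo
  ports:
  - port: 80             # port exposed on the load balancer
    targetPort: 8080     # port the backing pods listen on
    protocol: TCP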
The OpenStack CCM (the cloud controller manager shipped as part of Cloud Provider OpenStack) uses Compute AZ information for the underlying instance to label the corresponding Node. It sets two labels: the topology.kubernetes.io/zone label and the legacy failure-domain.beta.kubernetes.io/zone label. You can see this if you retrieve the labels for a Node:
❯ oc get Node -o jsonpath='{.items[*].metadata.labels}' | jq
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "ci.m1.xlarge",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "regionOne",
  "failure-domain.beta.kubernetes.io/zone": "nova", # <--- !!! here !!!
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "stephenfin-5ps6d-master-0",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "",
  "node-role.kubernetes.io/master": "",
  "node.kubernetes.io/instance-type": "ci.m1.xlarge",
  "node.openshift.io/os_id": "rhcos",
  "topology.cinder.csi.openstack.org/zone": "nova",
  "topology.kubernetes.io/region": "regionOne",
  "topology.kubernetes.io/zone": "nova" # <--- !!! and here !!!
}
To fetch the AZ information, OpenStack CCM queries the Nova API (you can see that happening here) and publishes this information via the GetZoneByProviderID and GetZoneByName functions, which form part of the cloud-provider API.
Because the labels are defined on Nodes, they are useful for controlling the scheduling of pods, allowing users to spread pods across multiple AZs. This is discussed in more detail in the Kubernetes docs, in Assigning Pods to Nodes and Pod Topology Spread Constraints for example. They're also used for other topology-related features, such as Topology Aware Routing.
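As a quick, purely illustrative example, a Deployment like the following asks the scheduler to spread its pods evenly across the zones reported in the topology.kubernetes.io/zone label. The names and image are made up.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-spread-demo               # illustrative name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: zone-spread-demo
  template:
    metadata:
      labels:
        app: zone-spread-demo
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                     # allow at most one pod of difference between zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: zone-spread-demo
      containers:
      - name: web
        image: registry.example.com/demo:latest   # placeholder image

With three AZs and six replicas, this should land two pods per zone, assuming each zone has schedulable capacity.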
With regards to configuring these labels, there's currently nothing to stop you modifying the labels in place. You shouldn't do this, however, since it doesn't change the AZ of the underlying instance and pretty much kills whatever advantage the topology feature has. If you want to change the AZ of the Node then you'll need to either migrate it (an operation that comes with its own issues) or recreate it.
Machine API Provider OpenStack
The Machine API Provider OpenStack, or MAPO, is a Machine API provider for the OpenStack platform. The Machine API allows you to scale your cluster up or down based on workload policies or other preferences and functions quite similarly to the Cluster API, albeit with a different API. You can create Machines manually, but it's more common to instead create or modify MachineSets (for workers) or ControlPlaneMachineSets (for masters). You can find more information about the Machine API in the OpenShift documentation.
Like OpenStack CCM, MAPO uses AZ information to label resources - this time setting the machine.openshift.io/zone label on Machine resources:
❯ oc get -A Machine -o jsonpath='{.items[*].metadata.labels}' | jq
{
  "machine.openshift.io/cluster-api-cluster": "stephenfin-5ps6d",
  "machine.openshift.io/cluster-api-machine-role": "master",
  "machine.openshift.io/cluster-api-machine-type": "master",
  "machine.openshift.io/instance-type": "ci.m1.xlarge",
  "machine.openshift.io/region": "regionOne",
  "machine.openshift.io/zone": "nova" # <-- !!! here !!
}
Also like OpenStack CCM, MAPO sources this AZ information from the Nova API, as you can see here (instanceStatus is a thin wrapper around a ServerExt resource used by Gophercloud). The labels are also editable but, again, editing them won't actually change the AZ of the underlying instance and you'll need to make changes elsewhere to do this. However, unlike with Nodes, you can configure the AZ of a new or existing Machine or MachineSet/ControlPlaneMachineSet as part of the object definition, and this change will be reflected in the labels of both the Machine and the Node. For example, to define an AZ when creating a new MachineSet:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <infrastructure_id>-<role>
  namespace: openshift-machine-api
  # ...
spec:
  # ...
  template:
    spec:
      providerSpec:
        value:
          # ...
          availabilityZone: nova-az0
          rootVolume:
            # ...
            availabilityZone: cinder-az0
          # ...
Alternatively, to define one for a ControlPlaneMachineSet:
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  # ...
  template:
    # ...
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: OpenStack
        openstack:
        - availabilityZone: nova-az0
          rootVolume:
            availabilityZone: cinder-az0
        - availabilityZone: nova-az1
          rootVolume:
            availabilityZone: cinder-az1
        - availabilityZone: nova-az2
          rootVolume:
            availabilityZone: cinder-az2
      # ...
Cinder CSI Driver
The last component we're going to look at here is the Cinder CSI Driver. The Container Storage Interface (CSI) defines a standardised way to expose arbitrary block and file storage systems to Kubernetes workloads, allowing us to plug in storage backends for various cloud platforms or networked storage solutions like NFS or SMB. As you might suspect, the Cinder CSI Driver allows us to plug in storage from the OpenStack Block Storage service, Cinder, and to create PersistentVolumes or PersistentVolumeClaims that correspond to Cinder volumes.
Once again, the Cinder CSI driver uses AZ information to label resources, and once again it's the Nodes that get the resulting label. The Cinder CSI driver sets a single label on a node, the topology.cinder.csi.openstack.org/zone label:
❯ oc get Node -o jsonpath='{.items[*].metadata.labels}' | jq
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "ci.m1.xlarge",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "regionOne",
  "failure-domain.beta.kubernetes.io/zone": "nova",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "stephenfin-5ps6d-master-0",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "",
  "node-role.kubernetes.io/master": "",
  "node.kubernetes.io/instance-type": "ci.m1.xlarge",
  "node.openshift.io/os_id": "rhcos",
  "topology.cinder.csi.openstack.org/zone": "nova", # <--- !!! here !!!
  "topology.kubernetes.io/region": "regionOne",
  "topology.kubernetes.io/zone": "nova"
}
Unlike the other two components we've talked about though, the Cinder CSI Driver doesn't fetch AZ information from the Nova compute API. Instead, it fetches it from the Metadata API, as you can see here. That NodeGetInfo function forms part of the CSI spec and is used by kubelet, as detailed in the Kubernetes documentation.
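To give a sense of what the driver is reading, the instance metadata document that Nova serves at /openstack/latest/meta_data.json includes an availability_zone field alongside the other instance details. A heavily trimmed, purely illustrative example (values made up to match the node above) might look like this:

{
  "availability_zone": "nova",
  "name": "stephenfin-5ps6d-master-0",
  "uuid": "3b8f6a6e-0000-0000-0000-000000000000"
}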
Now, this use of the Metadata service is a somewhat unusual choice: the Metadata service is provided by Nova and the AZ information it exposes is the AZ of the instance (derived from the host the instance is scheduled to). It's therefore a Compute AZ, so why is it being used for a storage-related component? The answer is that we're using it because there's nothing else to use: as described in part 1, OpenStack doesn't provide any mechanism to associate compute hosts with storage AZs beyond using the same naming scheme across both compute and block storage AZs. In my experience, this loose coupling is a very frequent source of bugs.
To use the topology feature (which we'll go into more detail on shortly), you really need to have a common set of AZs for both the Compute and Block Storage services. Until OCP 4.15, the OpenStack Cinder CSI Driver Operator (i.e. the operator that deploys and manages the lifecycle of the Cinder CSI Driver in an OpenShift deployment) assumed this to be the case and always enabled the topology feature. This has since changed, but if you're running an older release then you're likely to encounter this issue if, for example, you're using multiple Nova AZs and a single Cinder AZ.
Making things even more complicated, migration of the Nova instance corresponding to a Node can result in an instance moving between AZs, assuming the Nova instance in question was not created in a specific AZ initially. The CSI driver will detect this change and will attempt to update the labels on the Node, resulting in the following rather nasty error that will require manual intervention to solve.
Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: error updating Node object with CSI driver node info: error updating node: timed out waiting for the condition; caused by: detected topology value collision: driver reported "topology.cinder.csi.openstack.org/zone":"nova" but existing label is "topology.cinder.csi.openstack.org/zone":"nova-az3",}
Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: error updating Node object with CSI driver node info: error updating node: timed out waiting for the condition; caused by: detected topology value collision: driver reported "topology.cinder.csi.openstack.org/zone":"nova" but existing label is "topology.cinder.csi.openstack.org/zone":"nova-az3", restarting registration container.
However, assuming we know about these issues and work to avoid them, how does one actually use the topology features provided by the Cinder CSI driver? There are two ways. The first is to configure a topology-aware StorageClass and use this for a PersistentVolumeClaim, as seen in the examples for the Cinder CSI Driver. For example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-standard
provisioner: cinder.csi.openstack.org
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.cinder.csi.openstack.org/zone
    values:
    - az1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc-with-az
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: topology-aware-standard
By using this storage class, we ensure that the Cinder volume created for the PVC will request an availability zone of az1. This is the standard mechanism promoted by Kubernetes and is also supported by other, non-OpenStack CSI drivers that provide topology support.
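Note that because the storage class uses WaitForFirstConsumer, nothing is provisioned until a pod that uses the claim is actually scheduled; only then are the topology constraints evaluated and the volume created. A minimal, purely illustrative pod consuming the claim above might look like this (the name and image are made up):

apiVersion: v1
kind: Pod
metadata:
  name: demo-app                     # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/demo:latest   # placeholder image
    volumeMounts:
    - name: data
      mountPath: /var/lib/data       # where the Cinder-backed volume is mounted
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-pvc-with-az    # the PVC defined above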
The other mechanism is to specify a Cinder CSI driver-specific parameter, availability, when creating the storage class. This is effectively a legacy option that pre-dates topology support in CSI, but it's still available:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: legacy-az
provisioner: cinder.csi.openstack.org
volumeBindingMode: WaitForFirstConsumer
parameters:
  availability: az1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc-with-az
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: legacy-az
With regards to configuring these labels, there's currently nothing to stop you modifying them in place, just like the labels OpenStack CCM sets on Nodes, and, as with those, you don't really want to do this since the Cinder CSI driver effectively owns them. The only time you may wish to do this is if you've migrated your Nova instance and hit the issue described above. In this case, you can opt to manually relabel the Node, taking care to drain any workload from it first to avoid topology mismatches.
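For the record, a manual fix along those lines might look something like the following, where the node name and zone are illustrative and the label is set to whatever zone the driver now reports for the instance. Double-check both values before trying anything like this against a real cluster.

❯ oc adm drain stephenfin-5ps6d-worker-0 --ignore-daemonsets --delete-emptydir-data
❯ oc label node stephenfin-5ps6d-worker-0 topology.cinder.csi.openstack.org/zone=nova-az3 --overwrite
❯ oc adm uncordon stephenfin-5ps6d-worker-0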
Wrap up
And so concludes this two-part series looking at Availability Zones in OpenStack and OpenShift. As you've hopefully ascertained, they can be a very useful feature, particularly in larger deployments, but there are more than a few potential banana skins to be aware of when you start using them for storage. By way of recommendations, I would suggest either using a single common AZ for your deployment (you can stick to the default of nova) or a common set of AZs across both the compute and block storage hosts (e.g. in a two-AZ deployment, you could use az0 and az1 for both compute and block storage AZs, rather than nova-az0 and nova-az1 for compute AZs and cinder-az0 and cinder-az1 for block storage AZs). If you insist on sticking with divergent sets of AZs, you should disable the topology feature and rely on the legacy availability parameter of the StorageClass instead.
I hope this has been useful to someone. If you spot any mistakes or identify things that I should have covered but didn't, feel free to send me an email and I'll try to get things sorted.