I’ve recently found myself once again working on the OpenStack Cinder CSI Driver and the Operator that OpenShift uses to deploy it. This work has inspired me to improve my knowledge of how the Cinder CSI Driver - and CSI drivers in general - work. Below is my current high-level understanding of both, as well as a quick summary of the changes we are making to the Cinder CSI Driver Operator in OpenShift 4.19.
Deployment of the Cinder CSI Driver
The Cinder CSI Driver Operator deploys the driver itself as two components: a controller component and a per-node component, which is the typical deployment model for CSI drivers. The controller component is managed via a Deployment, which you can see here. It consists of the controller plugin and a number of sidecar containers that act as a bridge between the Kubernetes control plane (e.g. the kube-controller-manager) and the controller plugin: each sidecar watches the Kubernetes API for relevant events and issues the corresponding CSI RPC calls to the plugin over a shared Unix domain socket. Breaking these down one-by-one (a simplified sketch of the resulting pod spec follows the list):
- The controller plugin container (`csi-driver`) implements the Controller Service and Identity Service sets of RPCs described in the CSI spec. It is responsible for handling requests by calling the cloud provider’s APIs (Cinder and Nova, in this case). You can find the Cinder CSI implementation of the Controller Service here.
- The attacher sidecar container (`csi-attacher`) watches for attach and detach requests and calls `ControllerPublishVolume` and `ControllerUnpublishVolume`, respectively (source).
- The provisioner sidecar container (`csi-provisioner`) watches for PVC creation and deletion and calls `CreateVolume` and `DeleteVolume`, respectively (source).
- The snapshotter sidecar container (`csi-snapshotter`) does the same as the provisioner but for snapshots, calling `CreateSnapshot` and `DeleteSnapshot` (source).
- The resizer sidecar container (`csi-resizer`) watches for changes to a PVC and calls `ControllerExpandVolume` as necessary (source).
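To make the layout more concrete, here is a heavily trimmed sketch of what such a controller Deployment can look like. The names, images and paths below are illustrative placeholders rather than values taken from the actual OpenShift asset, and only one sidecar is shown; the attacher, snapshotter and resizer follow the same pattern.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openstack-cinder-csi-driver-controller   # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: openstack-cinder-csi-driver-controller
  template:
    metadata:
      labels:
        app: openstack-cinder-csi-driver-controller
    spec:
      containers:
      # Controller plugin: implements the Controller and Identity services.
      - name: csi-driver
        image: registry.example.com/cinder-csi-plugin:latest   # placeholder image
        args:
        - --endpoint=$(CSI_ENDPOINT)
        - --cloud-config=/etc/kubernetes/cloud.conf
        env:
        - name: CSI_ENDPOINT
          value: unix:///csi/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      # One of the sidecars: watches PVCs and calls CreateVolume/DeleteVolume
      # against the plugin over the shared socket.
      - name: csi-provisioner
        image: registry.example.com/csi-provisioner:latest      # placeholder image
        args:
        - --csi-address=/csi/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      volumes:
      # The emptyDir hosting the Unix domain socket shared by plugin and sidecars.
      - name: socket-dir
        emptyDir: {}
```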
The per-node component, by comparison, is deployed to each node using a DaemonSet. You can see the definition for this here. It consists of the node plugin and a single sidecar container (again, a simplified sketch follows the list):

- The node plugin container (`csi-driver`) implements the Node Service and Identity Service sets of RPCs described in the CSI spec. It is responsible for reporting information about the node and for bind mounting volumes once they are attached to the host. Specifically, it reports the node’s ID, the maximum number of volumes it supports, and topology information. In the case of Cinder, both the ID and topology information are sourced from the metadata service, while the volume limit is determined via a configuration option. You can find the Cinder CSI implementation of the Node Service here.
- The node-driver-registrar sidecar container (`csi-node-driver-registrar`) registers the CSI driver with kubelet, allowing kubelet to call `NodeGetInfo`, `NodeStageVolume`, `NodePublishVolume`, etc. (source).
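As before, a trimmed and purely illustrative sketch gives a feel for how the node pieces fit together; names, images and host paths are placeholders, not copied from the real asset.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: openstack-cinder-csi-driver-node   # placeholder name
spec:
  selector:
    matchLabels:
      app: openstack-cinder-csi-driver-node
  template:
    metadata:
      labels:
        app: openstack-cinder-csi-driver-node
    spec:
      containers:
      # Node plugin: implements the Node and Identity services and performs the
      # bind mounts, so it needs privileges and access to the kubelet directory.
      - name: csi-driver
        image: registry.example.com/cinder-csi-plugin:latest   # placeholder image
        securityContext:
          privileged: true
        args:
        - --endpoint=$(CSI_ENDPOINT)
        - --cloud-config=/etc/kubernetes/cloud.conf
        env:
        - name: CSI_ENDPOINT
          value: unix:///csi/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
        - name: kubelet-dir
          mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional
      # Registers the driver's socket with kubelet so kubelet can call the Node
      # Service RPCs directly.
      - name: csi-node-driver-registrar
        image: registry.example.com/csi-node-driver-registrar:latest   # placeholder image
        args:
        - --csi-address=/csi/csi.sock
        - --kubelet-registration-path=/var/lib/kubelet/plugins/cinder.csi.openstack.org/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
        - name: registration-dir
          mountPath: /registration
      volumes:
      - name: socket-dir
        hostPath:
          path: /var/lib/kubelet/plugins/cinder.csi.openstack.org
          type: DirectoryOrCreate
      - name: registration-dir
        hostPath:
          path: /var/lib/kubelet/plugins_registry
          type: Directory
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet
          type: Directory
```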
Changes to topology auto-configuration
Now that we understand the various components that make up the CSI Driver, let’s take a look at the changes we’ve been working on in this area. As I’ve previously discussed, the Cinder CSI Driver has support for Availability Zones (or, in CSI parlance, the CSI Topology Feature) and since OpenShift 4.16 or so the Cinder CSI Driver Operator has supported auto-configuration of this feature. Without getting too into the weeds, this is determined via a simple set comparison: the set of Compute AZs is compared to the set of Block Storage AZs, and if the former isn’t a subset of the latter (e.g. if there is a Compute AZ called `foo` but no Block Storage AZ of the same name) then we determine that the feature should be disabled. Once we’ve determined this, we toggle the `Topology` feature gate of the `csi-provisioner` sidecar container, thus ensuring that the `AccessibilityRequirements` field of the `CreateVolumeRequest` struct generated by the provisioner (and fed to the controller plugin) is not populated.
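In manifest terms, this boils down to the operator rendering the `csi-provisioner` container with the gate switched on or off. A rough, illustrative fragment (the flag rendering in the real asset may well differ):

```yaml
# Fragment of the controller Deployment's container list (illustrative only).
- name: csi-provisioner
  image: registry.example.com/csi-provisioner:latest   # placeholder image
  args:
  - --csi-address=/csi/csi.sock
  # Topology=true when every Compute AZ has a matching Block Storage AZ,
  # Topology=false otherwise.
  - --feature-gates=Topology=false
```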
However, things change and the Topology feature is now considered mature and is enabled by default. This means it is likely that the feature gate will be removed at some point in the not-too-distant future, which in turn means we need to find another way to enable and disable topology support from the operator. The solution we’ve arrived at is to copy what was done in Manila and add support for a new `--with-topology` option for both the controller plugin and node plugin services. This new option has different effects depending on where it is set (see the sketch after this list):

- For the controller plugin, the option determines (a) whether the calls to Cinder include a requested AZ and (b) whether the `CreateVolumeResponse` returned by the `CreateVolume` call includes topology accessibility information.
- For the node plugin, the option determines (a) whether the node reports the topology capability (as part of the `GetPluginCapabilities` RPC) and (b) whether it reports topology information (as part of the `NodeGetInfo` call).
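Illustratively, once topology should be enabled, the flag simply needs to appear in the `csi-driver` containers’ arguments in both the controller Deployment and the node DaemonSet. The fragment below is a sketch; the values and surrounding fields are placeholders rather than the operator’s actual rendering.

```yaml
# Fragment of a csi-driver container spec (illustrative only).
- name: csi-driver
  image: registry.example.com/cinder-csi-plugin:latest   # placeholder image
  args:
  - --endpoint=$(CSI_ENDPOINT)
  - --cloud-config=/etc/kubernetes/cloud.conf
  # In the controller plugin this controls whether Cinder calls include an AZ and
  # whether CreateVolumeResponse carries accessibility information; in the node
  # plugin it controls whether the capability and topology are reported.
  - --with-topology=true
```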
This work has been implemented in kubernetes/cloud-provider-openstack#2743 (with some follow-ups in kubernetes/cloud-provider-openstack#2862 and kubernetes/cloud-provider-openstack#2865). With the new option in place, we’ve been able to change how the Operator toggles the Topology feature. Now, instead of enabling and disabling the feature gate on the `csi-provisioner` container, it can enable and disable the feature on the `csi-driver` containers in the controller Deployment and the node DaemonSet. That work has been implemented in openshift/csi-operator#345.
Next steps
I’m hoping this is the last time I feel the need to write about the Cinder CSI Driver and its Operator. The work we’ve done here should future-proof both and ensure that, barring major changes to the CSI spec itself, few other changes will be needed for the foreseeable future. I would, however, like to get a better understanding of how the equivalent feature in the Manila CSI Driver works, so watch out for a possible post on that topic down the line.