Why You Can't Schedule to Host NUMA Nodes in Nova

If I had a euro for every time someone had asked me or someone else working on nova for the ability to schedule an instance to a specific host NUMA node, I might never have to leave the pub (back in halcyon days pre-COVID-19 when pubs were still a thing, that is).

Below is an edited version of a response one of my friends and colleagues, Sean Mooney, recently provided to a Red Hat partner asking just this question. This information is accurate as of the nova 21.0.0 (Ussuri) release but is subject to change in future releases.

What’s wrong with choosing a host NUMA node?

In almost all cases when we discuss the motivation for this request with people, we discover that selecting host CPUs or NUMA nodes via the flavor is not actually what they wanted to do. Rather, it is seen as a means to an end recommended by people familiar with virtualisation technologies but not with cloud platforms. There are a number of reasons this is not considered an acceptable solution in a cloud context. To summarize:

It is seen as a potential security concern for public clouds.

If flavors could map to host resources like CPUs or NUMA nodes, choosing the correct flavor would require knowledge of the underlying hardware. A malicious user could use this information as a denial-of-service vector, intentionally placing their instance on the same host and NUMA node as their victim and then exhausting the memory bandwidth of that NUMA node. It also indirectly exposes information about what hardware a cloud is running, which many clouds would prefer not to share.

It is a violation of the cloud abstraction.

OpenStack is a cloud platform intended to provide a consistent API across multiple backend implementations of services. In fact, cloud abstraction is a key element of the nova project scope. Virtual NUMA topologies are supported by two main drivers today, Libvirt and Hyper-V, and while the Libvirt driver and Libvirt in general are much more flexible than other virt drivers in many aspects, nova does strive to keep the differences to a minimum. If we were to encode the semantics of how virtual resources map to physical resources, we would exclude the possibility of achieving interoperability between different drivers or different clouds, as well as create barriers to adopting features across multiple drivers.

It makes our clouds much smaller and less useful.

Encoding host-specific resource assignment information severely limits the available hosts that can be used to create an instance. It complicates move operations and makes the scheduler harder to maintain and extend over time. The operational overhead of having to create different flavors for a VM that runs on node 5 vs node 6, and the additional cognitive load that this puts on the end user, is a failure in API design. Instead of expressing an abstract policy, the end user and operator need to understand the intricate mechanics of how the workload compute context is created.

There are additional reasons to those listed above, but these alone should serve to illustrate that this is not an oversight in how nova currently functions but rather a very deliberate design choice that we do not want to remove.

But we really need this, so why can’t you special case it for us?

If you were willing to ignore all of the above, and the many other reasons for not doing this, you might be tempted to think that you could maybe hack this in anyway. After all, extra specs have long been one of the untamed corners of nova (at least, they were until Ussuri). Not so fast. Long-term, it’s unlikely that this will even be an option for reasons to do with how we’ve handled scheduling in nova in the past and how we’re planning to evolve it in the future.

As you may or may not be aware, there has been a multi-year effort to evolve the tracking of resources in OpenStack. This effort predates the creation of the placement service but was the primary reason for its existence. Nova has a multi-layer scheduling approach, the first step of which is delegated to the placement service, followed by filtering and then weighing. Following scheduling (that is, the act of selecting a host) there is then a resource assignment phase that is performed by the virt driver.

While the scheduling and resource assignment steps are heavily linked, they are independent operations. To put this in concrete terms, while the NUMA topology filter has input into selecting a host, by determining whether a given NUMA topology can be created on a given host, it has no input into selecting which NUMA nodes on the host will be used for the VM. That decision is made entirely by the virt driver during the assignment phase. The three phases of nova scheduling today are summarised as follows:

The placement service is queried for a set of allocation candidates (potential hosts your instance can be scheduled to) that represent resource allocations on hosts that can fulfil the quantitative and qualitative requirements of the resources requested for an instance.
The hosts represented by those allocation candidates are filtered, primarily on non-resource-related attributes such as server group affinity or anti-affinity constraints, and on resources and topologies that are not yet modeled in placement, such as PCI devices, hugepages and various NUMA affinity metrics.
The filtered hosts are weighed to select an optimal host.
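To make the split between scheduling and assignment a little more concrete, here is a toy sketch of the three phases followed by a separate assignment step. It is not nova's code: the `Host` and `Request` classes, the function names and the numbers are all invented for illustration, and the real filters and weighers consider far more than memory. The point it tries to show is that the NUMA check in the filter phase only answers "can this fit somewhere on the host?", while the choice of host NUMA node happens afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical, simplified view of a compute host.
    name: str
    free_vcpus: int
    free_memory_mb: int
    # Free memory per host NUMA node (index = host NUMA node ID).
    numa_free_mb: list = field(default_factory=list)


@dataclass
class Request:
    vcpus: int
    memory_mb: int
    numa_nodes: int = 1  # size of the *guest* NUMA topology


def allocation_candidates(hosts, req):
    """Phase 1 (placement): keep hosts with enough total capacity."""
    return [h for h in hosts
            if h.free_vcpus >= req.vcpus and h.free_memory_mb >= req.memory_mb]


def numa_filter(host, req):
    """Phase 2 (filter): can the guest topology fit somewhere on this host?

    Returns only True or False; it does not say which host node is used.
    """
    per_node = req.memory_mb // req.numa_nodes
    fitting = [free for free in host.numa_free_mb if free >= per_node]
    return len(fitting) >= req.numa_nodes


def weigh(hosts):
    """Phase 3 (weigher): here, simply prefer the most free memory."""
    return max(hosts, key=lambda h: h.free_memory_mb)


def schedule(hosts, req):
    candidates = allocation_candidates(hosts, req)
    filtered = [h for h in candidates if numa_filter(h, req)]
    if not filtered:
        raise RuntimeError("No valid host was found")
    return weigh(filtered)


def assign(host, req):
    """Assignment (virt driver): only now are host NUMA nodes chosen."""
    per_node = req.memory_mb // req.numa_nodes
    chosen = [i for i, free in enumerate(host.numa_free_mb) if free >= per_node]
    return chosen[:req.numa_nodes]


hosts = [
    Host("compute-1", free_vcpus=8, free_memory_mb=8192, numa_free_mb=[2048, 6144]),
    Host("compute-2", free_vcpus=8, free_memory_mb=6144, numa_free_mb=[3072, 3072]),
    Host("compute-3", free_vcpus=2, free_memory_mb=16384, numa_free_mb=[8192, 8192]),
]
req = Request(vcpus=4, memory_mb=4096)
selected = schedule(hosts, req)
print(selected.name, assign(selected, req))  # compute-1 [1]
```

Note that nothing in `Request` names a host NUMA node: `compute-2` passes phase 1 on total memory but is rejected by the filter because no single node has 4096 MB free, and the node used on `compute-1` is decided only inside `assign`.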

Up until now, placement, which has a global hierarchical view of resources from multiple services, has largely operated on the basis of tracking a simple tally of resources on a given host. When it was originally created, placement modeled each compute host as a single resource provider containing multiple inventories of different resource classes. Each resource provider can only have one inventory per resource class, so when memory is represented as MEMORY_MB it tracks the total available system memory on the host. Placement also tracks allocations from resource providers to instances so it can know how much of each resource is still available. It can then use this capacity information, in addition to some qualitative traits (e.g. this host has SSD storage while this host does not), to produce a set of potential allocation candidates.
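The flat model described here can be sketched in a few lines. This is a deliberately simplified stand-in rather than the placement data model: the `ResourceProvider` class and its methods are invented for this example, and real inventories also carry fields such as reserved amounts and allocation ratios that are left out. The trait name is used purely as an illustrative label.

```python
from collections import defaultdict


class ResourceProvider:
    """A flat provider: at most one inventory per resource class."""

    def __init__(self, name, traits=()):
        self.name = name
        self.traits = set(traits)
        self.inventories = {}         # resource class -> total
        self.used = defaultdict(int)  # resource class -> allocated amount

    def set_inventory(self, resource_class, total):
        # Setting the same class twice overwrites the previous value,
        # which is what "one inventory per resource class" implies.
        self.inventories[resource_class] = total

    def available(self, resource_class):
        return self.inventories.get(resource_class, 0) - self.used[resource_class]

    def can_fit(self, resources, required_traits=()):
        return (set(required_traits) <= self.traits and
                all(self.available(rc) >= amount
                    for rc, amount in resources.items()))

    def allocate(self, resources):
        if not self.can_fit(resources):
            raise ValueError(f"{self.name} cannot fit {resources}")
        for rc, amount in resources.items():
            self.used[rc] += amount


host_a = ResourceProvider("host-a", traits={"STORAGE_DISK_SSD"})
host_a.set_inventory("VCPU", 16)
host_a.set_inventory("MEMORY_MB", 32768)

host_b = ResourceProvider("host-b")
host_b.set_inventory("VCPU", 16)
host_b.set_inventory("MEMORY_MB", 32768)

# An existing instance consumes part of host-a.
host_a.allocate({"VCPU": 12, "MEMORY_MB": 8192})

request = {"VCPU": 4, "MEMORY_MB": 4096}
candidates = [rp.name for rp in (host_a, host_b) if rp.can_fit(request)]
ssd_candidates = [rp.name for rp in (host_a, host_b)
                  if rp.can_fit(request, ["STORAGE_DISK_SSD"])]
print(candidates)      # ['host-a', 'host-b']
print(ssd_candidates)  # ['host-a']
```

In this view a host is just a set of counters, which is exactly why it cannot say anything about where on the host a given megabyte of memory lives.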

As we have evolved from this simple view, we have extended the placement API to support nested resource providers, allowing us to convert a flat view of a host’s resources into a tree data structure. There were many reasons for this effort, but the chief motivator was that it allows us to begin modelling of NUMA topologies in placement. This will allow us to track hugepages as inventory in placement as well as consider NUMA affinity for things like vGPUs (note: we also plan to model generic PCI devices in placement, though this effort will likely not begin until the NUMA-in-placement effort is complete). There are two implications of this work that would prevent a “boot on NUMA node N” feature in the future.
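A small sketch may help show what the move from a flat view to a tree buys us. The `Provider` class, the provider names and the inventory figures below are all made up for illustration, and the amounts stand for free capacity rather than totals. The host as a whole appears to have plenty of memory, but only one NUMA node can actually satisfy the request on its own.

```python
class Provider:
    """Hypothetical resource provider that can have child providers."""

    def __init__(self, name, inventory=None):
        self.name = name
        self.inventory = dict(inventory or {})
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child

    def walk(self):
        # Depth-first traversal of this provider and its descendants.
        yield self
        for child in self.children:
            yield from child.walk()


def total(root, resource_class):
    """Flat view: sum a resource class over the whole tree."""
    return sum(p.inventory.get(resource_class, 0) for p in root.walk())


def providers_that_fit(root, resources):
    """Tree view: providers that can satisfy all of `resources` alone."""
    return [p.name for p in root.walk()
            if all(p.inventory.get(rc, 0) >= amount
                   for rc, amount in resources.items())]


# The root keeps host-wide resources; each NUMA node is a child provider.
host = Provider("compute-1", {"DISK_GB": 500})
host.add_child(Provider("compute-1_NUMA0", {"VCPU": 8, "MEMORY_MB": 16384}))
host.add_child(Provider("compute-1_NUMA1", {"VCPU": 8, "MEMORY_MB": 4096}))

flat_memory = total(host, "MEMORY_MB")
fits = providers_that_fit(host, {"VCPU": 4, "MEMORY_MB": 8192})
print(flat_memory)  # 20480
print(fits)         # ['compute-1_NUMA0']
```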

Firstly, doing this work will allow us to remove the NUMA topology filter in the future, increasing scheduler performance among other things, but you do not get this performance for free. In the case of placement, the trade-off is in the freedom to select resources on the host during the assignment phase. In order to maintain placement as the single source of truth with regard to capacity and availability of resources, it is vitally important that, when a resource class on a host exists on multiple separate providers, the virt driver only allocates resources from the hardware corresponding to the resource provider chosen during scheduling. For example, memory tracked per NUMA node or vGPUs tracked per physical GPU must be assigned from the NUMA node or pGPU that the allocation is made against in placement. This means that in a world where all resources are tracked in placement, the assignment done by a virt driver is constrained to the subset of hardware that correlates with the placement allocation. It will no longer be possible to, for example, ensure that a given instance is pinned to CPUs from a specific NUMA node.
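As a rough sketch of that constraint, consider a virt driver that receives an allocation keyed by resource provider and has to pick host CPUs. Everything here is hypothetical: the provider names, the `PROVIDER_TO_CELL` mapping, the CPU layout and the `pin_cpus` function are invented, and the allocation dict is only loosely modelled on what placement returns. What matters is that the driver never looks at a NUMA node other than the one the allocation names.

```python
# Hypothetical mapping from placement provider names to host NUMA cells.
PROVIDER_TO_CELL = {"compute-1_NUMA0": 0, "compute-1_NUMA1": 1}

# Hypothetical host layout: NUMA cell -> free host CPU IDs.
FREE_CPUS = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}


def pin_cpus(allocation, free_cpus=FREE_CPUS):
    """Pick host CPUs for an instance, honouring the placement allocation.

    `allocation` maps provider name -> {resource class: amount}. CPUs are
    only ever taken from the cell that backs the allocated provider.
    """
    pinning = {}
    for provider, resources in allocation.items():
        wanted = resources.get("PCPU", 0)
        if not wanted:
            continue
        cell = PROVIDER_TO_CELL[provider]
        available = free_cpus[cell]
        if len(available) < wanted:
            # Falling back to another cell would make placement's
            # bookkeeping wrong, so the only option is to fail.
            raise RuntimeError(f"{provider} has too few free CPUs")
        pinning[cell] = available[:wanted]
    return pinning


allocation = {"compute-1_NUMA1": {"PCPU": 2, "MEMORY_MB": 2048}}
print(pin_cpus(allocation))  # {1: [4, 5]}
```

There is no parameter through which a user-supplied "use node 0" preference could enter `pin_cpus`: by the time the driver runs, the node has already been fixed by the allocation.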

Secondly, it’s important to realize that the resource topology reported to placement is determined by the virt driver that reports it. For example, ironic hosts are represented using a single custom resource class, rather than reporting VCPU, MEMORY_MB and DISK_GB inventories, in order to model their consumption as a single unit. This means that at the time of making the placement query, since we have not yet selected a host, we also have not selected a compute context (hypervisor, baremetal/composable server or container runtime) and as such cannot make assumptions about how the virt driver that manages that host will model its resources in placement. Nova supports using multiple compute contexts concurrently and it’s not uncommon for operators to use image-based filters to map Windows images to Hyper-V hosts and Linux images to VMware or Libvirt hosts. Similarly, in a multi-architecture cloud, they may offer PowerVM hosts. Since each of these compute contexts may support assigning host resources to instances in different ways, we cannot make it an API requirement to pin the instance to NUMA node N based on a flavor extra spec.
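To illustrate how differently two drivers can describe a host, here is a minimal sketch with made-up inventories. The host names, the amounts and the custom resource class name `CUSTOM_BAREMETAL_GOLD` are examples only, and `matching_hosts` is a hypothetical helper, not a placement API.

```python
# Hypothetical inventories, as two different virt drivers might report them.
INVENTORIES = {
    "libvirt-host": {"VCPU": 16, "MEMORY_MB": 65536, "DISK_GB": 1000},
    # One unit of a custom class: the whole machine is consumed at once.
    "ironic-node": {"CUSTOM_BAREMETAL_GOLD": 1},
}


def matching_hosts(request, inventories=INVENTORIES):
    """Return hosts whose reported inventory can satisfy the request."""
    return sorted(
        host for host, inv in inventories.items()
        if all(inv.get(rc, 0) >= amount for rc, amount in request.items())
    )


vm_request = {"VCPU": 4, "MEMORY_MB": 8192, "DISK_GB": 40}
baremetal_request = {"CUSTOM_BAREMETAL_GOLD": 1}

print(matching_hosts(vm_request))         # ['libvirt-host']
print(matching_hosts(baremetal_request))  # ['ironic-node']
```

A request phrased in terms of "NUMA node N" would presuppose one particular shape of inventory before a host, and therefore a driver, has even been chosen.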

Taken together, the continued effort to move tracking of all resources to placement means any effort to map a given instance to a specific host NUMA node is dead on arrival. We can achieve what’s necessary, but it can and should be done in a better way.