Stretched / Multi Zone Cluster Considerations
At a recent client, the subject of Stretched / Multizone Clusters came up. The following provides a summary of what these are and some of the considerations that should be taken into account when using them.
A Stretched Cluster (as illustrated below) is defined as a cluster which has worker nodes distributed across multiple Availability Zones (AZs) within a single region, thereby providing resilience in the event of a single Availability Zone failure.
Stretched / Multizone Clustering Limitations of Note:
Kubernetes assumes that the different zones are located close to each other in the network and, as a result, does not perform any zone-aware routing by default. Inter-service traffic may therefore cross zones (even if pods for the target service exist in the caller's own zone), incurring additional latency and cost.
Whilst nodes are in multiple zones, by default kube-up currently builds a single master node. While services are highly available and can tolerate the loss of a zone, the control plane is located in a single zone. Where a highly available control plane is required this needs to be configured (refer to: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/).
Communications between pods on a cluster are not encrypted by default.
Pods will only be able to attach to Persistent Volumes in the same Availability Zone as themselves.
Kubernetes will distribute pods across nodes in multiple Availability Zones using its “SelectorSpreadPriority”, which works on a best-effort basis (an even spread is not guaranteed).
The cluster is highly available in the event of a single AZ failure. All of the pods/containers deployed to the remaining AZs will continue to operate.
Resource allocation is more efficient because all Pods share the resources of a single cluster, which leads to higher overall utilisation.
It is easier to manage a single cluster than multiple clusters across AZs.
The cluster itself remains a single point of failure (SPOF). It is not uncommon for Kubernetes upgrades to lead to cluster failures or to require full restarts (refer to: https://www.alibabacloud.com/help/doc-detail/86497.htm).
Mitigations include: deploying a second stretched cluster; and ensuring that any upgrades are tested in lower environments (and operated there for a period of time) before being applied to production.
There is an increased security risk where multiple services share a common stretched cluster. If one application running on the cluster is compromised and the attackers gain access to the underlying host, they may be able to gain access to other services running on the cluster.
Mitigations over and above hardening include: separating sensitive workloads by either having dedicated clusters or by using node pools.
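As a rough illustration of the node pool approach, the sketch below builds a Pod manifest that is restricted to one pool by a node label and a matching toleration. It is a minimal sketch under stated assumptions: the label key `example.com/node-pool`, the pool name `sensitive`, the pod name and the image are all hypothetical, and it assumes the pool's nodes carry a `NoSchedule` taint with the same key and value so that other workloads stay off them. Managed Kubernetes offerings each use their own node pool labels.

```python
import json


def pin_to_node_pool(pod_spec: dict, pool_label_key: str, pool_name: str) -> dict:
    """Return a copy of pod_spec that may only run on nodes of one pool.

    pool_label_key and pool_name are hypothetical values for illustration.
    """
    pinned = dict(pod_spec)
    # nodeSelector: only schedule onto nodes that carry the pool label.
    pinned["nodeSelector"] = {
        **pod_spec.get("nodeSelector", {}),
        pool_label_key: pool_name,
    }
    # Toleration: permit scheduling onto nodes tainted for this pool
    # (assumes the pool's nodes have a matching NoSchedule taint).
    pinned["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {
            "key": pool_label_key,
            "operator": "Equal",
            "value": pool_name,
            "effect": "NoSchedule",
        }
    ]
    return pinned


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "payments-api"},  # hypothetical workload name
    "spec": pin_to_node_pool(
        {"containers": [{"name": "app", "image": "example.com/payments-api:1.0"}]},
        pool_label_key="example.com/node-pool",
        pool_name="sensitive",
    ),
}

# kubectl accepts JSON as well as YAML, so the output can be applied directly.
print(json.dumps(pod, indent=2))
```

This only controls where pods are scheduled; it complements host and cluster hardening rather than replacing it.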
Pods may not be equally distributed across the AZs thereby making the failure of one AZ more impactful than another.
Mitigations include: using Kubernetes scheduling constraints such as “podAffinity”, “podAntiAffinity” and “nodeAffinity”; and ensuring that each service has at least one instance in each AZ.
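The two mitigations above can be sketched as follows: a “podAntiAffinity” rule that asks the scheduler to spread replicas of a service across zones, plus a small check that reports any AZ without a replica. This is a minimal sketch under stated assumptions: the `app` label, the service name `orders` and the zone names are hypothetical, and the zone list passed to the check is assumed to come from wherever pod placement is already being observed. The topology key `topology.kubernetes.io/zone` is the well-known node label for the zone.

```python
import json

# Well-known node label that carries the Availability Zone of a node.
ZONE_TOPOLOGY_KEY = "topology.kubernetes.io/zone"


def zone_anti_affinity(app_label: str) -> dict:
    """Build an affinity block that prefers placing replicas in different zones."""
    return {
        "podAntiAffinity": {
            # "preferred" is a soft rule: the scheduler tries to honour it
            # but will still place the pod if it cannot.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        "topologyKey": ZONE_TOPOLOGY_KEY,
                    },
                }
            ]
        }
    }


def zones_missing_replicas(pod_zones: list, all_zones: list) -> list:
    """Return the zones in which the service currently has no pod."""
    return sorted(set(all_zones) - set(pod_zones))


# Goes under spec.template.spec.affinity of a Deployment.
affinity = zone_anti_affinity("orders")
print(json.dumps({"affinity": affinity}, indent=2))

# Hypothetical placement: two replicas in zone a, one in zone b, none in zone c.
missing = zones_missing_replicas(
    ["eu-west-1a", "eu-west-1a", "eu-west-1b"],
    ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
)
print(missing)  # ['eu-west-1c']
```

Because the rule is a preference, the spread is still best effort; `requiredDuringSchedulingIgnoredDuringExecution` makes it a hard constraint, at the cost of pods staying unscheduled when a zone has no capacity.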
Network latency may impact performance where AZs are not located near one another, because by default Kubernetes does not route inter-service traffic to instances in the caller's own zone, even where such instances are available.
Mitigations include: performance testing to understand the impact of key services being called from another AZ.
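A performance test of this kind can start very simply: time the same call repeatedly from a pod in each AZ and compare the distributions. The helper below is a minimal sketch rather than a full load test; `measure_latency_ms` is a hypothetical name, and the call being timed is passed in so the same function can wrap an HTTP request, a gRPC call or a database query.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency_ms(call: Callable[[], object], samples: int = 50) -> dict:
    """Time `call` repeatedly and summarise the latency in milliseconds."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = math.ceil(0.95 * len(timings)) - 1
    return {
        "min": timings[0],
        "median": statistics.median(timings),
        "p95": timings[p95_index],
        "max": timings[-1],
    }


# Stand-in for a real service call so the sketch runs anywhere.
summary = measure_latency_ms(lambda: time.sleep(0.001), samples=20)
print({name: round(value, 2) for name, value in summary.items()})

# Against a real service (hypothetical in-cluster URL), the call could be:
#   import urllib.request
#   measure_latency_ms(
#       lambda: urllib.request.urlopen("http://orders.default.svc/health").read()
#   )
```

Running it once from a pod in the same AZ as the target and once from a pod in a different AZ gives a first estimate of the cross-zone penalty for that service.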
Misconfigurations or poor adoption of best practices, such as failing to configure resource limits, can affect all services hosted on the cluster. For example, where limits are not applied, a pod/container could consume resources required by other services, thereby causing failures.
Mitigations include: ensuring that Kubernetes best practices are adhered to (refer to: https://kubernetes.io/docs/tasks/administer-cluster/securing-a-cluster/).
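One way to guard against missing resource limits is a namespace-level default. The sketch below builds a Kubernetes LimitRange object, which applies default requests and limits to containers that do not declare their own. The LimitRange itself is an assumption on our part (the text above does not name a specific mechanism), and the namespace, object name and values are purely illustrative.

```python
import json


def default_limit_range(
    namespace: str,
    cpu_request: str,
    cpu_limit: str,
    memory_request: str,
    memory_limit: str,
) -> dict:
    """Build a LimitRange that sets per-container defaults in one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Used when a container declares no resource requests.
                    "defaultRequest": {"cpu": cpu_request, "memory": memory_request},
                    # Used when a container declares no resource limits.
                    "default": {"cpu": cpu_limit, "memory": memory_limit},
                }
            ]
        },
    }


# Illustrative values only; size them from observed usage of the workloads.
limit_range = default_limit_range(
    namespace="team-a",
    cpu_request="100m",
    cpu_limit="500m",
    memory_request="128Mi",
    memory_limit="256Mi",
)
print(json.dumps(limit_range, indent=2))
```

With defaults in place, a container that omits its own limits can no longer consume an unbounded share of a node that other services depend on.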