February 7, 2019 | Written by: Etai Lev Ran
Categorized: Compute Services
The current multicluster Istio status
There is a growing community interest in running workloads on multiple clusters to achieve better scaling, failure isolation, and application agility. Istio v1.0 supports some multicluster capabilities and new ones are added in v1.1.
This blog post highlights the current multicluster Istio status, helping interested people understand what capabilities exist and how they may be used. We also hope to receive feedback from the readers on whether current support is sufficient for their needs.
Before going into actual implementation details, let’s spend a minute on terminology. Requirements were somewhat unclear in mid-to-late 2018, largely because “multicluster” meant different things to different people, in different contexts.
The terminology used in this blog will be as follows:
- Cloud: A provider. As such, the term will not be used.
- Cluster: A collection of Kubernetes nodes with shared API masters. While Istio also works with other cluster types, such as virtual machines, initial focus was on Kubernetes clusters.
- Network: A set of endpoints or service instances that are directly interconnected from a network perspective. That is, barring any security devices and policies, anyone can talk to anyone. How direct connectivity is achieved is out of scope. It may use a Virtual Private Cloud (VPC), a Virtual Private Network (VPN), or any kind of overlay.
- Mesh: A set of workloads under a common administrative control.
It is important to note that the terms and their order do not define a specific relation, such as containment.
In fact, we’ve seen quite a variety of combinations between meshes, clusters, and networks. For instance, in some clouds, network and cluster are directly related. Each cluster is assigned its own network and is independent of other networks in other clusters. Consequently, overlapping pod and service IP addresses can be assigned, making connectivity harder. Similarly, a Virtual Machine (VM) would typically run outside the Kubernetes cluster—it may belong to the same mesh and could be running on the same network (e.g., when attached to a shared VPC).
The expectation is that Istio multicluster would be capable of addressing all of these. Understandably, implementation priority would be (or should be) driven by common use cases, with more prevalent setups addressed earlier on.
At a high level, two common patterns or use cases emerged: single mesh and mesh federation. As the name implies, single mesh combines multiple clusters into one unit, managed by one Istio control plane. It can be implemented as one “physical” control plane or as a set of control planes all synchronized with replicated configuration. This would typically use additional tooling, driven by shared CI/CD pipelines and/or GitOps practices.
The mesh federation pattern keeps clusters separate as individual management domains. Connections between clusters would then be done selectively, exposing only a subset of the services to other clusters.
To understand some of the design and implementation trade-offs, we first need to step through what is actually involved in making a cross-cluster call. Once we break the high-level flow down into steps, it helps to understand what capabilities are needed to support that.
First, the calling workload (a.k.a. client) needs to resolve the remote workload’s name to a network endpoint. This would typically be done using DNS in Kubernetes, though other discovery systems (such as Consul) can also be used. To successfully resolve a name to an endpoint, the service must somehow be registered in the client’s local DNS server or registry.
Given a network endpoint, the client opens an outgoing connection and sends a request on it. Both are intercepted by the Envoy sidecar proxy, which acts on configuration it received from Pilot (which, in turn, received it from Galley). The connection and request are mapped to an upstream and a specific endpoint and then routed to the remote endpoint. Depending on network topology and security requirements, the client-side Envoy may connect directly to the remote endpoint, or the connection might need to be routed through Istio’s egress and/or ingress gateways.
The remote (or server-side) proxy accepts the connection and validates the identities using mutual TLS exchange. Implicit in this is the need for certificates to share a common root of trust, even when signed by different Citadels.
An access check may be needed, in which case identities from different clusters are sent to the Mixer and matched against the set policies. Once the response is returned, operational information, such as latency, path, return codes, etc., may be collected and logged. We may want to add cluster information to this data to allow determination of where calls originated and completed. Without this context, it could be difficult to determine whether an observed high latency indicates an issue (such as a failure or load problem) or is simply the result of making a long-distance call to another cluster.
As a baseline, and before looking at the new 1.1 release features, let’s take a look at the 1.0 support.
Istio 1.0 multicluster support uses a single mesh design. It allows multiple clusters to be joined into the mesh under the caveat that all clusters are on one shared network. That is, IP addresses for all pods and services in all clusters are directly routable and do not conflict—IP addresses assigned in one cluster will not be concurrently reused in another.
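The non-conflicting address requirement can be checked before joining clusters. The following is a minimal sketch, not part of Istio: the function name `find_overlaps` and the CIDR values are hypothetical, and it assumes you already know each cluster's pod and service CIDRs.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs_by_cluster):
    """Return pairs of (cluster, cidr) whose ranges overlap across clusters."""
    entries = [
        (cluster, ipaddress.ip_network(cidr))
        for cluster, cidrs in cidrs_by_cluster.items()
        for cidr in cidrs
    ]
    overlaps = []
    for (c1, n1), (c2, n2) in combinations(entries, 2):
        # Only cross-cluster conflicts matter for the flat-network assumption.
        if c1 != c2 and n1.overlaps(n2):
            overlaps.append(((c1, str(n1)), (c2, str(n2))))
    return overlaps


# Hypothetical pod and service CIDRs for two clusters.
clusters = {
    "cluster-1": ["10.0.0.0/16", "10.1.0.0/20"],
    "cluster-2": ["10.2.0.0/16", "10.1.8.0/24"],
}

for left, right in find_overlaps(clusters):
    print(f"conflict: {left} overlaps {right}")
```

An empty result means the listed ranges can coexist on one shared network; it says nothing about whether routes between the clusters actually exist.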
This provides basic support and works “out of the box” in cases that satisfy the assumption (or limitation). Given sufficient resolve and configuration, 1.0 can be extended to support multiple networks (e.g., by adding VPNs and NATs) and a mesh federation design (e.g., by manually adding the relevant services and service entries in clusters).
Note that in order to enable name resolution and identities, one must ensure that namespaces, services, and service accounts are defined identically in all clusters. As previously mentioned, this could be automated to ensure conformance.
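As an illustration of the kind of conformance automation mentioned above, here is a small sketch that reports resources defined somewhere in the mesh but missing from a given cluster. The inventory format and the name `missing_resources` are assumptions made for this example; a real check would read the inventories from each cluster's API.

```python
def missing_resources(clusters):
    """Map each cluster to the resources defined elsewhere but absent locally."""
    all_resources = set().union(*clusters.values())
    report = {}
    for name, resources in clusters.items():
        missing = all_resources - resources
        if missing:
            report[name] = sorted(missing)
    return report


# Hypothetical inventories: (kind, name) pairs per cluster.
inventory = {
    "cluster-1": {
        ("Namespace", "shop"),
        ("Service", "shop/login"),
        ("ServiceAccount", "shop/login-sa"),
    },
    "cluster-2": {
        ("Namespace", "shop"),
        ("Service", "shop/login"),
    },
}

print(missing_resources(inventory))
```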
The following diagram shows the call sequence using the multicluster support in 1.0:
In this architecture:
- Cluster 1 runs the Istio control plane. It is often called the “local” cluster, with all other clusters referred to as “remote” clusters. You can substitute “hub,” “master,” or “control plane” cluster for “local” if it makes things clearer.
- Other clusters, such as Cluster 2, have a smaller Istio footprint and run Citadel and the admission controller for auto-injection in the control plane and sidecar proxies for workloads in the data plane.
- Pilot has access to all Kubernetes API masters, in all clusters, so it has a global mesh view. Citadel and auto-injection operate with cluster local scope.
- Each cluster has a unique Pod and Service CIDR, but other than that, there is a shared “flat” network between clusters. This allows direct routes to any workload, including to the Istio control plane (e.g., remote Envoys need to get configuration from Pilot, check and report to Mixer, etc.).
To better support multicluster and multi-network scenarios, Istio release 1.1 introduces the concepts and implementation of Split Horizon EDS and SNI aware routing. EDS is the Endpoint Discovery Service, a part of Envoy’s API. It runs in the Pilot component and is used to configure the Envoy data plane with service and endpoint information. With Split Horizon EDS, Pilot will return endpoints relevant to the cluster where the connected sidecar runs.
SNI aware routing uses the TLS Server Name Indication extension to indicate and determine the connection’s target. It allows Istio Gateways’ Envoy to intercept and parse the TLS handshake and use the SNI data to make a decision about the service endpoints to connect to.
Cluster-aware (Split Horizon EDS)
To provide a cluster or network context to Istio, each cluster has a “network” label associated with it. Typically, we would use a different label value for each cluster, but this can be tweaked if you know that multiple clusters are part of the same logical network (e.g., directly routable, low latency).
In addition, each cluster has an associated ingress gateway. Since it is used only for inter-cluster communication, ideally the ingress gateway is separate from the cluster ingress and not exposed to end users. The ingress gateway shares the same network label value as other workloads in the same cluster. This is used to associate the in-cluster service endpoints with the ingress gateway.
Pilot collects the list of services and their endpoints along with the network label of each. Currently, this is done directly from the Kubernetes API masters by Pilot but will soon be replaced with a flow where Pilot receives information only from Galley. Endpoints under the same service name are assumed to be part of the same service. That is, endpoints of service “login” in one cluster are identical and directly interchangeable with endpoints of any and all other “login” services in other clusters.
Sidecar proxies provide their own network label when connecting to Pilot and receive an endpoint set that contains IP addresses for all local instances and gateway IP addresses for instances in remote clusters. Since all belong to the same service (or upstream), Envoy can load balance requests between local and remote endpoints. To facilitate some form of load distribution, the weight assigned to each gateway endpoint is skewed to reflect the number of instances that the gateway “front-ends.”
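The endpoint selection described above can be sketched as follows. This is a simplified model of the idea, not Pilot's implementation: the function name, the tuple layout, and all addresses are hypothetical, and each local instance is simply given a weight of 1.

```python
from collections import Counter


def split_horizon_endpoints(endpoints, gateways, proxy_network):
    """Return (address, weight) pairs as seen by a sidecar on proxy_network.

    endpoints: list of (ip, network) tuples for one service, across clusters.
    gateways:  dict mapping network label -> ingress gateway address.
    """
    view = []
    remote_counts = Counter()
    for ip, network in endpoints:
        if network == proxy_network:
            view.append((ip, 1))         # local instance, reached directly
        else:
            remote_counts[network] += 1  # remote instance, hidden behind a gateway
    # One gateway endpoint per remote network, weighted by the number of
    # instances it front-ends.
    for network, count in sorted(remote_counts.items()):
        view.append((gateways[network], count))
    return view


# Hypothetical "login" service; note the pod IPs may overlap across networks.
login = [
    ("10.0.0.5", "network-1"),
    ("10.0.0.6", "network-1"),
    ("10.0.0.5", "network-2"),
    ("10.0.0.7", "network-2"),
    ("10.0.0.8", "network-2"),
]
gateways = {"network-1": "192.0.2.10", "network-2": "192.0.2.20"}

print(split_horizon_endpoints(login, gateways, "network-1"))
```

A sidecar on `network-1` sees its two local pods plus one gateway entry carrying weight 3, so remote pod addresses never need to be routable from the caller.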
The design benefits from having the list of clusters and remotes centralized in one location, giving a sort of mesh-wide view into participating clusters.
Due to the use of SNI-based routing, routing control is simplified and configuration minimized. Since the SNI field is used to propagate arbitrary information, all of the existing Istio routing functions, including for example subsets and percentage based routing, could be made to work.
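To make the idea of carrying routing information in the SNI field concrete, here is a small sketch. It assumes an encoding of the form `outbound_.<port>_.<subset>_.<hostname>`; treat that exact format, and the helper names, as assumptions for illustration rather than a specification of what Istio emits.

```python
SEPARATOR = "_."


def encode_sni(hostname, port, subset=""):
    """Pack direction, port, subset and hostname into a single SNI string."""
    return SEPARATOR.join(["outbound", str(port), subset, hostname])


def decode_sni(sni):
    """Recover the routing fields a gateway would need from the SNI string."""
    direction, port, subset, hostname = sni.split(SEPARATOR, 3)
    return {
        "direction": direction,
        "port": int(port),
        "subset": subset,
        "hostname": hostname,
    }


# Hypothetical service and subset names.
sni = encode_sni("reviews.default.svc.cluster.local", 9080, "v1")
print(sni)
print(decode_sni(sni))
```

Because the gateway only reads the SNI value from the TLS handshake, it can pick a destination without terminating the mutual TLS session between the two sidecars.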
From a management perspective, the mesh functions as a single logical domain: all services are exposed to all clusters (and Istio RBAC policies could be used to limit access if desired), instances from different clusters are assumed identical if they share a service name, “control plane” cluster access to remote API masters is enabled, etc.
In this architecture:
- Only gateways need to be routable from clusters, and internal network CIDRs are not exposed.
- Pass-through mTLS (for SNI routing) via gateways.
- Root CA configuration needs to be managed by the user.
Gateway connectivity is another feature introduced in Istio 1.1. Like Split Horizon EDS, it uses gateways and SNI for inter-cluster connectivity and communications. However, it is more aligned with a mesh federation pattern, where each cluster could be managed separately and independently and services in the local cluster and remote clusters are not merged.
Gateway connectivity relies on DNS resolution to allow services to resolve to local or remote instances. The pod’s DNS resolution is modified to add a “.global” search suffix as a fall-back option, in addition to the default “cluster.local” and “<namespace>.cluster.local” suffixes used by Kubernetes.
Remote services are configured using a service entry that includes a “.global” service name, the destination gateway of a remote cluster, and an unallocated host-local (loopback) address (i.e., an address in 127.0.0.0/8 other than 127.0.0.1). These addresses are not routable outside the pod and are only used to allow name resolution to complete and the client to create a socket connection that can be intercepted by Envoy. The administrator is responsible for assigning a unique host-local address to each remote service.
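The bookkeeping this implies can be sketched in a few lines. The helper names, the `127.255.0.0/24` pool, the gateway address, and the gateway port `15443` are all assumptions for illustration; the dictionary mirrors the general shape of a ServiceEntry resource and should be checked against the Istio reference before use.

```python
import ipaddress


def allocate_loopback_addresses(service_names, pool="127.255.0.0/24"):
    """Assign each remote service a distinct address from a loopback sub-range."""
    network = ipaddress.ip_network(pool)
    if not network.subnet_of(ipaddress.ip_network("127.0.0.0/8")):
        raise ValueError("pool must lie inside 127.0.0.0/8")
    # Skip 127.0.0.1, which is already in use as localhost.
    hosts = (str(h) for h in network.hosts() if str(h) != "127.0.0.1")
    allocation = {}
    for name in sorted(set(service_names)):
        try:
            allocation[name] = next(hosts)
        except StopIteration:
            raise ValueError("loopback pool exhausted") from None
    return allocation


def service_entry(host, local_address, gateway_address, service_port,
                  gateway_port=15443):
    """Build a ServiceEntry-shaped dict for one remote '.global' service."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "ServiceEntry",
        "metadata": {"name": host.replace(".", "-")},
        "spec": {
            "hosts": [host],
            "location": "MESH_INTERNAL",
            "ports": [{"name": "http1", "number": service_port,
                       "protocol": "http"}],
            "resolution": "DNS",
            "addresses": [local_address],   # unique, non-routable address
            "endpoints": [{"address": gateway_address,
                           "ports": {"http1": gateway_port}}],
        },
    }


addresses = allocate_loopback_addresses(["foo.ns.global", "bar.ns.global"])
entry = service_entry("foo.ns.global", addresses["foo.ns.global"],
                      "192.0.2.20", 8000)
print(entry["spec"]["addresses"], entry["spec"]["endpoints"])
```

Generating the entries from one allocation table keeps the addresses unique and makes it easier to apply the same definitions to every cluster.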
When a workload attempts to resolve the service name “foo,” the DNS client assumes it is not a fully qualified domain name (the client is configured to treat any name with fewer than two dots as incomplete, and this is also a DNS configuration added by Kubernetes) and starts iterating through the list of suffixes. That is, “foo.ns.cluster.local,” “foo.cluster.local,” etc. If the target service name is local, it would resolve to the in-cluster service VIP. Otherwise, it will continue processing through the suffix list, ultimately trying to resolve the “foo.ns.global” name. The cluster’s DNS server is configured to forward the “.global” zone to an Istio CoreDNS server, which is also deployed in the cluster. An extension in that server uses the service entry definition to return a host-local address to the client, which uses it to create a new connection that is intercepted and SNI-routed by Envoy.
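The suffix iteration can be modeled with a short sketch. It is a simplification of a stub resolver's search-list behavior, not a DNS client: the function names, the record table, and the addresses are hypothetical, and the suffix list is chosen only so that the candidates match the names used in the paragraph above.

```python
def candidate_names(name, search_suffixes, ndots=2):
    """List the names a resolver would try, in order, for a query."""
    if name.endswith("."):
        return [name.rstrip(".")]            # already fully qualified
    expanded = [f"{name}.{suffix}" for suffix in search_suffixes]
    if name.count(".") >= ndots:
        return [name] + expanded             # enough dots: try as-is first
    return expanded + [name]                 # too few dots: suffixes first


def resolve(name, search_suffixes, records, ndots=2):
    """Return (matched_name, address) for the first candidate with a record."""
    for candidate in candidate_names(name, search_suffixes, ndots):
        if candidate in records:
            return candidate, records[candidate]
    return None


# Illustrative suffix list for a pod in namespace "ns".
suffixes = ["ns.cluster.local", "cluster.local", "ns.global"]
# Hypothetical records: a local service VIP and a remote ".global" entry.
records = {
    "bar.ns.cluster.local": "10.96.0.15",
    "foo.ns.global": "127.255.0.2",
}

print(resolve("bar", suffixes, records))
print(resolve("foo", suffixes, records))
```

The local service is answered by the first suffix, while the remote one falls through to the “.global” name and receives the host-local address from its service entry.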
Note that all these cluster DNS configurations (e.g., zone forwarding, suffix list additions, etc.) typically require admin privileges.
While the scope of administrative sharing is lower in this use case (e.g., service endpoints are not merged), we’re still assuming some level of shared cluster management. Service entries for remote services must be defined in every cluster, and any change (such as a gateway IP change) must be synchronized across all relevant clusters. This would obviously benefit from automation and does not scale, manually, to more than a handful of clusters.
Also note that we partially trade access control for ease of configuration: knowing the remote gateway IP and how Istio encodes SNI names is sufficient to pass through gateways to any remote cluster service. This implies that gateway IP addresses should not be routable from outside and that RBAC policies should be used, where needed, to keep unintended consumers from accessing services.
In this architecture:
- Shared root CA, same as in previous cases.
- Pod and service CIDRs may overlap, but gateways should be exposed to remote clusters.
- DNS resolution.
- mTLS via gateway to remote service.
In general, we expect all Istio features, such as routing, security policy, and metric collection, to continue working as is. Ignoring bugs, they should all work out of the box. Configuration might be a little more complicated than in the single cluster case, but hopefully not by much.
The designs focus on solving networking and connectivity first and foremost. Other multicluster concerns, such as providing both local and global observability functions, are out of scope and should be resolved by users based on their configuration and needs.
Even on the networking side, there is work left to do. Some items are minor, while others are larger areas for improvement. Just to cite two examples:
- When using SNI routing and pass-through in gateways, the gateways can only provide TCP-level metrics and would not be present in a distributed trace (i.e., the spans would show A calling B, not A calling B via ingress-gateway).
- Cross-cluster load balancing is still basic, using weights based on instances and not accounting for network aspects such as latency or bandwidth limits to remote clusters.
While improving over time, usability is still somewhat of a concern. Documentation can be somewhat scant, and automation is needed to improve the operator’s life. As noted, both designs require configuration across multiple clusters. This could be cumbersome and error-prone over time as new clusters and services are added or removed. In addition, installation and setup steps could be streamlined to create a better end-user experience.
In closing, we’d like to provide a possible view to when single mesh and mesh federation patterns make sense.
Single-mesh scenarios seem to be more aligned to use cases where clusters are identically configured, sharing namespaces, services, service accounts, etc. Typically, clusters are treated by the organization in this case as the “compute” infrastructure—the new IaaS. Teams may deploy their applications to any cluster to match their reliability, availability, or locality requirements since mesh can also be used to distribute service instances between clusters. However, to allow efficient cross-cluster calls, you would want to ensure that clusters in the same mesh are relatively close to each other (in terms of latency).
Mesh federation scenarios seem to be better aligned with a usage model where clusters are used for isolation. Teams would typically be assigned their own clusters and those would be managed independently. There are no assumptions on namespace or service name similarity between clusters belonging to different teams. We envision the multi-mesh model to be useful in cases where you want to selectively expose some services from a cluster to remote workloads while keeping other services private to the cluster. There is no automatic merging of service endpoints, so this may also be useful in cases where cross-cluster calls might be more expensive, latency-wise.
Lastly, nothing prevents you from mixing the two patterns in your deployment. For example, you could use single mesh for clusters in the same Availability Zone (AZ) to provide High Availability (HA) of service endpoints and mesh federation to enable controlled access between different geographical regions or between staging and production environments.
You are more than welcome to leave comments so that we have a better understanding of the community’s multicluster use cases, requirements, and prioritization.
In addition, please consider spending a few minutes to answer the multicluster questionnaire at https://goo.gl/forms/GFMQ6AL0tQFbGCYx1.