Modern Network Storage Options for VMware
Achim-Christ
VMware vSphere continues to be the preferred virtualization platform for business-critical applications in many enterprises. Many of these environments rely on live migration of virtual machines from one physical server to another. VMware's vMotion technology moves workloads between ESX hypervisors without service interruption, enabling enterprises to leverage the full potential of virtualized infrastructure.
Naturally, each hypervisor needs access to a set of shared storage resources for this live migration to work. VMware supports Fibre Channel or iSCSI Storage Area Networks (SAN) as well as Network Attached Storage (NAS) for this purpose.
Spanning virtualized infrastructure across multiple failure domains (sites) leads to highly available 'stretched-cluster' configurations. Such an architecture allows planned and unplanned hardware outages to be handled flexibly, but at the same time imposes very specific requirements on the underlying storage infrastructure. Availability of the shared storage becomes the essential factor in determining the overall resiliency of the solution.
A Flexible Storage Fabric
IBM Spectrum Scale, when added to this scenario, allows for building a flexible storage fabric to support such a configuration. High availability, mirroring, and the ability to dynamically add, remove and migrate backend storage resources are built in. Add to this data management features such as tiering and compression, and one ends up with interesting deployment options which make for a future-proof storage layer for virtualized infrastructure.
Spectrum Scale is a robust solution for mirroring data between failure domains, and provides flexibility in how this is done. Replication can be based on Fibre Channel or Ethernet technology, depending on infrastructure availability. If the underlying network layer supports it, Spectrum Scale is able to utilize RDMA for the lowest possible latency and overhead (e.g. with RoCE, InfiniBand or Intel Omni-Path).
Furthermore, Spectrum Scale offers a choice of integration options for ESX hypervisors by providing native NFS or iSCSI services. There are two options available for NFS. One can use the native Linux kernel NFS server, which is complemented by Spectrum Scale cluster functions (also known as clustered NFS or cNFS). Alternatively, Spectrum Scale tightly integrates NFS Ganesha into its management and monitoring framework which, together with iSCSI, is known as Spectrum Scale Cluster Export Services (CES).
NFS v3 exports or iSCSI volumes provide the foundation for highly available, shared datastores for ESX hypervisors. This allows for integration of Spectrum Scale without having to install any additional software inside the virtual machines. A limitation of this architecture is that functions such as tiering do not operate on individual files inside the virtual machine's file system, because these are held inside larger VMDK files. Only the VMDK files are visible to Spectrum Scale. For as long as the virtual machines are running these files will be 'hot', which effectively prevents the use of tiering to offload less frequently accessed files to cost-efficient storage.
If finer granularity is required then virtual machines can be configured as Spectrum Scale clients. This enables the use of data management functions on a per-file basis, but requires the installation of Spectrum Scale (GPFS) software inside each virtual machine. Examples of these value-add functions are tiering (ILM and HSM), selective replication, compression and encryption.
Spectrum Scale itself is able to tolerate network latency to a large degree. As long as all nodes are able to renew their lease at regular intervals (which can be adjusted) the cluster remains operational. How much network latency is acceptable in practice therefore depends mostly on the application requirements.
An important aspect when running, for example, Microsoft Windows VMs is that any IP failover operation needs to complete within 30 seconds. That is because the SCSI disk timeout on the guest operating system defaults to 30 seconds. If this timeout is exceeded (e.g. during IP failover) then VMs might encounter I/O errors on their virtual SCSI disks.
To prevent this from happening, Spectrum Scale configuration can be adjusted so that node failures are detected and recovered quickly. With default settings all nodes renew their lease every 35 seconds. If a Spectrum Scale node fails immediately after it has renewed its lease then it may take as long as this duration for the remaining nodes to even notice the failure.
Subsequently, the Spectrum Scale cluster manager will attempt to contact (ping) the failing node prior to initiating any recovery actions. The length of this attempt also defaults to 35 seconds. Hence it can take more than a minute (up to 70 seconds) for Spectrum Scale to recover from node failures. This is too long for such failures to go unnoticed by the aforementioned guest operating systems.
To fix this, Spectrum Scale timeout values can be reduced like so:
With such a configuration, any node recovery action is initiated no more than 20 seconds after the node has failed. This ensures that guests do not encounter I/O errors on their virtual SCSI disks during Spectrum Scale node recovery.
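The timing argument above comes down to simple arithmetic, which the following Python sketch makes explicit. It is illustrative only: the function names are hypothetical, and the 10 s / 10 s split is merely one assumed combination that matches the 20-second figure above, not the article's actual configuration values.

```python
# Minimal sketch of the failover timing budget described above.
# Assumption: the worst case is one full lease interval (the failure goes
# unnoticed) plus the full ping attempt by the cluster manager.

GUEST_SCSI_TIMEOUT_S = 30  # default guest disk timeout cited in the article


def worst_case_recovery_start(lease_interval_s: int, ping_timeout_s: int) -> int:
    """Upper bound (seconds) between node failure and the start of recovery."""
    return lease_interval_s + ping_timeout_s


def fits_guest_timeout(lease_interval_s: int, ping_timeout_s: int,
                       guest_timeout_s: int = GUEST_SCSI_TIMEOUT_S) -> bool:
    """True if recovery starts before the guest's SCSI timeout expires."""
    return worst_case_recovery_start(lease_interval_s, ping_timeout_s) < guest_timeout_s


if __name__ == "__main__":
    # Defaults from the article: 35 s lease interval, 35 s ping attempt.
    print("defaults:", worst_case_recovery_start(35, 35), "s,",
          "ok" if fits_guest_timeout(35, 35) else "exceeds guest timeout")
    # Hypothetical tuned values (assumed split of the 20 s total).
    print("tuned:   ", worst_case_recovery_start(10, 10), "s,",
          "ok" if fits_guest_timeout(10, 10) else "exceeds guest timeout")
```

Any pair of values whose sum stays below the guest timeout satisfies the same check. How much headroom to leave for the recovery itself is a design decision that depends on the environment.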
Read Replica Policy
Another important design aspect is I/O read preference of the 'stretched-cluster' configuration. If Spectrum Scale spans multiple failure domains then one typically relies on synchronous data replication within GPFS file systems: each data block is written twice to disks in different failure groups. When later reading this data Spectrum Scale now has two copies of each block to choose from.
The read preference, determined by the
If, on the other hand, specific requirements make it necessary to prevent read requests from being sent over the WAN altogether (e.g. because bandwidth is limited and/or incurs cost) then the configuration can be altered like so:
With this configuration Spectrum Scale solely reads from the local copy. Depending on the network topology it may be necessary to define separate IP subnets for each failure domain in order for Spectrum Scale to be able to determine which nodes are local.
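To illustrate why per-site IP subnets help in determining locality, here is a small Python sketch of subnet-based replica selection. It is a conceptual model under stated assumptions, not Spectrum Scale's actual implementation: the function names, the example addresses, and the fallback to the first replica when no local copy exists are all hypothetical.

```python
import ipaddress

# Hypothetical example: one subnet per failure domain (site).
SITE_SUBNETS = ["10.1.0.0/24", "10.2.0.0/24"]


def is_local(client_ip: str, server_ip: str, subnets: list[str]) -> bool:
    """True if client and server both fall into the same configured subnet."""
    client = ipaddress.ip_address(client_ip)
    server = ipaddress.ip_address(server_ip)
    for net in subnets:
        network = ipaddress.ip_network(net)
        if client in network and server in network:
            return True
    return False


def pick_replica(client_ip: str, replica_servers: list[str],
                 subnets: list[str]) -> str:
    """Prefer the replica served from the client's own subnet.

    Assumption: if no replica is local, fall back to the first one.
    """
    for server_ip in replica_servers:
        if is_local(client_ip, server_ip, subnets):
            return server_ip
    return replica_servers[0]


if __name__ == "__main__":
    replicas = ["10.1.0.11", "10.2.0.11"]  # one copy of the block per site
    print(pick_replica("10.2.0.50", replicas, SITE_SUBNETS))
```

If both sites shared a single flat subnet, `is_local` would return true for every server and the choice would no longer reflect the site boundary, which is why separate subnets per failure domain may be needed.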
Note that the above settings only affect I/O read operations. Data is always written to both failure domains in parallel, according to the concept of a synchronous mirror.
The above architecture is typically implemented with dedicated physical infrastructure for Spectrum Scale: one, two, or four physical servers in each site, depending on high availability and performance requirements. For smaller environments, however, it is also possible to implement the Spectrum Scale cluster based on virtual machines — VMware is a fully supported server platform.
Spectrum Scale servers require raw device mapping in such a configuration, meaning that backend LUNs are directly mapped into the Spectrum Scale VMs. As in the previous example, Spectrum Scale then provides the storage for one or multiple datastores via NFS v3 or iSCSI. Spectrum Scale VMs do not support vMotion, but that's no drawback at all…
With a dedicated Spectrum Scale VM in each hypervisor, servicing the local datastore, high availability of the individual VM does not play a vital role in this scenario. The Spectrum Scale VM is offline only if the hypervisor fails, and in this case the storage backend for the hypervisor's datastore is no longer required. Thus, a single VM is sufficient in each hypervisor.
To cite VMware, Network Attached Storage has matured significantly in recent years and offers a solid foundation, in terms of both availability and performance, for deployment with virtualization environments. IBM Spectrum Scale makes for a powerful and flexible storage fabric to fuel next-generation virtualized infrastructure. Plenty of options allow administrators to customize the solution for almost any given scenario.
The configuration parameters mentioned in this article, and many more, are documented in the Spectrum Scale Wiki: