IBM Platform Resource Scheduler for Real-World Applications
Khalid Ahmed
Platform Resource Scheduler (PRS) adds powerful scheduling and resource optimization capabilities to OpenStack environments. It draws on IBM's experience running large-scale datacenters for high-performance computing, analytics, and big data, and applies that experience to new environments. In this blog we give a flavor of how PRS can work together with OpenStack in the context of an online e-commerce store. We will explore how OpenStack with PRS features enabled can optimize the infrastructure while helping to meet an application's service-level objectives.
Online retailers are faced with the challenge of rapidly developing applications that deliver personalized shopping experiences across a variety of devices, including web, mobile, tablets and, in the future, connected cars. We examine how such a retailer might develop applications for the delivery of physical or virtual goods. This structure would be typical of any retail environment that uses the internet to offer its users an online product catalog that they can browse and search before ultimately making a purchase decision. By leveraging users' historical purchase patterns and their contacts gleaned from social networks like Facebook or Twitter, the system can recommend products they might be interested in. The following provides a high-level view of the application architecture. It is meant to be representative, and while technologies are called out to provide greater context, it does not represent a specific customer choice.
To deliver new releases of the applications in an agile fashion, the various development, QA, integration, and operations teams will require infrastructure to support their own environments. Because of requirements around security and integration with existing in-house services, the online retailer chooses to implement an on-premises cloud leveraging OpenStack technology, based on IBM’s production-ready SmartCloud Entry or SmartCloud Orchestrator products.
OpenStack and PRS Setup
IBM SmartCloud Entry and IBM SmartCloud Orchestrator have a variety of features that help integrate and automate the setup and configuration of VMs, middleware, and applications, and that integrate them with the business processes of the online retailer. Both build on OpenStack, tested and packaged for enterprise consumption. In this blog we explore how PRS works with the OpenStack native Nova scheduler to optimize the environment.
While OpenStack has a default scheduler that handles the placement of VMs, it lacks several key capabilities. First, it makes decisions based only on static information in Nova’s SQL database, which is incomplete. Second, it schedules resources only once, during initial placement. There is no ability to optimize the system at run-time by re-placing VMs as the usage of the environment changes and evolves. OpenStack is flexible enough to allow third-party schedulers to fit into the framework to provide enhanced functionality. OpenStack exposes scheduler hints that can be passed at VM creation time, either through configuration of the flavor or through the nova boot command. PRS fits into the OpenStack environment through these mechanisms to provide enhanced value.
PRS enhances OpenStack with superior abilities to place VMs at initial deployment, and re-place VMs on new servers, as conditions change during run time. PRS can support global placement policies across an entire cloud/cluster, as well as on individual appl
The following nova commands are used to create the aggregates:
nova aggregate-create web nova
nova aggregate-create service nova
nova aggregate-create data nova
nova aggregate-create management nova
nova aggregate-create devtest nova
Host aggregates can be tagged with properties and hosts can be added to the aggregate as follows:
nova aggregate-add-host 3 node1
nova aggregate-add-host 3 node2
Targeting Nodes with Nova Scheduler Hints
When creating VMs for the different tiers, the PRS parameters controlling the placement can be specified as attributes of the flavor or in individual nova boot commands using scheduler hints. These hints are passed to PRS and used to select appropriate nodes on which to place VMs. For example, the following will create a single VM for, say, MySQL on the data host aggregate:
nova boot --image 70ca
Note that any attributes specified at the host aggregate level such as ‘ssd’ in the above example can be used to refine the target nodes. The ‘query’ expression specifies a JSON string that can combine various metrics with logical ‘AND’ or ‘OR’ expressions. This matches what can be specified via the Nova JSON-filter. All the OpenStack metrics supported by Nova can be used in building the resource requirements string as well as those added by PRS such as ‘memFree’ which specifies the amount of unallocated memory. It is also possible to directly specify the host aggregate id in the scheduler hint as follows:
nova boot --image cede
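To make the ‘query’ expression more concrete, here is a minimal Python sketch of how a list-style JSON expression combining metrics with ‘AND’ and ‘OR’ could be evaluated against candidate hosts. The list syntax follows the style used by Nova's JSON filter. The host values, the 4096 threshold, the assumption that ‘memFree’ is in MB, and the evaluator itself are illustrative assumptions, not PRS internals.

```python
import json
import operator

# Comparison operators used in list-style query expressions.
COMPARATORS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def resolve(term, host):
    """Replace a '$name' term with the host's value; pass literals through."""
    if isinstance(term, str) and term.startswith("$"):
        return host[term[1:]]
    return term


def matches(expr, host):
    """Return True if the host satisfies the parsed query expression."""
    op, *args = expr
    if op == "and":
        return all(matches(arg, host) for arg in args)
    if op == "or":
        return any(matches(arg, host) for arg in args)
    if op == "not":
        return not matches(args[0], host)
    left, right = (resolve(arg, host) for arg in args)
    return COMPARATORS[op](left, right)


# Hypothetical hosts: 'memFree' is assumed to be in MB, 'ssd' mirrors an aggregate tag.
hosts = {
    "node1": {"memFree": 8192, "ssd": "true"},
    "node2": {"memFree": 1024, "ssd": "true"},
    "node3": {"memFree": 16384, "ssd": "false"},
}

query = json.loads('["and", [">=", "$memFree", 4096], ["=", "$ssd", "true"]]')
eligible = [name for name, host in hosts.items() if matches(query, host)]
print(eligible)  # ['node1']
```

Only node1 has both enough free memory and the ‘ssd’ tag, so it is the only host left as a placement candidate.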
Affinity and Anti-Affinity
For a NoSQL environment like Hadoop/HBase, it would be more performant to stripe VMs across a set of nodes in order to take advantage of the full I/O bandwidth. This can be achieved using the anti-affinity feature of PRS, which allows the VMs for an individual request to be striped. For example, the following could be used to create a set of 4 VMs that are spread across 4 different nodes:
nova boot --image cede
Alternatively, if a particular service tier component works best when its VMs are placed on as few physical hosts as possible, for example to take advantage of local caches, this can be achieved with the ‘same_host’ scheduler hint:
nova boot --image cede
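The difference between the two placement styles can be sketched in a few lines of Python. This is only an illustration of the idea: the function names, host names, and free-slot counts are hypothetical, and the real placement logic inside PRS is not shown in this post.

```python
def place_anti_affinity(vm_names, hosts):
    """Assign each VM to a different host; fail if there are too few hosts."""
    if len(vm_names) > len(hosts):
        raise ValueError("not enough hosts to keep every VM on its own node")
    return dict(zip(vm_names, hosts))


def place_same_host(vm_names, free_slots):
    """Fill the hosts with the most free slots first, so the group uses few hosts."""
    placement = {}
    remaining = list(vm_names)
    for host, slots in sorted(free_slots.items(), key=lambda kv: -kv[1]):
        while slots > 0 and remaining:
            placement[remaining.pop(0)] = host
            slots -= 1
    if remaining:
        raise ValueError("not enough capacity for the whole group")
    return placement


# Hypothetical request: 4 HBase VMs striped across 4 nodes.
spread = place_anti_affinity(
    ["hbase-1", "hbase-2", "hbase-3", "hbase-4"],
    ["node1", "node2", "node3", "node4"],
)

# Hypothetical request: 3 cache VMs kept together on one node.
packed = place_same_host(
    ["cache-1", "cache-2", "cache-3"],
    {"node1": 2, "node2": 4},
)

print(spread)
print(packed)
```

In the first case every VM lands on its own node. In the second, all three VMs land on node2 because it alone has room for the whole group.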
Resource Optimization Policies
While policies specified at the VM level can control how individual VMs or collections of VMs are scheduled onto hypervisors, PRS also allows IT administrators to set cluster-wide or host-aggregate-specific policies that define service level goals for the system to enforce at run-time. These can be leveraged by IT administrators managing online applications such as the one we are
Load-Balancing Service Tier
The service tier in an online retail application is likely to experience highly variable demand as different types of users coming in through mobile, social or web channels call out to different services. One value that PRS can deliver to the service tier is to continuously monitor the load on the hypervisors and make decisions to migrate VMs at run-time to smooth out imbalances. This ensures more effective utilization of the hypervisor hardware while giving services which may be experiencing more activity a greater percentage of the resources. This can be achieved in PRS using the load-balancing policy. The following example gives a policy XML fragment that shows the parameters of the load-balancing policy at the global level:
<description> CPU Load Balance Policy </description>
<state> enabled </state>
The ‘resreq’ parameter orders hypervisors so that the least loaded are the best candidates on which to start new VMs and to serve as destinations for migrating VMs. That is, the best targets for migration are the hypervisors with the lowest average CPU allocation, and the best sources are heavily loaded hypervisors with an average CPU utilization greater than 70%. Therefore, when certain hypervisors in the service tier get too ‘hot’, some of their VMs will be migrated to hypervisors that are ‘cold’.
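The hot-to-cold pairing described here can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name and utilization figures are hypothetical, only the 70% threshold comes from the text, and the real policy decides which VMs to move, which is not modeled.

```python
def plan_migrations(cpu_util, threshold=70.0):
    """Pair each 'hot' hypervisor with a 'cold' one, hottest to coldest."""
    # Sources: hypervisors above the threshold, most loaded first.
    hot = sorted(
        (h for h, u in cpu_util.items() if u > threshold),
        key=lambda h: -cpu_util[h],
    )
    # Targets: hypervisors at or below the threshold, least loaded first.
    cold = sorted(
        (h for h, u in cpu_util.items() if u <= threshold),
        key=lambda h: cpu_util[h],
    )
    return list(zip(hot, cold))


# Hypothetical average CPU utilization (percent) per hypervisor.
utilization = {"node1": 92.0, "node2": 35.0, "node3": 78.0, "node4": 12.0}

plan = plan_migrations(utilization)
print(plan)  # [('node1', 'node4'), ('node3', 'node2')]
```

The most heavily loaded hypervisor is matched with the least loaded one, which mirrors the ordering that the ‘resreq’ parameter is described as producing.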
To set this policy at the host aggregate level, we add meta-data to specify the policy parameters for a given host aggregate. PRS will read these parameters from the Nova database and enforce them. If the service tier host-aggregate has an ID of 2, and the policy-id of 1 references the global load-balancing policy, then the following command will set the policy parameters for the host aggregate:
# nova aggr
It is possible to update just the threshold after it has been enabled:
# nova aggr
Packing VMs in Management Tier
The nodes used for providing management tools and utilities for the application environments can be shared, since the management VMs are not likely to experience high load volumes. In this case, it makes sense to pack as many VMs as possible onto each hypervisor within the host aggregate. The policy parameters for the packing policy are controlled by this XML file for the global policy:
<state> enabled </state>
To apply this to the host aggregate for the management tier use the following:
# nova aggr
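The intent of packing can be illustrated with a classic first-fit-decreasing heuristic in Python. This is a swapped-in textbook technique used only to show the idea. The post does not say which algorithm PRS uses, and the VM names, memory sizes, and host capacity below are hypothetical.

```python
def pack_vms(vm_mem_gb, host_capacity_gb):
    """First-fit decreasing: place the largest VMs first on the first host with room."""
    hosts = []  # each entry is the list of VM names on one hypervisor
    free = []   # remaining memory on each hypervisor, in GB
    for name, mem in sorted(vm_mem_gb.items(), key=lambda kv: -kv[1]):
        if mem > host_capacity_gb:
            raise ValueError(f"{name} does not fit on any host")
        for i, room in enumerate(free):
            if mem <= room:
                hosts[i].append(name)
                free[i] -= mem
                break
        else:
            # No existing hypervisor has room, so start using another one.
            hosts.append([name])
            free.append(host_capacity_gb - mem)
    return hosts


# Hypothetical management VMs and their memory requirements in GB.
management_vms = {"monitor": 4, "logging": 8, "ci": 6, "wiki": 2, "dns": 2}

packed_hosts = pack_vms(management_vms, host_capacity_gb=12)
print(packed_hosts)  # [['logging', 'monitor'], ['ci', 'wiki', 'dns']]
```

Five VMs fit on two hypervisors, leaving any remaining nodes in the aggregate free.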
Resource Over-commitment For Dev/Test Environments
Dev/Test environments are frequently under-utilized. Developers will spin up VMs and not fully use them or, in the worst case, forget about them, leaving them idle and logging on only occasionally. By default, VMs are placed on hypervisors based on the available physical resources (i.e., CPU, memory, disk), and when those resources are consumed, additional VMs cannot be placed on the hypervisor. Over-commitment specifies which resources can be over-subscribed and by what ratio. It is based on the observation that a VM usually does not consume all the resources it requests, which leaves spare resources that can be used by other VMs.
The following example shows the configuration that is necessary to set CPU, memory, and disk over-commitment ratios for the devtest host aggregate (ID 5).
openstack-config --set /etc/nova/nova.conf DEFAULT sche
In the above example, with a CPU allocation ratio of 8, Platform Resource Scheduler allocates up to eight virtual cores per physical core on a node. If the physical node has 12 cores and each VM instance uses four virtual cores, the scheduler allocates up to 96 virtual cores (that is, 24 instances, where each instance has 4 virtual cores).
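The arithmetic can be checked with a short Python sketch. The function name is hypothetical, and the numbers are the ones from the example above.

```python
def max_instances(physical_cores, cpu_allocation_ratio, vcpus_per_instance):
    """Return (schedulable virtual cores, number of instances that fit)."""
    virtual_cores = physical_cores * cpu_allocation_ratio
    return virtual_cores, virtual_cores // vcpus_per_instance


# 12 physical cores, allocation ratio of 8, 4 virtual cores per instance.
virtual_cores, instances = max_instances(12, 8, 4)
print(virtual_cores, instances)  # 96 24
```

Without over-commitment (a ratio of 1), the same node would hold only 3 such instances.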
So that’s it for now. We have given an overview of how PRS can be used in an example environment to support various use cases. In the future we will delve more into how the policies work in various scenarios. We will also explore customizing policies and the various integrations that are possible between PRS and other systems. There are a number of exciting new features coming in future versions of PRS, such as VM high availability, utilization-based scheduling, and support for host maintenance operations, that we will elaborate on. So stay tuned for more.