
Bluemix Cloud Foundry: 10 Lessons Learned in 2016


Introduction

Bluemix is IBM’s public Platform-as-a-Service, based on Cloud Foundry. It’s the world’s largest deployment of Cloud Foundry with more than 1 million registered users running half a million apps and hundreds of services. IBM is proud of this accomplishment and our contribution to the enterprise cloud platform ecosystem.

There are three types of Bluemix deployments: public, dedicated, and private. Public deployments run on IBM’s SoftLayer IaaS and are open for anyone to use: create an account and start pushing apps. Dedicated deployments are separate Bluemix environments purchased by and dedicated to a particular customer, also running on the SoftLayer cloud. Private deployments run on a customer’s own hardware, typically on the customer’s OpenStack cloud environment.

What have we learned?

Every year that Bluemix exists, we learn how to run this successful cloud platform even better. To manage such a large set of Bluemix deployments, IBM has a number of development, test, and site reliability engineering (SRE) teams distributed worldwide. This allows the entire team to maintain Bluemix 24 hours a day, 365 days a year. Critical to that maintenance is the BOSH technology. BOSH, which stands for “BOSH Outer shell” (bosh.io), is “an open source tool for release engineering, deployment, life-cycle management, and monitoring of distributed systems.” It is core to the Cloud Foundry technology and allows IBM operators to perform maintenance and recovery actions on any Bluemix deployment.

Read on to discover the top lessons we’ve learned in the past year operating Bluemix at this incredible and unsurpassed scale. Dr. Michael Maximilien, IBM scientist and researcher, walks us through these lessons. And don’t forget to register for InterConnect 2017, where you can learn how IBM uses other parts of Cloud Foundry for Open First design.

Register for InterConnect 2017 now!

Lessons

(In reverse order of importance, with the problems faced and the solutions)

Lesson 10: Tightly controlled change requests

Problem: In large companies with international teams, the challenge for SRE and development teams quickly becomes one of alignment and control. For Bluemix, in the early days, this meant long team meetings across difficult time zones and overall slow execution on urgent matters.

Solution: To alleviate these issues, the Bluemix team established a change request (CR) process that allows anyone within the global team (both the Cloud Foundry community and the Bluemix Cloud Foundry operations team) to participate in change requests. CRs are handled electronically and can be addressed by all team members. However, to ensure control, only the small Operations Control Team can create and approve new CRs. This allows changes to move through test, pre-production, and production environments in a timely but disciplined fashion.

 

Lesson 9: Audit deployments for health

Problem: While manually auditing CF deployments periodically gives some idea of the health of the environment and allows manual intervention, it is not a strategy that scales, especially when running the largest set of CF deployments known. We needed an automated solution that would complement manual intervention along with the automatic audit capability that BOSH offers during deploys: canary-based deployments.

Solution: To address this challenge and support our growing worldwide team, the Bluemix team created a global central tool called “Doctor” that allows teams from all over the world to keep a pulse on the health of all deployments, no matter where they are located. This has been critical in maintaining the overall system and ensuring the reliability and availability necessary for a cloud platform.
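Doctor itself is proprietary and not public, but the core of the idea can be sketched with nothing more than the BOSH CLI. The sketch below uses v2-style CLI commands; the environment aliases and deployment names are hypothetical placeholders, and a real audit does much more than grep for failing processes.

    #!/usr/bin/env bash
    # Minimal health-audit sketch: for each deployment in each environment,
    # ask BOSH for instance process states and flag anything reported as failing.
    set -euo pipefail

    ENVIRONMENTS="us-south eu-gb au-syd"      # hypothetical BOSH environment aliases
    DEPLOYMENTS="cf diego-cells services"     # hypothetical deployment names

    for env in $ENVIRONMENTS; do
      for dep in $DEPLOYMENTS; do
        failing=$(bosh -e "$env" -d "$dep" vms | grep -c failing || true)
        if [ "$failing" -gt 0 ]; then
          echo "ALERT: $env/$dep has $failing failing instance(s)"
        fi
      done
    done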

 

Lesson 8: Log checking and monitoring

Problem: A problem we experienced quickly, as Bluemix’s popularity grew and the various worldwide deployments grew with it, was how fast the resulting logs grew. This was predictable, but it came faster than expected; the immediate symptom was that log rotation happened too frequently.

Solution: The solution was to introduce, early in all deployments, a log retention policy along with a proprietary tool to ingest, parse, and expose log data to all members of the worldwide SRE team. The Cloud Foundry Loggregator stream can be processed and extended to feed different consumers of any environment’s log data; the key is to be proactive about processing and exposing that data to engineers before a problem occurs and the data is required for debugging.
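The ingestion and parsing tool is proprietary, but the retention side of such a policy can be sketched with standard logrotate configuration. The sketch below assumes component logs live under /var/vcap/sys/log (the conventional location for BOSH-deployed jobs); the paths and retention values are illustrative only.

    # Illustrative log retention policy for BOSH job logs (values are examples).
    # This would typically be laid down on each VM by a BOSH release job.
    cat > /etc/logrotate.d/vcap-jobs <<'EOF'
    /var/vcap/sys/log/*/*.log {
        daily
        rotate 7          # keep one week locally; ship older data off the box
        compress
        missingok
        notifempty
        copytruncate
    }
    EOF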

 

Lesson 7: BOSH init woes

Problem: While the adoption of “BOSH init” hinted at the future of BOSH and streamlined the creation of director VMs with a simple binary, it also left some woes that were hard for large teams to deal with. In particular, the difficulty in re-creating existing director VMs and the frequent updates as the tool matured proved problematic.

Solution: The lesson learned here is simply to plan better when adopting new updates that are significant departures from the norm. BOSH init is definitely a great direction for BOSH; however, given that it changed so many things (it was not a simple version update), we should have expected things to break while the software matured. Better planning for such updates is now in place.
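For reference, a director create-or-update with bosh-init looks like the sketch below. The file names are illustrative; the key operational detail is that bosh-init tracks the existing director in a state file next to the manifest, and losing or diverging that file is exactly what makes re-creating an existing director painful.

    # Create or update a BOSH director VM with bosh-init (file names illustrative).
    # bosh-init records what it created in a state file (e.g. bosh-state.json)
    # alongside the manifest; keep both under version control and pin the
    # bosh-init binary version across the team.
    bosh-init deploy bosh.yml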

 

Lesson 6: Migrate all custom software to BOSH releases

Problem: One of the lessons you quickly learn while using BOSH is to have all software set up as BOSH releases. In the BOSH community this is also known as “BOSHifying” your software. As Bluemix grew, various parts of the solution required custom and proprietary software to be added to Bluemix, sometimes for differentiation and sometimes to solve problems that the community code was not addressing.

The problem was that much of this initial software was added by baking it into the stemcells.

Solution: By requiring all software to be packaged as BOSH releases (a process that is usually straightforward but sometimes painful), the resulting mix of proprietary and community code that makes up Bluemix is now under the control of one unified tool set: BOSH.
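For anyone new to “BOSHifying,” the workflow is roughly the sketch below, shown with v2-style BOSH CLI command names (the older CLI uses spaces instead of hyphens). The release, package, and job names are hypothetical.

    # Skeleton for a new release wrapping custom software (names are hypothetical).
    bosh init-release --dir my-custom-service-release
    cd my-custom-service-release

    # One package for the bits to compile/install, one job to run them under monit.
    bosh generate-package my-custom-service
    bosh generate-job my-custom-service

    # After filling in packaging scripts, job templates, and the monit file:
    bosh create-release --force
    bosh upload-release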

 

Lesson 5: Do not use PowerDNS (if possible)

Problem: PowerDNS has long been part of Cloud Foundry releases, while also being noted as not a production-ready solution. It was added to make the Cloud Foundry release self-contained for development and testing purposes, not for production. The primary thinking was that in production, one would switch to a DNS solution provided by the underlying IaaS layer.

The problem is twofold. First, some IaaS providers do not offer a DNS solution; second, if you use PowerDNS in production, it becomes a single point of failure (SPoF) that is hard to replace.

Solution: The solution here is not simple. The BOSH team is actually working on a way to remove the need for PowerDNS in Cloud Foundry releases, as well as the dependency on external DNS providers. While that work is still in progress, the lesson learned here is to think long and hard before using PowerDNS in your deployments.

 

Lesson 4: Security updates are painful but important

Problem: If there is one reality on the internet today, it’s that hackers exist and you will be subject to their frequent attacks. One sure way to fend off internet evildoers is to keep your software up to date, and in particular to make sure all of its security updates are applied and kept current.

Solution: To that effect, Cloud Foundry release and BOSH stemcell updates are published to address various kinds of security issues as well as Common Vulnerabilities and Exposures (CVEs) [4]. Making sure your Cloud Foundry deployments are frequently updated to apply these security updates is of paramount importance.

The lesson learned here is to make sure your deployments receive frequent, periodic updates to apply security fixes. This is particularly important since large Cloud Foundry environments might take some time to update, and security releases might appear to destabilize the environment. Nevertheless, security updates must be treated as being of the utmost importance and prioritized at the top of the list.
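In practice this boils down to a regular cadence of uploading the patched stemcell and release versions and redeploying, so BOSH can roll the fixes through canaries as usual. A minimal sketch with v2-style CLI commands; the environment alias, deployment name, manifest, and artifact locations are placeholders.

    # Upload the patched artifacts (placeholders for the actual tarballs or URLs).
    bosh -e prod upload-stemcell STEMCELL_TARBALL_OR_URL
    bosh -e prod upload-release  RELEASE_TARBALL_OR_URL

    # Point the manifest at the new versions and roll the update through canaries.
    bosh -e prod -d cf deploy cf.yml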

 

Lesson 3: Multi-BOSH deployment

Problem: One of the more difficult architectural principles to grasp initially with BOSH is knowing whether to use micro BOSH (BOSH init) or multiple BOSH directors to manage your clusters. Multiple BOSHes keep each director database smaller and make it easier to place your director VMs in various locations in your IaaS.

Solution: However, perhaps the main reason to plan and split your BOSH deployments is to allow your clusters to be deployed and updated more easily. For instance, by dividing your Cloud Foundry deployment into a cluster that contains the cloud controller and associated jobs and a separate cluster with the Diego cells, you gain the ability to update and grow each one independently of the other.

More importantly, by dividing your deployment into smaller clusters, the actual updates are faster and don’t cripple your entire production environment. This is especially important when the environment needs to scale to large numbers of VMs.

BOSH supports this kind of deployment right out of the box. The main difficulty is making sure that your manifests line up correctly, e.g., IP address reservations and networks. And of course, training BOSH operators to work with many deployment manifests and target the correct one for the correct jobs is also key, but this is easy to grasp if you start with this topology early.
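Operationally, a multi-BOSH topology just means each cluster has its own director (or at least its own deployment and manifest) and is targeted explicitly. A minimal sketch with v2-style CLI commands; the aliases, addresses, and file names are hypothetical.

    # Register one director per cluster (addresses and certs are hypothetical).
    bosh alias-env cf-api   -e 10.0.1.6 --ca-cert ca-api.pem
    bosh alias-env cf-cells -e 10.0.2.6 --ca-cert ca-cells.pem

    # Each cluster has its own manifest and can be deployed/updated independently.
    bosh -e cf-api   -d cf-api   deploy cf-api.yml
    bosh -e cf-cells -d cf-cells deploy cf-cells.yml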

 

Lesson 2: Deployments and updates are never 100% successful

Problem: Deployments and updates are among the primary tasks that all BOSH operators learn on day one. One thing you learn quickly is that these operations, for large clusters, can be time consuming. And while these operations are pretty stable, one reality that bites very quickly with BOSH is that deployments and updates sometimes fail.

As a matter of fact, for large clusters, it is often more likely that a deployment or update fails than succeeds on the first attempt. This can be very disconcerting, since the tendency for developers or BOSH operators is to expect the deploy to either pass or fail outright. However, due to issues with the current state of the deployment or with the updates being applied, deployments can sometimes fail. In the case of updates, this can be OK and just require a restart.

Solution: The lesson learned is to be more zen about the outcome of BOSH deploys and updates: rather than expecting them to always succeed or fail, expect them to eventually succeed. This might mean that an update requires multiple attempts before it succeeds.
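One simple way to institutionalize that mindset is to wrap deploys in a bounded retry, investigating between attempts rather than treating the first failure as final. A minimal sketch; the environment, deployment, and manifest names are hypothetical.

    # Retry a deploy a bounded number of times before escalating to a human.
    for attempt in 1 2 3; do
      if bosh -e prod -d cf deploy cf.yml --non-interactive; then
        echo "deploy succeeded on attempt $attempt"
        break
      fi
      echo "deploy attempt $attempt failed; investigating before retry"
      # 'bosh cloud-check' can help repair stray VM/disk state between attempts.
      sleep 300
    done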

 

Lesson 1: Backup your director database

Problem: BOSH maintains all the data it needs to operate in the director database. It goes without saying that backing up this database is a must for any production environment. Sadly, as happens in almost all IT enterprises, timely database backups are often missing from IT processes.

Losing the disk where the director database is located is an event that will occur, given enough time and usage. It is therefore paramount that timely backups are taken and moved to external disks or, better, to an external cloud storage service. And while BOSH has recently exposed backup and restore CLI commands, taking steps to simply back up the entire disk is a good course of action.

Solution: In Bluemix we have experienced the loss of a disk where a director database was located, with no way to recover the disk directly. The clusters that the director managed were alive and well, but the director, having lost its database, knew nothing of any of them. One solution we came up with was to create a dummy CPI that allowed the director to “replay” a deployment, but instead of creating real VMs and disks, the CPI simply returned the IDs of the existing cluster resources. At the end of this “fake deploy,” with some manual intervention, we were able to restore the director database.
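That recovery was heroic; the cheaper insurance is a boring, periodic dump of the director database shipped off the director’s disk. A minimal sketch, assuming the default co-located PostgreSQL database on the director VM; the host names, database name, and destination are hypothetical.

    # Nightly dump of the director database, shipped off the director's own disk.
    # Assumes credentials are available on the director (e.g. via a .pgpass file).
    ssh vcap@director.example.com 'pg_dump -h 127.0.0.1 -U postgres bosh' \
      | gzip > "bosh-director-$(date +%F).sql.gz"

    # Move the dump to external storage (a backup host here; object storage is better).
    scp "bosh-director-$(date +%F).sql.gz" backup-host:/backups/bosh/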

 

Bonus lesson: Always seek knowledge (ASK)

Problem: It goes without saying that while BOSH is a complicated tool, it does a lot and allows multi-cloud clusters to be managed and maintained in mission-critical production situations. The key to operating all of your Cloud Foundry deployments in an efficient and sustainable manner is to seek knowledge and share it with others performing similar tasks all over the world.

Solution: To that aim, there are active Slack channels at slack.cloudfoundry.org, such as #bosh, #bosh-core-dev, and #bosh-cpi-dev, where the various members of the BOSH teams hang out and ask and answer questions. It’s important for new BOSH users to take the time to peruse such resources along with the official BOSH documentation located at bosh.io [2].

Conclusion

BOSH is a key tool for the operation of Bluemix Cloud Foundry at IBM, and we have found that it helps to develop common knowledge around the tool and to share best practices.

Over time, we believe such lessons learned and best practices can become part of the common knowledge shared among BOSH users and enhanced globally by the whole community.

Bluemix keeps evolving: the recent Bluemix Cloud Foundry Diego migration is evidence of that. As BOSH continues to add new features (CLI v2, local DNS, and others), it makes sense for these lessons learned to continue to evolve and represent the most important shared knowledge from IBM’s BOSH users.

And thank you for helping make Bluemix Cloud Platform the success it has become! The mission of open source excellence at IBM never ends.

References

Author

Dr. Michael Maximilien is a leading IBM scientist and researcher in IBM Cloud Labs and a driving force behind IBM’s contributions to the Cloud Foundry Foundation.

