Planning for maintenance activities

Cloud services requires five key maintenance activities. Some keep normal day-to-day operations going, and others are contingencies for restoring service after a disaster.

The following two activities are not automated and must be performed by the administrator:
  • Backing up the key library manager - a new copy needs to be made every time there is any change to a key in the key manager. For more information, see Backing up the Cloud services configuration.
    Note: If the key manager is lost, IBM® has no backdoor or other internal means of recovering the data from cloud storage. It is therefore important to keep a backup copy of the key manager.
  • Backing up the Transparent cloud tiering metadata with SOBAR - Cloud services uses SOBAR to back up the Transparent cloud tiering metadata, which can be used to restore the Transparent cloud tiering service after a failure. A sample script that you can deploy to run the backups is provided in the Transparent cloud tiering directory. For more information, see Scale out backup and restore (SOBAR) for Cloud services.
The following three activities are performed by an automated Cloud services maintenance service:
  • Background removal of deleted files from the object storage. The recommended frequency is daily.
  • Backing up the full Cloud services database to the cloud. The recommended frequency is weekly.
  • Reconciling the Transparent cloud tiering database. The recommended frequency is every four weeks.
The Transparent cloud tiering maintenance service comes with default daily (at night) and weekly (on the weekend) maintenance windows in which these activities run. You can use the mmcloudgateway maintenance status command to query your current maintenance windows. Avoid running migrations during these activities by making sure that your migration policies run outside those windows. Maintenance windows must also scale with your workload: they must be long enough for the maintenance activities to fit inside them. Consider the following guidelines when you plan these activities:
  • Reconcile is by far the longest activity and the main one to consider when you plan your maintenance windows. Reconciling one container of 100 million files (the default spillover value) takes a few hours if the metadata is on flash storage. If the metadata is on disk, it takes roughly 6 to 12 hours.
  • Each Transparent cloud tiering node can run one maintenance activity at a time during the maintenance window. For example, with three Transparent cloud tiering nodes, up to three maintenance activities can run in parallel.
  • Maintenance activities run to completion even after the maintenance window has closed. The worst case is an activity that is scheduled and started just before the window closes: it runs almost entirely outside the maintenance window.
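
Because activities run to completion, a migration policy scheduled shortly after the window closes can still collide with maintenance. The following is a minimal sketch of that check, not part of the product. The window times, the migration schedule, and the helper names busy_interval and overlaps are all hypothetical; the 12-hour worst case is the upper bound quoted above for a disk-based reconcile.

```python
from datetime import datetime, timedelta


def busy_interval(window_start, window_end, longest_activity):
    """Return the interval during which maintenance may still be running.

    An activity that starts just before the window closes runs to
    completion, so the worst-case end is window_end + longest_activity.
    """
    return window_start, window_end + longest_activity


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


# Hypothetical weekly window: Saturday 00:00 to Sunday 00:00 (24 hours).
window_start = datetime(2024, 1, 6, 0, 0)
window_end = window_start + timedelta(hours=24)

# Assumed worst case: a disk-based reconcile taking 12 hours.
longest_activity = timedelta(hours=12)

busy_start, busy_end = busy_interval(window_start, window_end, longest_activity)

# Hypothetical migration policy run: Sunday 06:00, lasting 2 hours.
migration_start = datetime(2024, 1, 7, 6, 0)
migration_end = migration_start + timedelta(hours=2)

conflict = overlaps(busy_start, busy_end, migration_start, migration_end)
print("Maintenance may run until:", busy_end)
print("Migration conflicts with maintenance:", conflict)
```

In this example the migration starts six hours after the window closes but still falls inside the worst-case maintenance period, so it would need to move to Sunday 12:00 or later.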

The following example shows how the size of the maintenance window affects the number of Transparent cloud tiering files that you can scale to. With a weekly maintenance window of 24 hours, each node can reconcile two to three 100 million file containers per week. Because each container needs to be reconciled only once every 4 weeks, a weekly 24-hour maintenance window can accommodate eight to twelve 100 million file containers per node. With a full 4-node setup, this adds up to 32 - 48 containers (roughly 3 - 5 billion files) for a single node group.
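
The arithmetic behind these figures can be reproduced with a short calculation. This is an illustrative sketch only: the function name containers_per_node_group is hypothetical, and the inputs are the values quoted in the text (two to three reconciles per node per week, a 4-week reconcile cycle, four nodes, and 100 million files per container).

```python
FILES_PER_CONTAINER = 100_000_000   # default spillover value
RECONCILE_CYCLE_WEEKS = 4           # each container is reconciled every 4 weeks
NODES_PER_GROUP = 4                 # full node group


def containers_per_node_group(reconciles_per_node_per_week,
                              nodes=NODES_PER_GROUP,
                              cycle_weeks=RECONCILE_CYCLE_WEEKS):
    """Containers a node group can keep reconciled within one cycle.

    Each node reconciles a fixed number of containers per weekly window,
    and a given container only needs its turn once per cycle.
    """
    return reconciles_per_node_per_week * cycle_weeks * nodes


for per_week in (2, 3):
    containers = containers_per_node_group(per_week)
    files = containers * FILES_PER_CONTAINER
    print(f"{per_week} reconciles/node/week: "
          f"{containers} containers, {files / 1e9:.1f} billion files")
```

Changing the per-week figure shows how strongly the result depends on reconcile speed, which is why faster metadata storage raises the ceiling.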

If you want to use Transparent cloud tiering with more files than this, consider placing the IBM Spectrum Scale metadata and the Transparent cloud tiering database on flash storage. This greatly increases the number of files that the maintenance window can handle.

To set up a maintenance activity, see Configuring the maintenance windows.