In over three decades in IT, one of the consistent themes of my job has been transforming underperforming business units for various clients. Some of the scenarios I've been called in to transform include:
- retailers suffering unplanned outages that cause revenue loss and erode customer loyalty to their brand
- customer service centers whose users call the help desk because the application is not performing at "market speed"
- financial institutions or health organizations forced to shut down services because users could see other users' data
Those are just three examples of underperformance, each of which affects the end user's experience. If those users are not locked into the brand, they could leave for a competitor and never come back. The underlying theme in these scenarios needs to be addressed by someone with an extensive background in both IT operations and application development. Here are some of the strategies I have used to transform underperformers into overachievers.
One of the first things I learned in my career was how to take charge. This is not as easy as it sounds. First, it meant having extreme confidence in myself and my decisions. In the early days I was sometimes successful, but other times I was not and needed to step back and reconsider what the next steps had to be. What follows are some of my lessons learned about taking charge.
Underperforming business units tend to have difficulty making decisions. Quite simply, there are too many ways to do any single technical task or effort that solves a problem, and when an organization runs by consensus it is even harder to decide. I learned early in my career that giving an organization a choice, even one as simple as between effortA and effortB, was futile; I have spent countless hours in meetings debating the risks and ramifications of two or more options. I found that providing one solution, and only one solution, was the most expeditious way to move an organization forward, all the while keeping plan B in the back of my mind in case my first decision hit a roadblock. At this point, though, I have had enough experience and enough previous failures to pretty much nail what plan A needs to be and how to execute it.
Have a plan and articulate it
A plan means nothing if it makes no sense to anyone else. When I provide the direction to move an organization forward, it comes with a step-by-step plan that addresses:
- short-term, immediate tactical steps, risks, and goals covering the must-haves for the current problem(s)
- an intermediate- and longer-term approach to the would-like-to-have goals
Even the short-term, immediate tactical steps may take several iterations of different efforts that can span weeks or months, depending on how severe the problem is. Inevitably, regardless of scope, the plan requires actions in both operations and application development. Sometimes I got lucky and it was only one or the other, but not often.
Start with the basics. Have things like the recommended OS-level tunings been applied? If not, then that is the first part of the plan.
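Checking the basics can itself be scripted. Below is a minimal sketch of a baseline check; the settings and recommended values are purely illustrative (the real list comes from your OS and middleware vendors' tuning guides), but the pattern of diffing actual state against a recorded baseline is the point.

```python
# Hypothetical recommended OS tunings for a busy web tier -- substitute the
# values from your vendor's tuning guide.
RECOMMENDED = {
    "net.core.somaxconn": 4096,   # deeper listen backlog
    "net.ipv4.tcp_tw_reuse": 1,   # recycle TIME_WAIT sockets
    "vm.swappiness": 10,          # keep the app heap out of swap
}

def tuning_gaps(current: dict) -> dict:
    """Return the settings that differ from the recommended baseline."""
    return {
        key: {"current": current.get(key), "recommended": want}
        for key, want in RECOMMENDED.items()
        if current.get(key) != want
    }

if __name__ == "__main__":
    # On a real host these would be read via `sysctl -a` or /proc/sys.
    current = {"net.core.somaxconn": 128,
               "net.ipv4.tcp_tw_reuse": 1,
               "vm.swappiness": 60}
    for key, gap in tuning_gaps(current).items():
        print(f"{key}: {gap['current']} -> {gap['recommended']}")
```

Run regularly, a check like this turns "were the tunings ever applied?" from a debate into a report.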
It should go without saying that any plan should be testable outside of the production environment. However, as applications mature and the user base grows, so does the operational IT environment. Test as much as possible, and keep production changes down to one change per change window with a tested back-out plan. Which brings us to the next topic.
Repeatability (AKA scripting)
To minimize risk, a robust operational IT infrastructure requires the ability to perform tasks over and over again while understanding exactly what the resulting output should be. Whether we are changing a configuration item or deploying an application, we should clearly understand the end result. Scripts developed and proven in test can then be promoted to production. I'll note here that with the advent of DevOps this facet of IT operations has become significantly easier and more robust than in the past. In some of the testing I manage internally, very large scale performance testing of 10,000 Liberty servers in a SoftLayer environment, I know that our Gradle scripts will build out the environment from scratch the same way each and every time.
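The "same result every time" idea boils down to idempotence: a script observes the current state, acts only if it differs from the desired state, and produces the same end result no matter how many times it runs. Here is a minimal sketch of that pattern; the config file format and keys are hypothetical, not from any particular product.

```python
import json
import os
import tempfile

def ensure_config(path: str, desired: dict) -> bool:
    """Make the config file at `path` match `desired`.

    Returns True if a change was applied, False if already converged.
    """
    current = {}
    if os.path.exists(path):
        with open(path) as f:
            current = json.load(f)
    if current == desired:
        return False  # already in the desired state, nothing to do
    with open(path, "w") as f:
        json.dump(desired, f, indent=2)
    return True

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "server.json")
    desired = {"maxThreads": 50, "httpPort": 9080}
    print(ensure_config(path, desired))  # first run applies the change
    print(ensure_config(path, desired))  # second run is a no-op
```

A script built this way is safe to re-run in test, safe to promote, and its "no change needed" output doubles as a drift check in production.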
Change is a scary word for a lot of people because it also means risk to the business. This is why change processes should be followed meticulously. Having redundancy, and lots of it, also reduces risk during a change; if redundancy doesn't exist, it needs to be the first part of the transformation plan.
Operational, infrastructural changes versus application fixes
I have always separated operational changes from application fixes within the same change window. How quickly application fixes can be identified, coded, and tested ultimately determines how quickly they will be introduced. Sometimes it is an easy code fix, but other times whole architectures or designs need to be reworked due to poor decision making. The same level of complexity can exist in the IT infrastructure, slowing the pace of change, because developing and testing scripts can take a lot of time, and testing the back-out plan can take longer than testing the solution.
Move the organization to be proactive
Underperforming organizations typically react to problems, which means the problem(s) may have been impacting the end user's experience for some time. Application monitoring is key to helping an organization get ahead of problems instead. I once led a business unit from being penalized every quarter for server uptime that missed SLAs to collecting a bonus every quarter for exceeding them, all by installing and configuring the right tools so the organization could identify problems and notify the right people to rectify them before the end user ever noticed.
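The mechanism behind "fix it before the user notices" is simple: alert on an internal threshold set below the contractual SLA, so the team has time to act. A minimal sketch of that idea follows; the thresholds, window size, and use of a rolling 95th percentile are illustrative assumptions, not a specific product's behavior.

```python
from collections import deque

SLA_MS = 500   # contractual response-time limit (hypothetical)
WARN_MS = 350  # internal early-warning line, deliberately below the SLA

class LatencyMonitor:
    """Track a rolling window of response times and classify the trend."""

    def __init__(self, window: int = 100):
        self.samples = deque(maxlen=window)

    def record(self, ms: float) -> str:
        self.samples.append(ms)
        ordered = sorted(self.samples)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]  # rolling 95th pct.
        if p95 >= SLA_MS:
            return "PAGE"  # SLA already breached -- the user felt this
        if p95 >= WARN_MS:
            return "WARN"  # act now, before the user notices
        return "OK"
```

The gap between WARN_MS and SLA_MS is the organization's reaction time; a reactive shop only ever sees PAGE.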
One thing I try to leave with each business unit I've transformed is how to innovate. This is how the business unit goes into high-achiever mode: mentoring them in how to think differently about problems and the approaches they take in IT. Encouraging wild ducks, so to speak.
Transforming underperforming business units takes as much leadership as it does technical prowess. I have found that the more prominent the problem (e.g., a complete application outage), the easier it was to troubleshoot and fix. Intermittent issues or glitches, like a response time that every once in a while jumps from 30ms to 560ms, tended to be more difficult, since capturing data (let alone the right data) at the moment of the problem can be hard. But that only means more effort needs to be spent on the application monitoring tools in order to flush out the necessary data.
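One common way to capture "the right data at the time of the problem" for intermittent glitches is to keep a rolling buffer of recent request records and snapshot it only when a spike occurs. The sketch below illustrates that pattern; the threshold and record fields are hypothetical.

```python
from collections import deque

SPIKE_MS = 200  # anything past this counts as a glitch (illustrative)

class SpikeRecorder:
    """Keep recent requests in a ring buffer; dump it when a spike hits."""

    def __init__(self, keep: int = 50):
        self.ring = deque(maxlen=keep)   # rolling context window
        self.dumps = []                  # snapshots captured at spike time

    def record(self, request_id: str, ms: float) -> None:
        self.ring.append({"id": request_id, "ms": ms})
        if ms >= SPIKE_MS:
            # Preserve the requests surrounding the glitch for later analysis.
            self.dumps.append(list(self.ring))
```

Because the buffer is always running, the snapshot includes what happened just before the spike, which is exactly the context that is gone by the time someone logs in to investigate.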