During a 14-month engagement, I worked with members of several teams to continuously improve existing processes, maximize the development team's productivity, and meet all of the client's expectations for a large-scale web application that was critical to the client's success in a very competitive industry. What the application actually did is not important to this discussion; rather, I would like to share some of the lessons learned while leading a team responsible for delivering continuous releases to a business-critical application.
Our development team consisted of seven developers, a business analyst, a build and release manager, and a team/tech lead. It might be helpful to take a quick look at each of these roles:
- Business Analyst
Most tickets received by the production support team were defects written against the existing code, but in some cases a ticket required an update to the business rules. In those cases, the business analyst works with the business team (the client) to create a requirements document that the tech lead and developers use to develop the solution.
- Build and Release Manager
Writes build scripts and manages the deployments to the testing and production environments. The build and release manager is also the administrator of all build tools, including a continuous integration tool, and monitors all configuration changes and logs related to build and deployment.
- Developer
Investigates and fixes the tickets assigned to them, performs peer reviews, documents major changes, and validates code fixes in the different environments according to the process.
- Team/Tech Lead
Manages all releases, enforces process, and assigns tickets to the developers. The lead acts as a liaison between the developers and the business team, and makes sure all dates are met and the timeline is followed. Also serves as an advisor during code development and helps with architectural decisions.
- Testing team
Tests all fixes and ensures the quality of the new code baseline prior to its release to production.
At the beginning of my assignment, all team members worked on every weekly release. Our timeline allowed any number of builds to the test servers, with no specific deadlines. The processes were not fully developed or enforced, and the developers lost time focusing on the wrong tasks or taking over responsibilities that were not theirs, which left them with minimal time to actually develop. This lack of structure meant we couldn't predict, even roughly, the number of tickets that would be delivered per release; worse, deadlines weren't being met. Because of the uncertainty over which tickets would be fixed in any given release, the testing team didn't have enough time to prepare testing scenarios for the tickets delivered to them, and ultimately could not fully test them before the new code baseline was deployed to production. In many cases, the end result was that we needed to postpone the release by a day or a week and combine it with the following release; in other cases, the new code baseline was released to production without being fully tested, causing even more tickets to be opened.
In order to overcome these (not uncommon) difficulties and successfully deliver a weekly release, we had to explicitly define our challenges and objectives and work on fixing existing processes to allow us to have a clear and defined development cycle to follow.
The critical objectives were to:
- Enhance productivity
- Enhance quality
- Meet deadlines
- Meet the client’s expectations
- Accomplish these enhancements progressively without interrupting weekly releases.
In order to eliminate the obstacles that were hampering our achievement of these objectives, we implemented a set of strategic actions, described in the sections that follow.
Action 1: Implement peer reviews
In our first trial, we kept the weekly release schedule as is, meaning that we would release a new code baseline once a week as before. The first change we introduced was adding peer reviews to the process.
Previously, developers didn't perform peer reviews; a developer would finish working on a code fix, check it into the new release branch, test it on a development server, and then let the testing team test for defects. Introducing peer reviews enabled us to deliver higher quality code with fewer defects and minimal disruption to the development cycle.
When a ticket was completed by a developer, an assigned peer would perform this review process:
- Review the ticket and requirements and make sure the code fix is complete and has addressed all the reported problems.
- Review the code change and make sure that it is programmatically correct. (We researched a variety of peer review guidelines and created our own.)
- Review the documentation (if it was required).
- Run and test the code fix in the development environment.
If any of the above checks failed, the reviewer assigned the ticket back to the developer with comments. The developer would complete the rework and assign the ticket back for peer review, and so on, until the reviewer approved it. The result was better quality code heading to the test team.
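The review loop described above can be sketched as a simple state machine. This is an illustrative model only; the check names and ticket structure are assumptions, not the team's actual tooling.

```python
# Hypothetical model of the peer-review cycle: a ticket bounces between
# reviewer and developer until every check passes.
REVIEW_CHECKS = [
    "requirements_addressed",  # fix covers all reported problems
    "code_correct",            # change follows the team's review guidelines
    "documentation_done",      # documentation updated, if required
    "runs_in_dev",             # fix runs and passes in the dev environment
]

def peer_review(ticket):
    """Return (approved, comments): comments list every unmet check."""
    comments = [check for check in REVIEW_CHECKS if not ticket.get(check, False)]
    return (not comments, comments)

def review_cycle(ticket, developer_fixes):
    """Repeat review/rework until approval; return the number of rework rounds."""
    rounds = 0
    while True:
        approved, comments = peer_review(ticket)
        if approved:
            return rounds
        rounds += 1
        developer_fixes(ticket, comments)  # developer addresses the comments
```

A ticket that fails only the documentation check, for example, would go back to the developer once and be approved on the second pass.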
Action 2: Alter the timeline
The next change was to create a new timeline for all teams to follow and then to enforce it. We also limited the number of builds to the testing server to two.
Figure 1 shows a sample timeline for a weekly release.
Figure 1. Timeline for weekly release
In this timeline, the first column describes the different tasks. The second column lists the group that will be in charge of the task. The remaining columns are for the days of the week of the release.
Let’s look at each of the tasks involved in the timeline:
- Code stabilization and submit for peer review
The developers must be done with all code development and make sure all their code has been checked in, deployed, and tested in the development environment. They then hand over the tickets for peer review.
- Code check-in deadline and end of peer review
During peer review, the reviewer might send the ticket back to the developer to fix any issue that is found. The developer and reviewer are expected to complete this cycle and have the ticket ready and finalized by this deadline.
- Code merge
Perform code merge between previous and current releases.
- Deployment to test environment
The build and release manager will deploy the new code baseline to the testing environment and notify the developers once this is done.
- Developer validation
The developers perform a validation on the testing server and confirm that all their code fixes have been deployed properly and are working as expected. If they find any data or configuration issues, they fix them and document the problem to make sure it is properly handled if it occurs on the production servers.
This is when the testing team begins testing.
- All teams sign off
A Go/No-Go call takes place with at least one person representing each of the development, build and deployment, business, and testing teams, plus any other involved stakeholders. This is the last meeting in a release's lifecycle. At this point, all teams clear the release for deployment to production, and the testing environment is freed for use by the next release. Any issues or concerns that were not addressed during the release are brought up in the meeting, and an action plan is put in place to address them. If warranted by the situation, the release can be postponed.
- Production deployment
This is when the build and release manager releases the baseline to production servers.
Now, let’s walk through the timeline:
- In the above example for the 10.21.0 release, the development team’s first deadline is 12/5 at 1:00 pm to stabilize the code and submit it for peer review. The actual code check-in deadline is on 12/5 at 5:00 pm. This gives the developers four hours to perform peer reviews and fix any issues that are found.
- On Friday morning, the build and release manager will deploy the code to the testing environment and then notify the developers who will perform the validation.
- At 1:00 pm, the testing team begins testing and continues through the end of the day.
- The second build starts on 12/9, mirroring the first part of the cycle: the development team will work on any defects opened by the testing team, perform peer reviews, and check in the code by end of day.
- On 12/10, the new code will be deployed to the testing server and the testers will confirm that all defects have been fixed.
- On 12/11, the testing team will continue testing until the Go/No Go Call.
As you can see from this timeline, looking back at the 10.20.0 release, a developer with no defects from the first build can start working on the next release immediately after finishing the validation step on 11/29. This gives a developer a maximum of 36 hours of development time per release; a developer who needs to work on defects gets roughly 24 hours. When defects are open, the team always gives the 10.20.0 release priority over the 10.21.0 release; once the defects are fixed, they continue working on the 10.21.0 release.
After implementing the new timeline, the releases were very small, averaging four tickets fixed and delivered per release (some tickets required more than 36 hours of development time, meaning that not all developers contributed to every release). This was consistent with the numbers from previous months. The difference, however, was that we were finally able to stabilize the process, and as the team started following it, the quality of the releases improved.
The complaints at this point came primarily from the testing team, who didn't have enough time for testing, and from developers who had to rework a ticket after a peer review. On the other hand, the business partner was happy that we could assign tickets to release dates and deliver them on time, but was not satisfied with productivity: they wanted more tickets resolved each week.
In fairness, this schedule also left little room for outside problems, such as server outages and other systems issues. Given the tight schedule, these kinds of problems often impacted the deliverables and release dates, sometimes resulting in a release with only one or two fixes, or in postponing a release and merging it with the subsequent one.
As we learned more from this approach, we weighed the pros and cons, negotiated with the different parties, and developed two new timelines, offering different options to the stakeholders:
- Option 1: Split the team in two so that each smaller team works on a release every other week. This gives each team time to develop and support their release until it is deployed to production, while preserving the weekly release schedule.
- Option 2: The second alternative was to keep the team intact as a single team, but change the release schedule to bi-weekly (one release every two weeks).
If you look at the timeline for Option 1, you can see there is some overlap between the releases. Although the same resources are not involved, this approach requires managing two teams working on different stages of their releases at the same time. Moreover, some tickets would require more development time than is allocated, pushing a developer to work on a different release than originally assigned and causing an imbalance in team size and, consequently, output.
Figure 2 shows a sample of the timeline for weekly releases with two alternating teams.
Figure 2. Timeline for weekly release, two teams
Following this plan:
- Team A works on the 10.20.0 and 10.22.0 releases while Team B works on the 10.21.0 release.
- Team A finishes the 10.20.0 release at midday on 11/28 and begins working on the 10.22.0 release at that time.
- The development time for each release varies from 6.5 to 9.5 days, depending on whether there are defects or not.
- For 10.22.0, the team's first deadline, for submitting code fixes for peer review, is 12/10. By this time, all code fixes should be checked into the 10.22.0 release branch, deployed by the developers to the development environment, and validated.
- We also have to merge the 10.21.0 release into the 10.22.0 release baseline. This will be done after the 10.21.0 release has been approved to be released to production.
- On 12/12, the build and release manager will deploy the 10.22.0 baseline to the testing environment, and then the developers will perform the validation and testers will start testing afterwards.
In the new timeline, testers get about two days of testing. All ticket testing is expected to be completed by end of day 12/12, and regression testing (if required) is performed on 12/13. The development team works on defects as soon as they are written and makes sure they meet the second build deadline on 12/16. The second build and deployment to the testing environment happen on 12/17, followed by developer validation and release of the environment to the testing team.
Notice that during the same week Team B is developing the 10.21.0 release, Team A is finalizing the 10.20.0 release: performing peer reviews and developer validation, and making sure all open defects are fixed in time for the second 10.20.0 build. Team A can start working on 10.22.0 if they are not assigned defects from the 10.20.0 release. Also notice that until 12/5 the development environment is in use by Team B for 10.21.0, so Team A has to develop code locally until then.
Option 2 is a different step forward and allows for higher productivity by bundling the tickets and removing some of the overhead of the previous models. In this model, we remain a single team rather than dividing into two. We have fewer deployments to the testing servers (each of which represents some downtime for all teams), and the code merges are minimal.
Figure 3 shows a sample of the timeline for bi-weekly releases with a single team.
Figure 3. Timeline for bi-weekly release, one team
Table 1 shows a comparison between all three release timelines.
Table 1. Development release timeline comparison
| Project element | Weekly releases (original) | Weekly releases with alternating teams | Bi-weekly releases |
| --- | --- | --- | --- |
| Development time (days) | 3 to 4.5 | 6.5 to 9.5 | 6.5 to 9.5 |
| Developer validation (hours) | 4 | 4 | 4 |
| Peer review (hours) | 10 | 13 | 21 |
For cases where an important ticket can't wait for a planned release, we still maintain a timeline for the off week, in which no release is planned. This assured our partners that we could still deliver unplanned releases as needed and acted as a safety net: we could release on those dates without affecting the planned releases. For example, if we needed to activate the timeline for an unplanned release to fix a ticket, we would only need one developer on point for it, with minimal build, deployment, and testing, enabling us to keep the focus on the main bi-weekly release.
Figure 4 shows a case in which an unplanned, exceptional release is required.
Figure 4. Timeline for exception release
Understandably, the development team preferred the second option, bi-weekly releases, but it was not an easy sell to some of the other teams. Ultimately, the pros of this approach outweighed the cons and we convinced our partners to move to a release every other week. With the second timeline option implemented, we were able to make the most of the release cycle and allocate more time for peer review and testing.
Given this new release timeline, the team’s productivity increased by nearly 100%, as determined by the number of tickets closed and the number of development hours per quarter. The team delivered on average 16 tickets every two weeks, compared to four tickets per week originally.
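The productivity figures above can be sanity-checked with a little arithmetic; the numbers are from the text, the calculation is my own.

```python
# Sanity-check the stated productivity gain: 4 tickets per weekly release
# before, versus 16 tickets per bi-weekly release after.
before_per_week = 4        # tickets per week on the original weekly schedule
after_per_week = 16 / 2    # 16 tickets every two weeks
increase = (after_per_week - before_per_week) / before_per_week
print(f"throughput increase: {increase:.0%}")  # prints "throughput increase: 100%"
```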
Action 3: Refine service level agreements
Besides defining the timeline and release cycle length, we needed to define (and enforce) the necessary criteria for a successful release:
- Deadline to submit a ticket into a release
We added a new point to the service level agreement (SLA) stating that the deadline to submit a ticket into a planned release was six business days prior to the first-build code freeze (excluding high-importance tickets that required immediate attention). A submitted ticket also had to be ready for development; for example, if a requirements document was required, that work had to already be done.
- Require peer review
By mandating peer reviews that cover reviewing the code, ensuring the fix is complete, and validating that the problem is addressed and the fix is tested, we ensure higher quality and fewer defects. (This step cut the number of defects opened per release by over 50%, and ensured that developers could confidently start working on the next release as soon as they finished the current one.)
- Number of builds to testing environment per release cycle
Originally, we permitted an unlimited number of builds, driven by how many tickets could be solved and how many defects were opened each day. In some weekly releases we would have a build every day, and even more on the last day before the production release. We decided to limit this to two builds per release cycle. The SLA now stated that all tickets assigned to a release must be ready for testing in the first build to the testing servers.
The testing team must follow the timeline: finish testing all tickets within the assigned time slot, open defects as soon as they are found, and assign them to developers, who then work to fix the defects by the second build to the testing environment. The testing team has a shorter window the second time around, but they only need to validate the fixes for the defects opened against the first build.
The testing team is now given more time to work in the testing environment with minimal interruptions from the multiple builds and deployments they formerly needed to deal with. They are also able to better manage their resources since the cycle is now defined.
They also know the exact list of tickets that will be in the release seven business days before they begin working on it, giving them ample time to prepare their test cases and scenarios to ensure a higher quality deliverable.
- Actions taken when deadlines are not met
Of course, there are occasions when the estimated level of effort is not accurate, causing deadlines to be missed. The SLA defined three options for handling this situation:
- If the required extra development time is less than half a day, we push the whole timeline half a day to provide time to deliver.
- If the required extra development time is more than half a day, we roll back the changes and move the ticket to another release.
- In very special cases, when the ticket has to be fixed in the currently assigned release, we allow the ticket to be ready by the second build deadline. Extra peer review time is allocated and the testing team is notified ahead of time to make sure enough resources are available to finish testing on time.
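The three SLA options above amount to a simple decision rule. The half-day threshold comes from the SLA as described; the function and parameter names are illustrative, not part of any real tooling.

```python
def handle_missed_deadline(extra_days_needed, must_ship_this_release=False):
    """Apply the SLA's three options for a ticket that misses its deadline.

    extra_days_needed: estimated extra development time, in days.
    must_ship_this_release: True only in the special case where the ticket
    cannot be moved to another release.
    """
    if extra_days_needed <= 0.5:
        # Option 1: small slip, push the whole timeline half a day.
        return "push timeline half a day"
    if must_ship_this_release:
        # Option 3: hold the ticket to the second build deadline, with extra
        # peer-review time and advance notice to the testing team.
        return "target second build; add review time, notify testers"
    # Option 2: too large a slip, roll back and reschedule the ticket.
    return "roll back and move ticket to another release"
```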
- Actions taken when a defect is found after the second build
On rare occasions, we might face a situation where we had a new defect opened after the second build. In that case, we have two options:
- If the defect causes more problems than the originally reported problem, or if the business partner deems it high risk, we roll back all code related to the ticket that caused the defect and work on it in the following release, giving the developers enough time to completely fix the problem. In this case, the developer rolls back the code change and a third build to the testing server is required. The testing team then retests to make sure the baseline is back to its prior state.
- Deploy the current release with the known defect and fix the defect in the following release — or if it’s critical, as an emergency fix whenever it is ready.
The decision rests mainly on whether the new fix, with its defect, has more drawbacks than the original ticket. If it does, we take it out of the release. If not, we deploy the fix with the known defect and deploy another fix in the upcoming release.
Action 4: Manage version control
Few things are as important to a software development project as version control, especially when project scale, team size, and visibility are all amplified.
- Branching and merging
Originally, every time the team started working on a release, the process called for creating a new branch from the trunk, and developers were asked to keep track of all open branches and check their code into every open branch. We changed the process so that the branch is created from the latest open production support branch, and developers check their code into only the release they are working on, reducing the overhead of tracking several branches. In both cases, we needed to perform a merge with the latest production deployment.
The merge was performed by one of the developers, following a rotation. Given the bi-weekly release schedule, you can see from the timeline that, depending on when the branches are created and whether there is a second build, the merge might be minimal or not needed at all.
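The difference between the two check-in policies can be illustrated with a small model. This is a simplified sketch of the bookkeeping involved; the branch and ticket names are invented for illustration.

```python
# Simplified model of the two check-in policies described above.
# Old policy: developers check the same fix into every open release branch.
# New policy: each fix is checked into its own release branch only; a single
# merge with the latest production deployment reconciles the baselines.

def old_policy_checkins(fix, open_branches):
    """One commit of the same fix per open branch: overhead grows with branches."""
    return [(branch, fix) for branch in open_branches]

def new_policy_checkins(fix, target_branch):
    """One commit, to the release the developer is actually working on."""
    return [(target_branch, fix)]

open_branches = ["release-10.20.0", "release-10.21.0", "release-10.22.0"]
old = old_policy_checkins("TICKET-123", open_branches)      # 3 commits to track
new = new_policy_checkins("TICKET-123", "release-10.21.0")  # 1 commit
```

With three open branches, the old policy triples the commits a developer must track for a single fix; the new policy keeps it at one, at the cost of one explicit merge per release.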
- Managing the available environments
The original setup was that every release went through three different environments (servers) before being deployed to production:
- Development: Developers used this environment to test their code after finishing development and testing in their local environments. Developers had full access to the server and could deploy the code using a build tool; no release manager was needed here. Once their local work was done, developers used this environment to test and perform peer reviews.
- Stabilization: This environment was set up as an exact duplicate of the testing environment, which in turn should be an exact duplicate of production. The release manager performed the build here to test the build scripts, and the developers validated that all checked-in code and configurations were successfully deployed. If there were any problems, the developers fixed them, and the release manager took notes to make sure the deployment to the testing environment (and then production) would be complete.
- Testing: The release manager deployed the code baseline to this environment, and the testing team ran their test scripts to make sure the code baseline was ready for production release.
One of the changes we made to allow more time for development and testing, and to reduce environment-related defects, was to skip the second step. Instead, we deploy directly to the testing environment and let the developers and the release manager validate the build before handing the environment over to the testing team.
The main problem with the previous model was that, in many cases, the stabilization environment wasn't an exact duplicate of the testing environment, so we ended up with environment-related issues; that is, an issue would present itself in one environment and not the other. For example, the developers would validate their code in the stabilization environment and pass it on; then, if there was a data or configuration issue in the testing environment, the testers might open defects against issues that were not actually related to the code.
Now, since the developers validate in the testing environment, we are sure that they are handing over an environment that is free from data or deployment issues, enabling testers to deal only with release defects and not environment-related defects.
The stabilization environment remained in use, but as a development environment where developers tested fixes for defects opened against the first build, in preparation for the second build.
- Release ID
A minor change was switching release IDs from dates to release numbers. This affected the branch names we used and the release ID assigned in the ticket queue. With this simple naming convention, we no longer needed to update all the references whenever a release date changed.
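The benefit of numeric release IDs can be seen with a tiny example. The branch-name formats here are illustrative, not the project's actual convention.

```python
# Date-based IDs bake the schedule into every reference, so a slipped date
# forces a rename everywhere; a release number stays stable.
def date_branch(release_date):
    return f"release-{release_date}"           # e.g. release-2013-12-12

def numbered_branch(major, minor, patch=0):
    return f"release-{major}.{minor}.{patch}"  # e.g. release-10.21.0

# If the 12/12 release slips a week, the date-based name changes...
assert date_branch("2013-12-12") != date_branch("2013-12-19")
# ...but the numbered name is unaffected by the schedule change.
assert numbered_branch(10, 21) == "release-10.21.0"
```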
This article highlighted some of the lessons learned from my experience working as a team lead with the production support team maintaining a web application for a commercial business. While this particular project was large and long term, many of the challenges described here are characteristic of projects of all sizes. Likewise, the solutions we executed, including approaches to managing resources, setting timelines, reviewing and changing processes, re-establishing SLAs, and so on, might be applicable, or at least worth discussing, when looking for ideas to enhance quality and meet the commitments of your own development projects.