Leveling Up: Perpetual Process Improvement Perfectionization
As software developers, we should strive to be constantly learning, continually improving ourselves and our craft. Part of that is improving how we write code and how we debug and fix problems, but it should also include continually evaluating and refining the processes we follow. Rather than blindly following the same process because someone, somewhere, at some point in the past put it in place, we should be looking for ways to improve how we do things.
In this issue, I wanted to take a look at how far we’ve come in some of our processes where I work, and how we made, or are in the process of making, the transition from the old way to the new way. Now that’s dedication*!
* This is a bit of an in-joke at work. We had a resumé come in, listing an item of experience as “We went from the old way to the new way in less than a month. Now that’s dedication!” There was no other detail about what the old way or new way was, but apparently going from the old way to the new way in less than a month is proof of dedication.
The Way of Things at Work
At my company we have three teams, each working on a separate software as a service product. Each team runs in a slightly different way, based on the comfort level of the team, the maturity of the software, and the comfort level of the customer in dealing with deployments.
We have one team that has the oldest and most tendril-y code, and their customer base is the least comfortable with deployments because, historically, those have had the highest chance of going poorly. Currently, they follow a scrum process with planning meetings and backlog grooming, and work in two-week iterations. There’s a code freeze period of 3-5 days so the entire system, or as much as is feasible, can be regression tested and UAT’d (user acceptance tested) before deployment to production.
The second team follows more of a “scrumban” process, working through a prioritized backlog. They hold planning meetings on demand to flesh out stories and answer open questions. They run a one-week iteration with a short code freeze period.
The third team doesn’t completely follow scrum or kanban, but the process would be closest to kanban if choosing between the two. They report on what they worked on and will be working on, but stories flow through from concept to code continually. Once the code has been completed, it’s pushed up as a pull request for review. When two team members have approved and the automated build has passed, the code is automatically merged into master and deployed to the QA environment. Another series of automated tests is executed and if they pass, the code is promoted to the demo and production environments, again, automatically. This group can deploy potentially dozens of times per day with no downtime.
None of these teams got to these points for free. When I started, there was only one team. They were using a bug tracking package called “FlySpray”, and using it poorly. At that point, FlySpray was also not very good at its intended purpose. It appears it may be better now, but I can’t speak to that firsthand. I can tell you that when I first heard of it, it was a fourth- or fifth-page Google result for “Flyspray bug tracking”, so how the team found it and decided to use it is beyond me.
At that point, everything that was tracked, from feature requests and improvements to actual bugs and crash reports, was entered into the system as a “bug”, even though the system allowed other designations. I feel there is a psychological difference in how a developer (and a product owner) looks at an issue labeled as a “bug” compared to an “improvement” or “feature”. This was one of the first things I pushed to change. If the request was about broken functionality, it could be labeled as a bug. If it was to implement previously unimplemented functionality, then it was a feature. I hounded our project managers on this and told developers to push back on issues that were labeled incorrectly.
Developers view a bug as something that is messed up or broken in the system. It’s typically a deficiency in how the code was developed, which leads many developers (not all) to see it as a failure on their part, something they could have or should have addressed. When new functionality is labeled as a bug, some developers start to wonder how they were expected to anticipate these shortcomings, which lowers team morale, since there is no way to fully anticipate everything the product owners or customers might ask for.
The next problem we had was that the bug tracking system was “losing” issues. It was the Hotel California: bugs went in and could never be found again. I should also mention that at this point the team was not following any sort of “Agile” methodology. It was strictly recording bugs, fixing bugs, deploying fixes for bugs, and then, soon thereafter, deploying hotfixes to fix the fixes. Deployments would often last late into the night or early morning with all hands on deck so that when there was a problem, the team could slap together a band-aid to patch it and keep things running. At least until they woke up again and found there was another fire to put out.
These deployments typically took four hours or more, often significantly longer. They would usually be performed with developers and QA onsite, meaning developers and testers would have to run out and grab food before settling into the office for a significant portion of the evening. I started buying the team dinner of their choice in order to make some aspect of it less awful, but the real problem still needed to be addressed. Deployments were too long, too error prone, and too stressful, and there was no good reason for it.
So we focused on improving our process in order to reduce this time. One early change was to have the team participate in “Agile” training, and we switched from FlySpray to RallyDev. This brought in workflows and ideas that were not present before. Around this time, the second project team was formed and work began on implementing that new system. With this team, we put a strong emphasis on automated testing. It wasn’t a mandatory requirement, but a large number of tests were written. TDD was encouraged as well, but not required.
A short while after that, we implemented Jenkins as a continuous integration server to run these tests. We also switched our source code repositories from Subversion to Mercurial. Mercurial provided a lot of functionality that was lacking in SVN. It allowed us to start using feature branches and merges that weren’t going to cause problems when brought together with other code. With the previous SVN repo, it was fairly common for a developer to clobber someone else’s changes by committing to the mainline without bothering to check whether anyone else had checked in work in the meantime. One particular developer was notorious for hammering his code over everything that had been done since he started his work.
A bit later, we found that the way we’d implemented our own Agile processes and the way RallyDev wanted us to work were often at odds with one another, which led to the feeling that we were fighting the issue tracking system. After a number of workarounds and compromises, and a lingering sense that “something’s still not quite right”, we decided we needed to replace our issue tracker yet again.
We looked at a number of alternatives, with FogBugz and JIRA being the top two contenders. The FogBugz philosophy on issue tracking is to get out of the way as much as possible. There are very few required fields to get a new issue into the system, and new issues can even come in via email. This makes it easy to track features and bugs from internal or external customers. When developers work on tickets, they must start by making a time estimate. Then they tell the system when they start work and when they are done. This allows FogBugz to track the accuracy of each developer’s estimates individually. It can then use these historical estimates and actual times to run simulations that determine what a developer is likely to be working on and when they are likely to finish. You don’t even need to be accurate in your estimates for this to be extremely valuable.
Since the system simulates completion times based on your past estimates, it gives back a model of when you’ll be done. Essentially, it runs a Monte Carlo simulation 100 times and then aggregates the results into a probability model. If you are accurate, the projections for when you’ll likely finish probably won’t have much variation. If you are consistent but inaccurate, meaning you often over- or under-estimate by a common factor, there will probably still not be a lot of variation in when the system thinks you’re likely to complete a ticket, but it will take into account that your estimates and reality don’t necessarily align. If you’re all over the place with your estimates, there is likely to be a good deal of variation in the projected completion dates. All in all, I think that sort of evidence-based projection is pretty excellent. Unfortunately, I’ve not had any luck getting FogBugz into the organizations where I’ve worked.
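To make the idea concrete, here’s a minimal sketch of that kind of simulation. This is my own illustration of the concept, not FogBugz’s actual algorithm; the function name, the ratio-sampling approach, and the sample numbers are all assumptions for the example.

```php
<?php
// Toy evidence-based projection: sample from a developer's historical
// actual-to-estimate ratios to turn their current estimates into a range
// of likely completion totals (in hours).
function simulateCompletion(array $pastRatios, array $estimates, int $runs = 100): array
{
    $totals = [];
    for ($i = 0; $i < $runs; $i++) {
        $total = 0.0;
        foreach ($estimates as $estimate) {
            // Assume future misses look like past misses.
            $total += $estimate * $pastRatios[array_rand($pastRatios)];
        }
        $totals[] = $total;
    }
    sort($totals);

    // Report a few percentiles of the simulated totals as the probability model.
    return [
        '50%' => round($totals[(int) ($runs * 0.50) - 1], 1),
        '75%' => round($totals[(int) ($runs * 0.75) - 1], 1),
        '95%' => round($totals[(int) ($runs * 0.95) - 1], 1),
    ];
}

// A developer who consistently underestimates by ~50% still gets a useful projection.
print_r(simulateCompletion([1.4, 1.6, 1.5, 1.3, 1.7], [8, 3, 5]));
```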
Ultimately, we decided to go with JIRA. We imported a lot of the historical tickets from Rally, some of which had been dragged along, kicking and screaming, all the way from FlySpray. JIRA is nice in that it lets you configure nearly every aspect of your workflows, and you can do that at a number of different levels. If you want to, you can set up different workflows on a per-project basis, or even a per-issue-type basis. This meant that for each of our teams, we could set up processes and workflows in our issue tracker that match the way each team has decided to work. The downside is that there’s a lot of configuration that needs to happen before this works smoothly. Fortunately, the provided workflows are a good place to start.
Kaizen
Evolving our workflows in JIRA has been an ongoing process, and so has evolving the processes themselves. I want to ensure that each team is constantly looking for ways to make things better. Part of this is making sure that we don’t treat any part of our process as “sacred”. We may find that certain patterns or ideas work well, hopefully better than all the other things we’ve tried, but even those should be questioned for potential improvements.
Not every experiment will be an improvement, but that’s OK too. The Japanese term “kaizen” means improvement, or continual improvement, and it’s something we strive for all the time. It means we try not to rest on our laurels or stop trying new things just because where we are and what we’re doing is comfortable and working.
For the third team I spoke of, we started out with the idea that we’d like to get to a continuous deployment process. This didn’t come about overnight. When the project started, we had a manual deployment process that was not on a set schedule. Each bit of code could go through the process to production, but at each step along the way there were scripts to run. We eventually got to the point where those scripts ran when you clicked a button in Jenkins.
After a while, and a lot of automated tests, we felt more confident in the process, so we decided to make the pushing of the deployment button more automatic. Pull requests would be merged at the request of the QA team when they were ready to test a particular feature or bug fix. This worked well for a bit, but as the development team grew and the number of things to test increased, we found that waiting for a person to merge code before testing had become a new bottleneck. Most of the code coming from this team included unit tests, so most of it was all but guaranteed to behave correctly. In reality, the majority of the bugs that QA was catching were the odd, one-off edge cases that were unlikely to affect a real customer, because the happy path and standard bug checks had largely been automated. So we decided to upgrade the process again.
This time, we decided that code review was still important and (almost always) necessary, that unit tests, as well as system integration and behavioral tests, were important, that adherence to our coding standards was important, and that having all API tests pass was important. These became jobs in Jenkins which could run and report back to Bitbucket, as well as kick off additional jobs. At this point, much of our workflow has been automated.
From a developer’s point of view, we pull a ticket off the top of the backlog and mark it in progress. The code is developed in a new branch off of master (did I mention we switched from Mercurial to git along the way as well? We did that too, because it allows for even more flexibility in our workflows). Once the code is done, a pull request is created. This starts automated builds which ensure that all the unit tests pass and that the code adheres to our coding standards. We used to do a lot of other static analysis as well, building phpDocumentor docs, running pdepend, and others, but we found that we rarely, if ever, used the results, and several of those tools added a significant amount of time to the build. So we evolved our process to remove them from the main flow. We still run them, but in another Jenkins job that doesn’t affect our “critical path”. Essentially, if the PHPUnit, phpspec, JavaScript unit test, and phpcs jobs complete successfully, we count that as good. We still get stats from copy/paste detection, mess detection, and code coverage, but those happen off to the side in a separate job.
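For illustration, the “critical path” gate amounts to something like the following script. The commands and paths here are placeholders, not our actual Jenkins job configuration.

```php
<?php
// Hypothetical merge-blocking gate: run only the checks that sit on the
// critical path; everything else (coverage, copy/paste and mess detection)
// runs in a separate, non-blocking job.
$checks = [
    'phpunit'  => 'vendor/bin/phpunit',
    'phpspec'  => 'vendor/bin/phpspec run',
    'phpcs'    => 'vendor/bin/phpcs src/',
    'js tests' => 'npm test',
];

$failed = [];
foreach ($checks as $name => $command) {
    passthru($command, $exitCode);
    if ($exitCode !== 0) {
        $failed[] = $name;
    }
}

if ($failed) {
    fwrite(STDERR, 'Blocking checks failed: ' . implode(', ', $failed) . PHP_EOL);
    exit(1); // The build goes red and the pull request stays unmerged.
}
```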
If everything passes, our Webhooks project waits for indications from Bitbucket that the code has received the proper number of approvals from the right people. We also put in place a check to ensure that the code has been properly rebased and squashed. If it’s not rebased, the system lets the developer know in Slack that they should rebase, and actually gives them a button that will do it for them. It will not squash, though; the developer must do that on their own. If all those checks pass, the system merges in the code. This starts the next chain of events: deploying to QA automatically, moving the JIRA story from In Progress to In Review, and kicking off API and BDD tests. If all of that passes, the code is deployed to the Demo and Production environments. All in all, it can take less than 20 minutes to get the code from your brain into production. Similar to the process at places like Etsy, new developers on this team can deploy to production on their first day.
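The decision the webhook service makes boils down to something like this sketch. The payload fields and the helper function are stand-ins I’ve made up for the example, not Bitbucket’s actual webhook schema or our real code.

```php
<?php
// Illustrative merge-gate logic; field names below are placeholders.
function notifySlack(string $user, string $message): void
{
    // Stub: the real system posts to a Slack channel via an incoming webhook.
    error_log("[slack -> {$user}] {$message}");
}

function readyToMerge(array $pullRequest, int $requiredApprovals = 2): bool
{
    if ($pullRequest['commits_behind_master'] > 0) {
        // Not rebased: nudge the author (the real bot also offers a rebase button).
        notifySlack($pullRequest['author'], 'Please rebase your branch onto master.');
        return false;
    }

    // Merge only when enough reviewers have approved and the build is green.
    return $pullRequest['approvals'] >= $requiredApprovals
        && $pullRequest['build_status'] === 'successful';
}
```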
All of our teams currently use some version of the approval and build-check => merge process. They differ in how many approvals must be in place before merging, along with other team-specific approvals by certain individuals. This has all evolved out of what works for each team, and it’s set up to be easily changeable. It allows a team that may not yet be ready for something like continuous deployment to work towards that as a goal and get used to the mindset needed to make it work without causing problems.
A Word on Continuous Deployment
Personally, I am a big fan of continuous deployment. It allows code to move quickly out to customers, where you can receive feedback right away. It means you can see the impact of your changes immediately rather than days, weeks, or months later. However, it means that features, stories, and sometimes even bug fixes may need to be broken out into a series of sequential steps, each of which gets you closer to the complete change but individually can be done with little or no customer-visible impact on the running system. This takes a lot of discipline to do right. Ideally, stories are small and can be done with no impact on any other stories, but in practice, this is not always the case. Let’s take a quick look at an example.
Suppose you need to change the system so the database can store time zone information where previously it has been left out. If you’re doing a big bang deployment, where there must be coordination and downtime, this is pretty simple: you make the changes to the code, the API, and the database, and you write a query or script to convert the database field to use time zones, which runs when you deploy. This may mean downtime on the site, depending on how long it takes to convert the data.
If you’re doing continuous deployment, you don’t want this downtime, since deployments can happen at any time, not just when the application is less busy. It requires breaking the work into small, ordered pieces, each of which can be deployed without the site needing downtime. It may be more work overall, but I feel it’s worth it. Here’s an example of how to approach the same problem with a continuous deployment mindset. Keep in mind that each step would be deployed completely before pulling in and deploying the next step.
- Add a new time zone column to the database
- Script to bulk migrate old non-timezone column to new one
- Triggers or functions to keep writes to old column synced into new column
- Change APIs and code to return the new column, leaving the old one in place
- Change the UI and other clients to use the new column, removing references to the old column
- Change the API to stop returning the old column
- Remove the synchronization code and old column from the database
If these steps are done in this order, there’s no need for any downtime. If some steps take a long time (like data conversions or adding new columns), they can take as long as needed while the site remains up and running. Since nothing relies on the new database pieces until they are fully deployed, we have no problems there. And because we remove references to the old column from the top down, by the time it’s removed at each level nothing relies on it, so there’s no breakage either.
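As a rough sketch, the first few steps might look like the following, assuming MySQL via PDO and a made-up `events` table whose old `starts_at` DATETIME holds local time; the table, column, and zone names are placeholders, and named zones require MySQL’s time zone tables to be loaded.

```php
<?php
// Sketch of the first three bullets; each step ships on its own deploy.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'app_user', 'secret');

// Step 1: add the new UTC column. Nothing reads or writes it yet, so this
// deploy is invisible to the running application.
$pdo->exec("ALTER TABLE events ADD COLUMN starts_at_utc DATETIME NULL");

// Step 2: backfill existing rows in small batches so no long locks are held.
do {
    $updated = $pdo->exec("
        UPDATE events
           SET starts_at_utc = CONVERT_TZ(starts_at, 'America/Chicago', 'UTC')
         WHERE starts_at_utc IS NULL
         LIMIT 1000
    ");
    usleep(100000); // give other queries and replication room to breathe
} while ($updated > 0);

// Step 3: keep writes to the old column flowing into the new one until every
// reader and writer has moved over. (An UPDATE trigger twin is needed too,
// and in practice you would put the sync in place before the backfill
// finishes so no rows slip through.)
$pdo->exec("
    CREATE TRIGGER events_insert_sync_tz BEFORE INSERT ON events FOR EACH ROW
        SET NEW.starts_at_utc = CONVERT_TZ(NEW.starts_at, 'America/Chicago', 'UTC')
");
```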
Don’t Stop Looking for Ways to Do It Better
Chances are, if you’re doing some or all of these, you feel that things are pretty darn good. And they are. The process and flow we have at this company is much better than what I’ve had at previous companies. But that doesn’t mean it cannot be better. I challenge you to continue to look at your processes, find ways you think might make things better, and give them a shot. If something doesn’t work, that’s OK; stop doing the things that don’t work or don’t help, and keep doing the things that do. If you’re consistent about making these improvements all the time, there’s no practical limit to how good things can be. I often hear from former employees that what we have in terms of process is the best they’ve seen anywhere they’ve worked, both before and since. It’s usually at this point that I get to tell them all the things we’ve changed since they left and how much better it is now. Keep improving and leveling up. See you next month.