Should You Remove Outliers from Your Data Set When Forecasting Your Delivery Times?
To toss or not to toss, that is the question. At least, it certainly is when it comes to outliers in your data. Is it okay to remove outliers from your data set?
The other day, a client asked me: “Do you think removing the outliers from our analytics could help us improve the accuracy of the forecasting process?”
Rather than giving you a simple ‘yes’ or ‘no’ answer, I thought it would be useful to break this question down. That way, if you ever ask yourself the same question, you can first make sure you have the right motivation in place and then walk away with some actionable insights on how to improve your forecasting process.
Should you remove outliers from your data set when forecasting your delivery times? What are outliers really? And does it matter what you do with them? They clearly don’t fit, so can you just throw them away and carry on?
Not so fast there.
Here’s the bottom line: outliers still represent completed work items. They’re just items that have taken much longer to complete than the rest of the work you’re handling. And if ignored, they have the potential to seriously hinder your ability to make reliable delivery forecasts.
What Do Outliers in Your Data Really Mean?
Essentially, outliers are data points floating way off from the trend, or the pattern, or wherever else the other data points are hanging out.
And you can easily spot them in the Cycle Time Histogram.
The Cycle Time Histogram shows the frequency distribution of the completion times of the tasks in your workflow. The horizontal axis shows your cycle time and the vertical axis displays the number of tasks that you delivered with the same cycle time.
Let’s analyze the following example:
The item on the far right of the diagram took more time than any of the other completed tasks. It was finished in 108 days. It is an outlier.
Here at Nave, we track your cycle times by processing the card activities on your board. So, if a work item has a cycle time of 108 days, what this means is that the elapsed time between the moment the card entered your workflow and the time when it exited it is 108 days. In practice, you needed 108 days to deliver that particular work item.
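To make the definition concrete, here is a minimal sketch of that calculation. The function name and the dates are hypothetical, and it assumes cycle time is counted as plain elapsed whole days between entry and exit; your own tooling may round or count days differently.

```python
from datetime import datetime


def cycle_time_days(entered: datetime, exited: datetime) -> int:
    """Elapsed whole days between a card entering and exiting the workflow."""
    if exited < entered:
        raise ValueError("exit time precedes entry time")
    return (exited - entered).days


# Hypothetical card: entered the workflow on 1 March 2023, exited on 17 June 2023.
entered = datetime(2023, 3, 1)
exited = datetime(2023, 6, 17)
print(cycle_time_days(entered, exited))  # 108
```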
Here is another instance of an outlier in this data set.
This work item needed 81 days to be finished. Even though it is not the most time-consuming one, it still took longer to deliver than 98% of the other work items, so it also qualifies as an outlier.
The Role of Outliers in Your Forecasting Process
Now, let’s take a step back to indicate why the outlier conversation actually matters and what role outliers have when you’re forecasting the delivery times of your work items.
Let’s say that we have a new item and we need to know when it will be done.
The percentile lines on the Cycle Time Histogram show the probability of tasks being completed within a certain cycle time. The higher the percentile you base your commitment on, the more likely you are to deliver on it, at the price of committing to a longer delivery time.
In this scenario, we know that we can finish any type of work in less than 10 days and there is an 85% certainty that we’ll keep that promise.
However, there is still a 15% chance that the item will take anywhere between 10 and 108 days. This is fragile.
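If you'd like to see where numbers like these come from, here is a small sketch. The cycle times are made up for illustration (they only borrow the 81-day and 108-day outliers from the example), and the nearest-rank method is an assumption on my part; analytics tools may compute percentiles slightly differently.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observed value that at least pct% of the items did not exceed."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no cycle times to analyze")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical cycle times (in days) of 20 completed work items.
cycle_times = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 12, 81, 108]

print(percentile_nearest_rank(cycle_times, 85))  # 10
print(percentile_nearest_rank(cycle_times, 95))  # 81
print(percentile_nearest_rank(cycle_times, 98))  # 108
```

With this sample, 85% of the items finished in 10 days or less, but a 95% or 98% commitment has to account for the long tail.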
And this analysis is especially important when you’re forecasting items with a high cost of delay. For example, if you commit to delivering an item with a Fixed Delivery Date class of service, you want to have 95%, even 98% certainty that you’ll meet the deadline.
Should You Remove Outliers from Your Data When Forecasting Your Delivery Times?
What will happen if you remove the outliers from this data set? Essentially, you left-shift your cycle time distribution and, as a consequence, the values of each percentile go down.
Removing outliers will help you make the numerical results look better, certainly, but it will not help you to improve the accuracy of your forecast.
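Here is a rough sketch of that effect, using made-up cycle times and a nearest-rank percentile (both are assumptions for illustration, not the way any particular tool works). It compares the percentiles with and without the two outliers, then checks how often the "cleaned" 95% forecast would actually have held against the full history.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observed value that at least pct% of the items did not exceed."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no cycle times to analyze")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical cycle times (in days); 81 and 108 are the outliers.
cycle_times = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 12, 81, 108]
trimmed = sorted(cycle_times)[:-2]  # the same data with both outliers dropped

for pct in (85, 95, 98):
    full = percentile_nearest_rank(cycle_times, pct)
    cleaned = percentile_nearest_rank(trimmed, pct)
    print(pct, full, cleaned)
# 85 10 9
# 95 81 12
# 98 108 12

# How often would the "cleaned" 95% forecast have held in reality?
trimmed_p95 = percentile_nearest_rank(trimmed, 95)
hit_rate = sum(ct <= trimmed_p95 for ct in cycle_times) / len(cycle_times)
print(hit_rate)  # 0.9
```

In this sample, the cleaned data promises 12 days with 95% certainty, yet only 90% of the work that was actually delivered met that number.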
Here’s the thing. The outliers actually happened. These are actual use cases in your own business context and if you don’t take action to resolve the problems behind these outliers, they will inevitably happen again.
So, the first thing that needs to happen is for you to understand what they are. And for that purpose, you need to keep them visible, within your own data.
Just a side note here. The only scenario where it would make sense to remove data points is if they were caused by an error (regardless of whether they are outliers or not). However, since your analytics are derived from the activities on your board, such errors usually mean the board wasn’t kept up to date, and that is an opportunity for improvement in itself.
Now, to make this explicit, if the points in your data set are not errors, you should never remove them!
How to Improve Your Forecasting Accuracy (Without the Need to Remove Outliers)
Instead, what you should do is analyze the outliers in your system and take action accordingly, so that you don’t make the same mistakes again.
One piece of advice I can give you, which served me so well in our journey towards sustainable predictability, is to perceive shortcomings as opportunities for improvement.
Behind every failure lies a lesson to be learned. The more you embrace it, the faster you’ll improve your performance.
You need to know what the outliers are. Why did they appear? Ask the question: “What happened that this particular piece of work took an unpredictably long time to be delivered?”
Each and every outlier should be analyzed from this perspective. Revealing the obstacles that hinder your delivery times and tweaking your management practices accordingly is exactly what will enable you to eliminate the outliers.
In order to improve the accuracy of your forecasts, your main focus should move towards optimizing your workflows for predictability.
It should move towards introducing the management principles and practices that will enable consistent delivery results. And integrating the process of evaluating and resolving the outliers in your data is a foundational step in this process.
Here is your action item: Analyze your Cycle Time Histogram regularly (e.g. once a month or after each delivery). During the analysis, do two things: look at the cumulative cycle time data, which builds on the previous period, and measure cycle time separately for the new period. This way you can compare how the improvements you introduced affect your cycle time distribution.
Do you still observe outliers? If you do, do they have the same root cause as before, or are there new challenges to be handled? How has your cycle time trended since your last analysis? Do you observe improvement?
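As a rough illustration of that review, the sketch below looks at a previous period, a new period, and the two combined. All the numbers are hypothetical, and the nearest-rank percentile is an assumption on my part rather than a description of any specific tool.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observed value that at least pct% of the items did not exceed."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no cycle times to analyze")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical cycle times (in days) for two consecutive review periods.
previous = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 12, 81, 108]
new = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 14]

periods = {
    "previous": previous,
    "new": new,
    "cumulative": previous + new,  # builds on the previous period
}

# 85th and 95th percentile per view, so the new period can be compared
# against both the old period and the overall history.
summary = {
    label: (percentile_nearest_rank(data, 85), percentile_nearest_rank(data, 95))
    for label, data in periods.items()
}

for label, (p85, p95) in summary.items():
    print(f"{label}: 85% within {p85} days, 95% within {p95} days")
```

In this made-up sample the new period has no extreme outliers, so its 95th percentile is far lower than the cumulative one, which still carries the earlier 81-day and 108-day items.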
And if you’re struggling to improve your performance and you’re willing to explore the proven roadmap to building predictable workflows, I’d be thrilled to welcome you to our Sustainable Predictability program.
Simply removing outliers from your data set without understanding the root cause behind them doesn’t make sense. Instead, analyze each and every one of them regularly. Then, tweak your management practices to make sure you don’t end up in the same situation again.
This approach will not only improve the accuracy of your forecasting process, but it will also help you take one step further towards achieving sustainable predictability.
Meet the Author
Sonya Siderova is a passionate product manager and a driving force behind Nave, a Kanban analytics suite that helps teams improve their delivery speed through data-driven decision making. When she's not catering to her two little ones, you might find Sonya absorbed in a good heavyweight boxing match or behind a screen crafting a new blog post.