How to Handle Outliers in Data: The Path to Improved Predictability
To keep or discard, that’s the dilemma in handling data outliers. Should they be removed?
A client recently asked, “Would eliminating these outliers make our forecasts better?”
Instead of giving you a simple “yes” or “no,” let’s break it down so that next time you ask yourself the same question, you’ll know why and how to improve your forecasts.
Is it advisable to eliminate data outliers when forecasting delivery times? What are these outliers, and how significant are they? They don’t quite fit, so should you just ignore them?
Not so fast, my friend.
Here is the thing. Those data points, even though they took much longer to complete, are still part of the work. Disregarding them could significantly affect your capability to produce reliable delivery forecasts.
How to Identify Outliers in Your Data
In essence, outliers are data points that stand out significantly from the typical pattern in your dataset.
Detecting them is straightforward when using the Cycle Time Histogram, a visual tool that shows how long it takes to finish tasks in your workflow. On this chart, the horizontal axis represents the cycle time, while the vertical axis displays the number of tasks that were completed in that amount of time.
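If you want to see the idea behind the chart outside of a tool, here is a minimal sketch of how such a histogram could be built. The cycle times below are made-up numbers for illustration, and `cycle_time_histogram` is a hypothetical helper, not something Nave exposes.

```python
from collections import Counter


def cycle_time_histogram(cycle_times_days):
    """Map each cycle time (in days) to the number of tasks finished in that time."""
    return dict(sorted(Counter(cycle_times_days).items()))


# Hypothetical cycle times in days, one entry per completed task.
cycle_times = [2, 3, 3, 4, 4, 4, 5, 7, 81, 108]
histogram = cycle_time_histogram(cycle_times)

# Text version of the chart: cycle time on one axis, task count on the other.
for days, tasks in histogram.items():
    print(f"{days:>3} days | {'#' * tasks} ({tasks})")
```

The long gap between the 7-day bar and the 81-day and 108-day bars is what makes the far-right items stand out.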
Any task that takes longer to complete than 98% of the others is considered an outlier.
Let’s examine the example below:
In this diagram, you can see an item on the far right that took considerably more time to complete than any other task—it was finished in 108 days. This is an outlier.
At Nave, we track cycle times by monitoring the card activities on your board. So, when a work item’s cycle time is 108 days, it means the time between entering and exiting your workflow was 108 days. Essentially, it took 108 days to deliver that specific task.
Here’s another case of an outlier in this dataset:
This work item required 81 days to complete. Although it may not be the most time-consuming one, it sits beyond the 98th percentile, which qualifies it as an outlier.
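The 98% rule can be sketched in a few lines. This is only an illustration under stated assumptions: the dataset is invented (98 typical items plus the 81-day and 108-day items from the examples), and the percentile uses the nearest-rank method, which may differ from how your analytics tool calculates it.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of items at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def find_outliers(values, pct=98):
    """Return the items whose cycle time is above the given percentile."""
    threshold = percentile(values, pct)
    return [v for v in values if v > threshold]


# Hypothetical dataset: 98 typical items (1 to 49 days) plus two long-running ones.
cycle_times = [1 + i // 2 for i in range(98)] + [81, 108]

threshold = percentile(cycle_times, 98)
outliers = find_outliers(cycle_times)
print(f"98th percentile: {threshold} days, outliers: {outliers}")
```

Note that the 81-day item is flagged even though it is not the longest one, because it still sits above the threshold.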
What’s the Best Way to Handle Data Outliers?
What happens if you eliminate outliers from this dataset? Essentially, you shift the distribution of cycle times to the left, leading to a decrease in percentile values.
Deleting outliers might make your numbers look better, but it won’t make your predictions more accurate!
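To make the shift concrete, here is a small sketch that compares percentiles with and without the outliers. The dataset is hypothetical (the same invented numbers as a 100-item board with two long-running items), and the nearest-rank percentile is an assumption, so treat the exact figures as illustrative only.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of items at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical dataset: 98 typical items (1 to 49 days) plus two outliers.
cycle_times = [1 + i // 2 for i in range(98)] + [81, 108]

# Drop everything above the 98th percentile, as if the outliers never happened.
threshold = percentile(cycle_times, 98)
trimmed = [v for v in cycle_times if v <= threshold]

# For each percentile: (value with outliers kept, value with outliers removed).
comparison = {
    pct: (percentile(cycle_times, pct), percentile(trimmed, pct))
    for pct in (50, 85, 95)
}
for pct, (kept, removed) in comparison.items():
    print(f"{pct}th percentile: {kept} days kept vs {removed} days removed")
```

In this made-up data the 85th and 95th percentiles each drop by a day once the outliers are gone. The forecast looks tighter, yet nothing about the way the work actually flows has changed.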
Here’s my point: these outliers actually happened. And if you don’t take action to address the issues behind them, they are likely to happen again.
So, the first step is to understand what these outliers mean. To do that, you need to keep them in your data.
As a quick note, it’s important to remember that removing data points is only advisable when they result from errors, and that’s valid regardless of whether they’re outliers or not.
Because the outliers in your analytics are tied to your board activities, even errors such as cards not being updated on the board reveal an opportunity for improvement.
To be clear, if the data points in your dataset aren’t errors, they should never be removed.
How to Shift the Conversation Altogether
Instead, what you should do is analyze the outliers in your system and take action accordingly, so that you don’t make the same mistakes again.
One piece of advice I can give you, which has served me well in our journey towards sustainable predictability, is to perceive shortcomings as opportunities for improvement.
Behind every failure lies a lesson to be learned. The more you embrace it, the faster you’ll improve your performance.
You need to know what the outliers are. Why did they appear? Ask the question: “What happened that made this particular piece of work take an unpredictably long time to be delivered?”
Each and every outlier should be analyzed from this perspective. Revealing the obstacles that hinder your delivery times and tweaking your management practices accordingly is exactly what will enable you to eliminate the outliers.
To make your forecasts more accurate, focus on making your workflows more predictable.
This means adopting management principles and practices that lead to consistent delivery results. Spoiler alert! An essential part of this journey is incorporating the evaluation and resolution of data outliers.
Here’s your actionable step: Set up a routine for regularly analyzing your Cycle Time Histogram (e.g., monthly or after each delivery). During the analysis, do two things: look at the cumulative cycle time data, which builds on the previous period, and measure cycle time separately for the new period. This way, you can compare how the improvements you introduced affect your cycle time distribution.
Are you still encountering outliers? If so, do they share the same underlying causes, or are new challenges emerging? How has your cycle time trended since your last analysis? Do you observe improvement?
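As a rough sketch of that routine, the snippet below compares the previous period, the new period on its own, and the cumulative view. All numbers are invented, `review` is a hypothetical helper, and the nearest-rank percentile is an assumption rather than the way any particular tool computes it.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of items at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def review(previous_period, new_period, pcts=(50, 85, 95)):
    """Percentiles for the previous period, the new period, and both combined."""
    cumulative = previous_period + new_period
    return {
        pct: {
            "previous": percentile(previous_period, pct),
            "new": percentile(new_period, pct),
            "cumulative": percentile(cumulative, pct),
        }
        for pct in pcts
    }


# Hypothetical cycle times in days for two consecutive review periods.
previous = [3, 4, 5, 5, 6, 7, 8, 9, 12, 40]
new = [2, 3, 3, 4, 4, 5, 5, 6, 7, 9]

report = review(previous, new)
for pct, row in report.items():
    print(f"{pct}th percentile: {row}")
```

If the new period's percentiles are lower than the previous ones and the cumulative view is moving in the same direction, the changes you made are likely paying off. If a long-running item shows up again, that is your cue to dig into its cause.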
And if you’re struggling to improve your performance and you’re willing to explore the proven roadmap to building predictable workflows, I’d be thrilled to welcome you to our Sustainable Predictability program.
Eliminating outliers without grasping their underlying causes isn’t a prudent strategy. Regularly analyze each outlier and adjust your management practices to prevent their recurrence.
This approach not only improves forecasting accuracy but also brings you closer to achieving sustainable predictability.
That’s all for today, my friend. I’ll catch up with you next week, same time and place, for more managerial insights. Have a productive week ahead!
Meet the Author
Sonya Siderova is a passionate product manager and a driving force behind Nave, a Kanban analytics suite that helps teams improve their delivery speed through data-driven decision making. When she's not catering to her two little ones, you might find Sonya absorbed in a good heavyweight boxing match or behind a screen crafting a new blog post.