A taxonomy of software defects

Defects play a significant role in software development. Those little critters keep crawling back into our beautifully crafted products.

When I was a young programmer, each new defect felt like a new adventure, a new monster to be slain. After all these years doing this job, every defect looks like an old acquaintance.

Recognizing the different defect types has become a crucial part of my software development routine.

A quick online search for "software bug taxonomy" yields broad classifications such as functionality, performance, usability, security, and compatibility. These groupings primarily address the type of expectations impacted by the nonconformance.

Such taxonomies can help teams manage the expectations of users and stakeholders regarding defect resolution timelines and mitigation strategies.

Management often responds to defects based on their perceived impact severity. Benign defects may be disregarded, while even minor errors resulting in significant financial or credibility losses can provoke exaggerated corrective actions.

Yet in my opinion, focusing on the consequences of defects is insufficient for improving defect prevention practices and tools. My own approach leans more towards identifying the nature of the original developer mistake.

Whether a bug affects minor functionality or poses a significant risk, the tools I use to prevent it from reaching the user remain the same. And allowing any bug to slip through reflects just as badly on my development quality practices.

So how do we go about classifying a defect?

We play a little "Guess Who?" game with the defect, asking a series of questions to pinpoint its type.

This identification can guide our responses to the defect, and help determine which tools could effectively catch similar issues in the future.

Guess who?

Does the code work as expected, in the expected context?

Our code is basically an algorithm expecting to be executed in a specific context.

When coding, we anticipate the possible execution contexts and the expected results.

Yet, ensuring code behaves precisely as intended isn't always a given. Depending on language and tools used, it might even be quite an accomplishment in itself.

Are the contracts between code units respected?

Code units establish request/response contracts with their client code.

Is some function unhappy with the way our code units are calling it? Maybe we respected the request contract, and the function still gave us an unexpected response?

Contract bug

Usually, when contracts aren't adhered to, execution results in errors or unexpected behavior. I call this type of bug a contract bug.

Examples:
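A contrived TypeScript sketch (the pricing scenario is invented for illustration):

```typescript
// The provider's contract: the amount must be in cents.
function formatPrice(amountInCents: number): string {
  return `$${(amountInCents / 100).toFixed(2)}`;
}

// Contract bug: the client passes dollars where cents are expected.
// The types line up, so the compiler is happy, but the contract is broken.
const price = 19.99; // dollars
console.log(formatPrice(price)); // "$0.20" instead of "$19.99"
```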

The most basic of basic bugs.

Resolving contract bugs involves ensuring adherence to the contract between code units, either by adjusting the contract of the provider or updating the client code accordingly.

Algorithm bug

Our code units interact smoothly, but there may be errors within our algorithms.

Examples:
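A contrived sketch of the classic off-by-one (invented for illustration):

```typescript
// Intent: sum the first n elements of the array.
function sumFirst(values: number[], n: number): number {
  let total = 0;
  // Off-by-one bug: `<=` takes one element too many.
  for (let i = 0; i <= n; i++) {
    total += values[i];
  }
  return total;
}

console.log(sumFirst([1, 2, 3, 4], 2)); // expected 3 (1 + 2), got 6 (1 + 2 + 3)
```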

A slightly more evolved bug, but still pretty basic. These are what I call "stupid programmer mistakes". We tend to commit them at a rate of a dozen per minute, particularly when sober.

The solution lies in implementing the correct algorithm (duh).

Does the context conform to expectations?

So first, congratulations are in order. We actually wrote code that behaves the way we think it does, in the context we expect. If we're writing code in C++ or JavaScript, we're already part of the elite.

However, we still need to classify our defect.

Could it be that the context in which our code executes doesn't align with our expectations?

Is the context valid?

It's possible our code executed in an unexpected context, but that does not mean the context is at fault.

Maybe we didn't think of all the legitimate real-world scenarios?

Incomplete algorithm bug

This type of bug, which I call an incomplete algorithm bug, arises when the context is in an unforeseen yet legitimate state, causing our code to fail.

Example:
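A toy illustration (the name-parsing scenario is mine): the algorithm is correct for the contexts we thought of, and fails in a legitimate one we didn't.

```typescript
// Intent: extract initials from a full name.
function initials(fullName: string): string {
  const [first, last] = fullName.split(" ");
  return first[0] + last[0];
}

console.log(initials("Ada Lovelace")); // "AL"
// A single-word name is a perfectly legitimate context
// the algorithm never anticipated:
console.log(initials("Plato")); // runtime TypeError: last is undefined
```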

The remedy is simple: adapt our code to function seamlessly within this legitimate context.

Validation bug

Much of our code context originates from external sources such as user inputs or external services. It's impossible to control every input or foresee all potential data from these sources.

Sometimes the problem comes from invalid data infiltrating our execution context.

Perhaps the user entered incorrect information into a form, or an external service provided a response that doesn't adhere to the expected schema?

Regardless of the cause, the problem lies in our code's failure to validate its context before execution.

Examples:
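A sketch of unvalidated external data slipping in (the service and its schema are hypothetical):

```typescript
// The response schema we expect from an external service.
interface UserResponse {
  name: string;
  age: number;
}

// Validation bug: we trust the external payload blindly.
function parseUser(json: string): UserResponse {
  return JSON.parse(json) as UserResponse; // no check at the boundary
}

// The service answers with a different schema...
const user = parseUser('{"name": "Ada"}');
// ...and the invalid data infiltrates our context,
// blowing up far from its source.
console.log(user.age.toFixed(0)); // runtime TypeError: age is undefined
```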

The solution is to prevent our code from being invoked with invalid values.

State machine defect

Sometimes no external sources are responsible for an unexpected context.

Consider our code as a vast state machine: each code unit takes the current state (context) and produces a new state (the context with updated information).

Every state transition might be correctly implemented, but our state machine might still be flawed.

In essence, while our code units function in isolation, they don't play nice together.

They contradict each other, leading to inconsistent or even impossible states.

Examples:
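A toy sketch (the order workflow is invented): two units, each correct in isolation, that together allow a contradictory state.

```typescript
type OrderStatus = "pending" | "shipped" | "cancelled";

interface Order {
  status: OrderStatus;
  refunded: boolean;
}

// Unit A: cancelling an order triggers a refund. Correct in isolation.
function cancel(order: Order): void {
  order.status = "cancelled";
  order.refunded = true;
}

// Unit B: shipping an order. Also correct in isolation.
function ship(order: Order): void {
  order.status = "shipped";
}

// Together they allow a contradictory state: nothing prevents
// shipping an order that was already cancelled and refunded.
const order: Order = { status: "pending", refunded: false };
cancel(order);
ship(order);
console.log(order); // { status: "shipped", refunded: true }
```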

The solution lies in redesigning the state machine to either prevent unexpected states/transitions or ensure compatibility across all code units.

Is there a problem with time?

So our code executes correctly in the expected context, validates its external inputs, and the state machine is flawless.

What could possibly still go wrong?

Well, one of the biggest enemies of developers is Time itself.

Did things not happen in the expected order?

With Time, the problems usually fall into two categories: either things don't happen in the expected order, or they happen in order but take too long.

Scheduling defect

Despite a well-designed state machine, transitions may not occur in the expected order.

This can lead to two scenarios: either our code rejects a transition as invalid (we're lucky), or the transition executes but leaves the context in an invalid or corrupted state (more likely).

Example:
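The classic stale-response race, sketched with an invented search-as-you-type scenario, along with one possible fix:

```typescript
// Each keystroke fires a request to a simulated external
// service whose latency varies from call to call.
let currentResults: string[] = [];

function search(query: string): Promise<string[]> {
  const latency = query.length === 1 ? 300 : 100; // "a" answers slower than "ab"
  return new Promise((resolve) =>
    setTimeout(() => resolve([`results for "${query}"`]), latency)
  );
}

// Scheduling defect: the slow, stale response for "a" arrives
// last and overwrites the fresh results for "ab".
search("a").then((results) => { currentResults = results; });
search("ab").then((results) => { currentResults = results; });
setTimeout(() => console.log(currentResults), 500); // [ 'results for "a"' ] (stale!)

// One classic fix: tag each request and reject stale responses.
let latestRequest = 0;
async function searchSafely(query: string): Promise<void> {
  const requestId = ++latestRequest;
  const results = await search(query);
  if (requestId === latestRequest) currentResults = results;
}
```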

The solution is to make your code robust to out-of-order events. You can either explicitly reject them to protect the context or find a way to accept them out of order.

Time performance defect

If everything occurs in the right order and we still have a problem with time, the only explanation left is that our code takes too damn long to execute.

Most likely, this results in user frustration, as sluggishness renders the product unusable. But it could also result in nasty errors, as calls to external services time out and our application crashes under the weight of delayed processing.

Example:
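A sketch (the ID-matching scenario is invented): same result, wildly different time complexity.

```typescript
// Intent: find which of the new IDs already exist.
// Correct, but quadratic: painful once both lists grow large.
function findExistingSlow(newIds: string[], knownIds: string[]): string[] {
  return newIds.filter((id) => knownIds.includes(id)); // O(n * m)
}

// Same result in near-linear time: the kind of fix a time
// performance defect usually calls for.
function findExistingFast(newIds: string[], knownIds: string[]): string[] {
  const known = new Set(knownIds); // built once, O(m)
  return newIds.filter((id) => known.has(id)); // O(n)
}
```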

The solution is to either improve your code's performance or make it resilient to poor performance.

Cost performance defect

While our code may provide value to users, sometimes the cost of operating it outweighs the price those users are willing to pay for our services.

Example:
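A sketch with a hypothetical pay-per-request geocoding service (the API and its billing model are invented):

```typescript
// Stand-in for a paid geocoding API, billed per request.
let billedCalls = 0;
async function geocode(address: string): Promise<{ lat: number; lon: number }> {
  billedCalls += 1;
  return { lat: 0, lon: 0 }; // placeholder coordinates
}

// Cost performance defect: one billed call per row, even though
// most rows share the same handful of addresses.
async function geocodeAllNaive(addresses: string[]) {
  return Promise.all(addresses.map((a) => geocode(a)));
}

// Cheaper: cache the promises so each unique address is billed once.
async function geocodeAllCached(addresses: string[]) {
  const cache = new Map<string, Promise<{ lat: number; lon: number }>>();
  return Promise.all(
    addresses.map((a) => {
      if (!cache.has(a)) cache.set(a, geocode(a));
      return cache.get(a)!;
    })
  );
}
```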

The solution is to reduce operation costs, which may involve optimizing the code to consume fewer resources, caching or batching calls to expensive external services, or moving to cheaper infrastructure.

Alternatively, we could increase our own prices, ideally after trapping customers in our product.

Usability defect

Our code works, but it's either unknown to most users or they struggle to use it effectively, so it isn't utilized as much as it could be, if at all.

The solution likely involves improving the design of our product, or enhancing communication to ensure users understand its capabilities and how to use it effectively.

Product knowledge defect

The good news is, our code actually works.

The bad news is, nobody cares.

We did everything perfectly except understanding the users' expectations.

Some users may use the feature once and never return, while others keep requesting similar features without realizing they already exist within our product.

To address this issue, we need to revisit understanding what our users truly need and ensure that our product aligns with those needs effectively.

Safety defect

Product safety encompasses a broad range of defects, including issues related to privacy, data/work loss, and physical safety.

Privacy: Did our product just leak our users' personal data online?

Data/work loss: Can our users lose hours of work and personal data with a single click, without confirmation? (While this could be classified as a usability problem, data loss often requires significant effort to fix and can have a severe psychological impact on users.)

Physical safety: While not applicable to all software products, some may pose physical risks to users due to unsafe design, leading to potential harm or undue stress.

No problem

Well, if you're here, you have no problem. Why are you here?

Complete questions tree

[Image: the complete question tree]

Defects taxonomy

The final list of defects looks like this:

- Contract bug
- Algorithm bug
- Incomplete algorithm bug
- Validation bug
- State machine defect
- Scheduling defect
- Time performance defect
- Cost performance defect
- Usability defect
- Product knowledge defect
- Safety defect

The intentional switch from "bugs" to "defects" underscores an important distinction:

Bugs are trivial issues that should ideally never reach the user. Their presence in the product reflects software development malpractice. Unfortunately, a lot of experienced developers still deliver trivial bugs in their day-to-day work.

Defects, on the other hand, represent significant nonconformance problems. They come with their own complexity and require constant prevention efforts. While they are never entirely avoidable, many practices can significantly reduce their occurrence rate.

Not all defects have the same complexity, nor the same nonconformance or fixing costs

It's crucial to acknowledge that not all defects are created equal in terms of complexity, or the costs associated with fixing or avoiding them.

Contract bugs are frequently encountered during development, but they typically cost next to nothing to address. With robust development tools and practices, these bugs are often immediately caught, and their resolution is straightforward. In fact, most of them are glaringly obvious if developers take the time to run their code before delivery.

State machine defects, on the other hand, are often identified during peer reviews. Developers who are unfamiliar with a code unit or lack sufficient quality practices may inadvertently make modifications that disrupt other parts of the state machine. Addressing these defects often requires intervention from an expert to explain the issue, making them more complex and costly to resolve compared to contract bugs.

Product knowledge defects are commonly encountered in product teams, and they may not be discovered until long after the feature is delivered to the user. These defects can be incredibly costly to fix, and in some cases they may even be impossible to rectify, remaining in the product indefinitely.

In essence, each main category of defects implies an order of magnitude difference in terms of complexity and cost. For instance, algorithm bugs are typically ten times easier to resolve than state machine defects, which are themselves ten times easier than time defects, and so on. Understanding this hierarchy can help prioritize efforts to prevent and address defects effectively.


Impact on tools' return on investment

Each tool comes with its own setup and operational costs, and it's essential to consider these factors when assessing their effectiveness.

Defect prevention tools should offer a good return on investment based on both their cost and their ability to catch defects. Here's how the equation typically works:
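Roughly, as a back-of-the-envelope sketch (the parameter names and the linear model here are mine, not a precise formula):

```typescript
// Rough linear model of a defect prevention tool's return on investment.
function toolRoi(
  defectsCaughtPerYear: number,
  avgDefectCost: number, // what one escaped defect would have cost us
  setupCost: number,
  operationalCostPerYear: number,
  years: number
): number {
  const value = defectsCaughtPerYear * years * avgDefectCost;
  const cost = setupCost + operationalCostPerYear * years;
  return value - cost; // positive: the tool pays for itself
}
```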

Ultimately, the best tools strike a balance by having low operational costs relative to the costs of the defects they catch.

How to use this taxonomy

At a glance, this defect taxonomy serves as a comprehensive overview of potential challenges in day-to-day software development.

Software development inherently entails complexity (even when developers don't add unneeded complexity themselves), spanning various levels of conceptual intricacy: from simple algorithm bugs, through integration and performance issues, to profound misunderstandings of user expectations.

Selecting effective defect detection tools

Software developers have access to a range of tools for detecting and preventing defects, each with its strengths, costs, and scope of application.

By categorizing the defects encountered during development, it becomes possible to pinpoint the most suitable tool for prevention.

Reflect on why current tools failed to identify certain defects. Is there a gap in our toolkit that requires a new tool? Have we overlooked a crucial step in utilizing existing tools effectively?

Assessing our quality approach

Furthermore, this taxonomy facilitates an evaluation of our current delivery quality.

Recognize that not all defects are equal; while some are trivial and should never reach the final product, others demand significant design considerations or stem from broader product design practices beyond our immediate control.

Conclusion

Next, I'll look at some of our defect prevention tools and see how we can use this taxonomy, and the Defect Evaluation Grid, to evaluate the return on investment of each tool.

Published: 2024-03-04

Tagged: software quality
