Code health monitoring

The Epidemic of Architectural Tech Debt in Software Codebases

5 minute read. Updated: May 15, 2020

The pandemic has taught us so much, including a better metaphor than “technical debt”. Even after researching architectural tech debt (ATD) at length, I’m not confident I could explain it to you. Making matters worse, the term inherits the definition of “tech debt”, itself a loaded phrase that lacks a universally accepted definition in the tech world. They say teaching a subject is a good way to learn it, so this is my attempt at explaining ATD.

ATD is the deviation from an ideal software architecture: your software’s architecture is not ideal, and this deviation has real business consequences. These consequences manifest as “symptoms”, which are experienced by software developers, delivery staff and many other tech roles. To extend the medical metaphor, these symptoms indicate an underlying condition, like an epidemic disease, which afflicts the codebase until the deviation is resolved. Solutions to this problem fall into two categories:

  1. Prevention: having a well-defined “ideal” and writing code that matches it, so that ATD doesn’t become a problem
  2. Mitigation: detecting ATD symptoms once they are present, then fixing the code to match the ideal
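To make “deviation from an ideal” a little more concrete, here is a minimal sketch. It assumes, purely for illustration, that the ideal architecture can be written down as a set of allowed module dependencies; the module names and the `find_deviations` helper are hypothetical, and real architectures are rarely this easy to pin down.

```python
from __future__ import annotations

# A dependency is a (from_module, to_module) pair. This representation is an
# assumption made for the sketch, not a standard way of describing ATD.
Dependency = tuple[str, str]


def find_deviations(allowed: set[Dependency], actual: set[Dependency]) -> list[Dependency]:
    """Return dependencies that exist in the code but are not part of the ideal."""
    return sorted(actual - allowed)


# Hypothetical layered design: the UI may call services, services may call the DB.
ideal = {("ui", "service"), ("service", "db")}
# Hypothetical codebase in which the UI also reaches directly into the DB.
codebase = {("ui", "service"), ("service", "db"), ("ui", "db")}

deviations = find_deviations(ideal, codebase)
```

With a check like this, prevention amounts to keeping the list of deviations empty as code is written, while mitigation starts from a list that is already non-empty.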

Even though the financial metaphor of interest and debt is the most popular, I’m now leaning fully into the medical metaphor to explain ATD and code quality. The COVID-19 global pandemic has everyone in the world discussing models. Key decision makers like presidents and prime ministers use these models to make choices that affect entire populations and economies. It’s serious business, and this pandemic is a shared global experience which I can use to help tell this particular story. They say an ounce of prevention is worth a pound of cure, so it’s fitting that the two “codebase disease” solutions above fall into the categories of epidemic prevention and epidemic mitigation.

ATD Prevention

Understanding what the “ideal” software architecture looks like can help developers prevent the problem from appearing in the first place. Code quality practices such as code review can help development teams avoid many of the problems of dealing with tech debt and its symptoms. Unfortunately, life comes at you fast, and the truth on the coding front lines is that writing “ideal” code is not always realistic. Indeed, it’s not even always desirable, as writing ideal code trades development speed for software quality attributes.

For example, taking a few shortcuts to hack together a software prototype is a perfectly valid strategy for a scrappy startup. In this case, speed trumps perfection. By contrast, a big tech company will invest more in developer training and processes to ensure that the code being written aligns more closely with the ideal. These are two ends of a wide spectrum of the speed-quality tradeoff, and one size does not fit all. The decision whether to deviate from the ideal, and to what degree, is made on a case-by-case basis every day by software development teams. For some software stakeholders, it matters more that the code works and is ready to launch on time than how ideal it is.

There are other reasons why relying only on a prevention strategy is an imperfect solution. The developer talent pool available to a business at any given time may not have the level of experience required to write fully ideal code. To compound the problem, the definition of ideal software is usually subjective. There is the concept of the “ideal smile”, but in that case the definition seems to be fairly rigorous and involves actual health and performance characteristics, such as good jaw alignment. Ideal beauty, on the other hand, is very subjective and not necessarily linked to functional characteristics. I mention these examples to demonstrate the difficulty of aiming for an ideal. Even among senior, experienced developers, there can be differences of opinion about what constitutes the ideal. Resolving these differences takes a backseat to other matters. In some cases, the potential conflict and inconvenience of resolving a philosophical difference in ideals is not worth the trouble. The code needs to work, and the feature needs to be launched. Business value needs to be delivered in the short term, while also maintaining long-term quality. This is the dilemma of the modern software engineer.

To make matters worse, effectively modeling a “codebase disease” like ATD requires temporal analysis: time is a factor. Modeling this disease means estimating the severity of the symptoms at different points in the future. Often, it comes down to someone’s best guess, gut feeling, or whatever feels right at the time.
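As a rough illustration of what such a temporal estimate might look like, the sketch below projects symptom severity forward under one simple assumption: that severity compounds by a fixed rate each period, much like interest. The function name, the units of severity, and the growth rate are all hypothetical, and real ATD may grow in a very different way.

```python
from __future__ import annotations


def project_severity(current: float, growth_rate: float, periods: int) -> list[float]:
    """Project symptom severity for each future period.

    Assumes severity compounds by `growth_rate` per period, which is an
    illustrative modeling choice rather than an established property of ATD.
    """
    if periods < 0:
        raise ValueError("periods must be non-negative")
    # Index 0 is today; index t is the estimate t periods from now.
    return [current * (1 + growth_rate) ** t for t in range(periods + 1)]


# Hypothetical example: severity of 10 today, growing 50% per quarter.
forecast = project_severity(current=10.0, growth_rate=0.5, periods=2)
```

Even a crude projection like this makes the guess explicit, so it can be challenged and revised rather than left as a gut feeling.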

ATD Mitigation Response

This category of solution is the opposite of prevention. It involves dealing with the problem after symptoms have been observed, testing has been performed and ATD has been diagnosed. Once the problem has appeared and grown large enough for symptoms to be observable, prevention has failed and the strategy must shift to mitigation.

Prioritization

Like a real epidemic, there are multiple stocks and flows which need to be considered when responding to widespread ATD. Given limited resources, teams must strategically prioritize which items in the overall ATD portfolio to address first. Also like a real epidemic, gathering evidence is sometimes expensive and time-consuming (not everyone should get a test, only those with symptoms!). Not every ATD item should be addressed immediately. Prioritization is thus considered both a key pillar and a key challenge in addressing technical debt.
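One simple way to picture this prioritization is as a ranking problem under a fixed budget. The sketch below is only an illustration: the `AtdItem` fields, the severity-per-cost ranking, and the greedy selection are assumptions made for the example, not an established method for managing an ATD portfolio.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AtdItem:
    name: str
    severity: float  # assumed: estimated ongoing cost of the symptoms
    fix_cost: float  # assumed: estimated effort needed to fix the item


def prioritize(items: list[AtdItem], budget: float) -> list[AtdItem]:
    """Greedily pick items with the best severity-per-cost ratio that fit the budget."""
    if any(item.fix_cost <= 0 for item in items):
        raise ValueError("fix_cost must be positive")
    ranked = sorted(items, key=lambda item: item.severity / item.fix_cost, reverse=True)
    chosen: list[AtdItem] = []
    remaining = budget
    for item in ranked:
        # Skip anything that no longer fits; it stays in the portfolio for later.
        if item.fix_cost <= remaining:
            chosen.append(item)
            remaining -= item.fix_cost
    return chosen


# Hypothetical portfolio with made-up estimates.
portfolio = [
    AtdItem("cyclic-dependency", severity=8.0, fix_cost=4.0),
    AtdItem("god-class", severity=9.0, fix_cost=3.0),
    AtdItem("legacy-framework", severity=2.0, fix_cost=5.0),
]
plan = prioritize(portfolio, budget=7.0)
```

A greedy ratio ranking is not guaranteed to give the best possible plan, but it reflects the point above: some items are deliberately left untreated for now.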

Symptoms of ATD

ATD is like a black hole in physics. You often can’t observe it directly, but you can infer its existence from the effects it has on other systems. These effects, in the epidemic metaphor, are called symptoms.

Causes of ATD

The causes of ATD have been well studied in academic literature, and a full treatment is beyond the scope of this article. For now, let’s say that ATD can be caused either intentionally or unintentionally. Compromising long-term quality for short-term business value is an intentional cause of tech debt. Unintentional causes of tech debt include software entropy and disorder caused by natural forces, such as changing requirements or domain knowledge. The point is, it can happen to the best software teams, and sometimes it’s not anybody’s fault.

This natural origin is another similarity to epidemics, which can emerge from nature, for example from animals.