The Software Development Ceiling

There are two economies that produce software: the Joyous Software Economy and the Commercial Software Economy. The latter has vastly more resources than the former. Surprisingly, though, both economies produce software of roughly equivalent quality. Even within the commercial software economy, smaller startups often produce software that is on par with that produced by some of the largest companies in the world. Why is it that organizations with effectively unlimited resources, able to attract the best software engineers in the world, can't seem to outcompete groups with a fraction of a fraction of their resources? Why is it that people volunteering in their free time can run circles around the most valuable companies in the world?

This doesn't just apply to simple software, either. Database management systems provide an illustrative example. By far the most used database management system in the world is SQLite, which is developed by a relatively small team. Postgres and MySQL can compete with the likes of Oracle and SQL Server. Even databases originally developed largely by a single person, like Ben Johnson's BoltDB, find their way into critical infrastructure, like Kubernetes.

There isn't a single answer as to why this is, but I have a conjecture as to one of the more influential reasons: there is a ceiling to software development, and this ceiling is caused by poor information organization.

The Velocity We Wish To Reach

It doesn't take much time in this industry to notice a recurring theme: people want to rewrite codebases quite often. I don't think I've joined any team and not been told that either the codebase needs to be rewritten or that they are actively in the process of rewriting it. It feels like as soon as we push something into production we want to immediately rewrite it. It's rare for a codebase to last more than a few years before there's pressure to toss the entire thing out and start anew.

But we also know that in a few years we're likely to wind up in the same place. After all, the people who wrote the codebase we now want to rewrite were likely rewriting something themselves, or at least planning to write something that would last. So why do we keep winding up in the same situation? Why can't we seem to do better?

I think most people would say "technical debt". As a codebase is added to, as new features are built and previous features deprecated, there's a feeling of accumulating drag that we've labeled technical debt. Eventually, this debt becomes overwhelming and hinders progress. Each new feature seems to take longer and longer to build; the number of bugs created per feature increases over time.

In the past I claimed that "technical debt" isn't the right phrase and that we should instead call it "malpractice". There's still some truth in that: sometimes the things we call technical debt are really just programmers being lazy and not wanting to do things the correct way. But the claim is incomplete. There are plenty of times when a team agrees to take on some "technical debt" and the reasoning for the decision is sound. Quite often it's due to a lack of the information required to make the "proper" choice, or a lack of time to implement something in the ideal way.

Even with responsible technical debt, we wind up in the same place again and again. No matter how hard we try, no matter how many different processes we attempt to implement to increase the speed of development over time, no matter how much we focus on simplicity and keeping complexity out of our codebases, we still wind up wanting to rewrite them.

The ideal situation would be for velocity to increase over the course of a project, or at the very least to remain constant. Perhaps there is something else leading to the decrease in velocity over time?

Information Precedes Source Code

At its core, the process of developing software is turning information into algorithms and then turning those algorithms into source code. We generate a massive amount of information in this process. Every email that's sent, every instant message that's posted, and every conversation we have generates information. Sometimes that information takes the form of a decision; other times it's just background knowledge that we'll use in later decisions.

Information and knowledge are two separate things. Knowledge is our understanding of something, and information is the way that we communicate that knowledge to someone else. Generally, for information to be useful, it must be recorded.1 Unfortunately, the bulk of information that is generated during the software development process is never recorded, and the little that is recorded is never placed under bibliographic control.2

This actually provides us with our first hint as to why the joyous software economy can compete with the commercial economy so easily. There are four general categories of information generation within software development: emails, instant messages, video conferences, and in-person meetings. The joyous economy favors the first two while the commercial economy favors the latter two, with heavy emphasis placed on in-person meetings.3 If you place these four on a spectrum, then as you move from emails to in-person meetings the following happen:

  1. The volume of communication increases. People take more time to write emails than they do to write instant messages. While people can type at around 80 words per minute on average, they can speak at roughly double that. In-person conversations can be had more readily than video-based conversations.
  2. The density of information per communication decreases. Most of us have been in a meeting that should have been an email.
  3. The resources required to record the communication increase. Cataloging email is largely straightforward. Instant messages are more difficult, as many tools don't make it easy to extract conversations, and the high volume and cross talk mean more care is required. Video recordings are large, and they need to be transcribed before being cataloged. In-person meetings are the most intensive, requiring a stenographer to produce a live transcript or extremely diligent note-taking.

The reality is that while emails and instant messages are sometimes cataloged (perhaps in a wiki page or within a design document), video calls and in-person meetings are almost never recorded. This means that any information generated within them must be recreated whenever it is needed. Over time, this becomes more and more difficult, because people's memories fade and sometimes we misremember things.

Eventually, our ability to recreate the information is lost entirely. Sometimes people leave; sometimes we forget. But that information is almost always load-bearing. The difference between a feature and a bug is whether we intended for that behavior to be present, and that intent is encoded in information. Without that information, it can be difficult to know whether some piece of behavior was intended. Not only that, the decisions we made in the past were based on the information available at the time. Incorporating new information along with the old might lead to a different choice in the present. Without that old information, it's more challenging to change things.

Take, for example, the choice of web framework used to build a software product. We might choose a framework that is a poor fit for the particular product we're building because the team at the time was most familiar with it. If we were to reevaluate the choice at a later time, perhaps the team is now more proficient with a framework that better fits the product's needs. If the information about why the framework was chosen was never recorded, or if the record of that choice was never placed under bibliographic control, then we likely won't come to the right choice in the present. Perhaps someone misremembers why the framework was chosen and thinks it was because it was a good fit. Or perhaps, in the absence of authoritative information, it's decided we're better off sticking with what we have.

In addition to hampering decision making, a lack of recorded information also impairs our ability to improve. Knowing what information was used to make decisions in the past is helpful in determining future decisions. Perhaps, through analysis of this recorded information, we determine that we need to include our operations team earlier in the design process, because the later they are included the more outages we have. We might have a feeling that this is happening, but it's much better if we can make a definitive statement based on information that is readily available and can be analyzed.4

As time goes on, this lack of recorded and controlled information makes it more challenging to make decisions. We can't answer questions about why a piece of code is written in a particular way, or why we decided to implement something in one way instead of another. We don't know if that piece of compact and confusing code was written for performance reasons or because someone was lazy or because someone was being clever.

This is what we perceive as "technical debt". However, debt isn't the right word here; instead, we should be using a word that is distinct from debt but often confused with it.

It's A Deficit, Not A Debt

That word is deficit. In this case, what we have is an "information deficit". As time goes on and our ability to recreate information decreases, we feel that as a lack of something. We want to make a decision, but we lack the information, and therefore the knowledge, to make it confidently. That feeling of technical debt accruing interest is actually the deficit growing as we get worse at recreating information.

Usually the deficit is created because we don't record information at the right point in time. Sometimes we make a decision record, but that's often too late, because what we want in the future is not a record that a decision was made, but everything that led up to it. We don't want to know when code was written; we want to know why it was written. In the moment we make the decision or write the code, the reasoning is obvious. This is because we already possess the knowledge, so there's no need for us to render that knowledge into information. As time goes on, however, that knowledge will be lost, and we'll need information to rebuild it in our own minds. Additionally, people who were not around when the information was generated won't be able to recreate the knowledge, because they never had it in the first place; they'll need the original information to create the knowledge.

Ultimately, this is an upstream failure: we didn't record the information when it was most easily recordable. The best time to record information is when it's actively being used to build knowledge. That is, we should record the information we used to create the knowledge in the first place, instead of trying to take knowledge and export it into information for some unknown entity in the future. More simply: we need to record the conversations we have and place them under bibliographic control. From there we can refer to those records in subsequent information that we create, forming a chain. Summaries are insufficient here; they rely too heavily on people's memories of the original information, which are oftentimes imperfect copies.

This is part of the reason why the joyous economy can compete. With so much of the conversation happening in public and on mailing lists, it is straightforward not just to place those records under bibliographic control, but also for someone new to read the conversations exactly as they happened and gain the necessary knowledge. In a commercial environment built mostly on in-person meetings, the only way to do this is to sit down with someone who was there, or to have another meeting and hope you arrive at the same knowledge.5

So we need to record information as it's generated and then place it under bibliographic control. Is it really that simple? Well, no: we need one more crucial step.

Works Cited & Bibliographies

Remember back in school when you were writing an essay or a research paper? You likely had to go through the process of taking each source you used and putting it into a bibliography or works cited section at the end of the paper. There was usually a particular format you had to use. If you went to school in this century, you likely used some online formatter that made the process much simpler. It's also likely that if you forgot to include your bibliography your teacher would fail you, or worse, you would be accused of plagiarism. I never really thought about why this was so important; I was just annoyed that I had to do it.

While a bibliography is important to ensure that you haven't stolen credit for someone else's ideas, there is another, arguably more important, reason for including one: it allows us to track and trace the creation of information over time. In isolation, a bibliography seems annoying, but someone reading your research paper might be interested in understanding how you arrived at your conclusions, and a peer reviewing your paper might want to confirm your conclusions match your sources. When everyone includes a bibliography, we can trace information through multiple hops, and we can also see the impact of any given work. We can essentially build a graph of information.

This means that if we find a problem with an upstream paper, we can see how that might affect downstream papers simply by doing a search of papers that include the upstream paper in their bibliography.

This is what we are missing in software development. There is no way to look at the end product of something, which is the source code in a source control system, and understand what sources it relies on. There are informal ways of doing this, perhaps by including an issue tracker identifier in a commit message, but it is rare for an issue tracker to contain anything close to a full record of the relevant information. As a friend once remarked to me, we treat our source control systems as fancy backup systems instead of change history systems.

What we want is a version of the bibliography or notes section contained in many books. Usually there is a reference near where a claim is made and we can trace that, through the bibliography, to the original work. From there we can trace further back if we need to. But the only reason this is possible is because we place the sourced material under bibliographic control, which enables us to reliably reference it within our work. And then our work is subsequently placed under bibliographic control so other works can reference it.

What we want is the ability to see a line of code and trace the references all the way back to the information used in the creation of that line. Every feature we write and every improvement we make should be traceable back to the conversations we had and the decisions that enabled it.

Git Commit Bibliography

How do we do this? By including a bibliography in each commit we make: authoritative access points6 to the information we used when writing the code. It is rare that we write code without having first generated some information about what it is we want to write. That could be notes we took while thinking about the code, conversations we had with colleagues, email threads with collaborators, or design documents we wrote. It might also be a standard or specification. Whatever it is, we should be able to cite it when we finally commit the code.
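As a sketch of what this could look like, consider a commit message that carries its bibliography as trailers. The "Cites:" trailer name and the identifiers below are hypothetical, not an existing git convention; all that matters is that each identifier is an authoritative access point to a record under bibliographic control:

    checkout: retry payment capture on gateway timeout

    The gateway intermittently drops capture calls. Retrying with
    backoff was the approach agreed upon in the incident review.

    Cites: email/2024-03-12-payments-retry-thread
    Cites: meeting/2024-03-14-incident-review-transcript
    Cites: design-doc/payments/retry-policy

Because git preserves trailers verbatim, tooling can later parse them without any changes to git itself.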

This also requires that commits be atomic. The popular branch-and-merge-commit methodology makes this messy, but luckily we already have tooling for atomic-commit-style code collaboration.

If we require that each commit cite the information used in creating the code, we can start to reduce the information deficit through backward pressure. As we need to refer to information further upstream, we'll find ways to record it, place it under bibliographic control, and add its own upstream sources. This also provides a starting point for retrospectives: take the commits that were added recently, trace through the information, and determine whether you had adequate access to information. If you didn't, figure out how to change the process so that the information is readily available in the future. If a bug is found, you can trace the information references to determine how the bug manifested.
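As a minimal sketch of how such a requirement might be enforced, assuming the hypothetical "Cites:" trailer above, a commit-msg hook can reject any commit whose message carries no citations:

    #!/usr/bin/env python3
    # Sketch of a commit-msg hook: reject commits with no bibliography.
    # "Cites:" is a hypothetical trailer convention, not a git standard.
    import re
    import sys

    CITATION = re.compile(r"^Cites: \S+", re.MULTILINE)

    def main() -> int:
        # git invokes this hook with the path to the commit message file.
        with open(sys.argv[1], encoding="utf-8") as f:
            message = f.read()
        if CITATION.search(message):
            return 0
        sys.stderr.write(
            "commit rejected: no 'Cites:' trailer found.\n"
            "Cite the recorded information this change was built from, e.g.\n"
            "    Cites: design-doc/payments/retry-policy\n"
        )
        return 1

    if __name__ == "__main__":
        sys.exit(main())

Saved as .git/hooks/commit-msg and made executable, this runs on every commit; in practice a team would distribute it through whatever hook-management tooling it already uses.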

It might seem counterintuitive that this would increase the speed of software development. After all, there's a bunch of work we need to do to record more information and then place it under bibliographic control. But that's because we don't tend to think about just how much time is spent recreating information: the plethora of one-on-ones people schedule to gather knowledge, the time people spend spinning their wheels trying to figure out why something is the way it is, and of course, having to rebuild a piece of software because the information deficit has grown too large. All of these are extremely expensive, and not just in time.

Lifting The Ceiling

Placing information under bibliographic control not only enables us to make more informed decisions and build software more rapidly, it can obviate the need to rebuild a codebase entirely. If we have confidence in why choices were made, then we can have confidence in making changes. Those haunted parts of the codebase that are early indicators of a looming rewrite will never be created. Onboarding new engineers becomes much simpler, as they can walk through the codebase and acquire knowledge by reading the upstream information. In fact, we can build and maintain manuals and, through the citation process, understand when individual sections need updating.
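As a rough illustration, even today's tooling can approximate the downstream search from the papers analogy. Assuming the hypothetical "Cites:" trailers above, finding every commit affected by a change to an upstream record is a single log query:

    #!/usr/bin/env python3
    # Sketch: list every commit whose bibliography cites a given source,
    # the software analog of searching for papers that cite an upstream
    # paper. Assumes the hypothetical "Cites:" trailer convention.
    import subprocess
    import sys

    def commits_citing(source_id: str) -> list[str]:
        # --grep matches commit messages; --fixed-strings keeps the
        # identifier from being interpreted as a regular expression.
        result = subprocess.run(
            ["git", "log", "--fixed-strings",
             f"--grep=Cites: {source_id}", "--format=%H %s"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()

    if __name__ == "__main__":
        for line in commits_citing(sys.argv[1]):
            print(line)

If the record behind, say, a retry policy is revised, this tells us which commits, and through them which code, may need to be revisited.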

Through bibliographic control we can begin to lift the ceiling on software development, raising it to a level where we can at least maintain a consistent velocity over the course of development. This isn't a small task. We currently don't have any of the tooling we'd need to accomplish this. We don't record nearly enough of the information we generate, and even when it is recorded the best we can usually do is toss it into some wiki where it quickly becomes lost.

Thankfully, we aren't alone in this endeavor. There is an entire academic discipline dedicated to precisely this: placing recorded information under bibliographic control. In somewhat of an ironic twist, that academic discipline is the science of our very own industry. In the same way that biology is the science of the biotech industry, information science is the science of the high tech (also known as information technology) industry. What we build in this endeavor will undoubtedly be of use to the greater information science discipline. All we must do is decide that it is worth it to lift this ceiling.


  1. From The Organization of Information, Fourth Edition, page 5: "It seems to us that we can use our knowledge to write a book, but until you read that book, understand it, and integrate it into your own knowledge, it is just information. That is why we believe we organize information—so that others can find it, read or otherwise absorb it, and use it to add to their own reserve of knowledge." ↩︎

  2. Bibliographic control is another way of saying organizing information. To avoid confusion with more informal ways of organizing information, I'll use the term "bibliographic control" throughout this entry. ↩︎

  3. This is largely the justification for return to office. One can view "in office collaboration" as equivalent to "in person meetings". ↩︎

  4. Some might refer to this as "data". ↩︎

  5. Most of us have experienced that meeting we've had four or five times because we keep forgetting what conclusion we came to in the previous meetings. ↩︎

  6. In the domain of bibliographic control, authority control is, roughly, the use of character strings that precisely identify recorded information. These include things like ISBNs or citations that adhere to a specific style. ↩︎