Kill It With Fire by Marianne Bellotti

Notes

Ch. 1: Time is a Flat Circle

Systems tend to accumulate the conventions of the people who work with them and of the times they’re built in. Sometimes rewrites are necessary because both have changed enough that the old conventions no longer resonate with the current set of users.

Ch. 2: Cannibal Code

"Exposure Effect" Familiarity plays an important role in our tastes. It can also drive how certain decisions are made as what’s familiar also can seem appealing. This is also known as the exposure effect.

But familiarity has bounds beyond which it breeds contempt or frustration. An old system written in Java might be associated with bad experiences and drive a push toward a new system in a new language, even if the problem is not with Java itself.

Ch. 3: Evaluate Your Architecture

Understand why you're modernizing your system. Be ruthless in identifying the parts that need modernizing. Parts of a legacy system may still work as intended without any additional overhead.

Ch. 4: Why Is It Hard?

We overvalue the hindsight existing systems offer. We think we know a lot about our existing system, including what it took to build it, but in reality we ignore the role of luck as well as the workarounds, deadlines, and other constraints faced along the way.

Maintaining a system is a challenge: the team undergoes churn, and decisions get lost or are never documented at all. It’s often luck that keeps systems functioning.

"Overgrowth" Systems tend to build in assumptions due to context that are not often part of conscious decision making. For e.g. use of Linux operating systems will influence use of certain toolchains. As time passes some of these assumptions will be tested and harder to migrate as the business logic are coupled across these layers. Business logic encoded in Stored Procedures are another example of this. Some tips to handle it: - Shifting vertically: have clear boundaries between vertical layers (hardware, OS, application layer) and build abstracts that allow easier portability. - Shifting Horizontally: Understand how portable the protocols are.

Automation such as transpilers and static analyzers (dependency management, etc.) can help, but these tools have limitations that create additional work if not taken into account during migration; their output needs follow-up validation by humans.
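Note: A tiny illustration of the kind of automation meant here and its limits, using Python’s standard ast module (the deprecated function name is hypothetical): the script can find call sites mechanically, but deciding how each one should be rewritten is the follow-up work that still needs a human.

```python
import ast

DEPRECATED = {"legacy_encrypt"}  # hypothetical function being migrated away from

def flag_deprecated_calls(source: str) -> list[int]:
    """Return the line numbers that call a deprecated function.

    The tool only finds call sites; validating and rewriting each
    one is the human follow-up work.
    """
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in DEPRECATED
    ]

print(flag_deprecated_calls("x = legacy_encrypt(data)\ny = 2\n"))  # -> [1]
```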

Ch. 5: Building and protecting momentum

"Advantage of building a new system is that the team is more aware of the unknowns."

Be iterative and incremental in modernizing. Restrict the scope by identifying a subset of features in the original system that target a measurable (and aligned) goal.

Kindle Highlights

 

Economists have a different explanation for adoption rates of new technology. They typically describe it as the contrast between alignable and nonalignable differences. Alignable differences are those for which the consumer has a reference point. For example, this car is faster than that car, or this phone has a better camera than that phone.

 

Most of the time this gradual optimization only creates annoyances that play themselves out over social media and eventually die down. Occasionally, there are enough people who have experienced a loss in utility from the optimization that they themselves become a potential market to be captured. That includes consumers who never bought the product in the first place but would have if it had been optimized in some other way. Leveraging alignable differences is pushing the product further away from what those consumers want to buy, but creating an opportunity for another company to figure out.

Note: The author is talking about the role of alignable differences in the adoption curves of products. For a product to be successful, it must resonate with the needs of a group of users, and this “need” is in effect an improvement over an existing use case.

 

Building a network that would scale to cross a single country was itself a significant engineering challenge. In fact, many national computer network projects were attempted during the same period as the internet. The United Kingdom had one; France had two; the Soviet Union had three failed attempts. The United States ultimately prevailed because it was not trying to build a national network; it was simply trying to solve compatibility issues caused by all the proprietary standards computer manufacturers were pushing.

 

What’s interesting about the internet is that it is the only modern-day communication medium that has been historically flat-rate priced.9 All packets on the internet are billed basically the same way, regardless of what they are or where they are going.10 By contrast, you pay more when you call long-distance versus placing a local call, or you pay more when connecting to a cell network in a foreign country versus your own.

 

Changing technology should be about real value and trade-offs, not faulty assumptions that newer is by default more advanced.

 

Technology is like that. It progresses in cycles, but those cycles occasionally collide, intersect, or conflate. We are constantly borrowing ideas we’ve seen elsewhere either to improve our systems or to give our users a reference point that will make adopting the new technology quicker and easier for them. Truly new systems often cannibalize the interfaces of older systems to create alignable differences.

 

This is why maintaining technology long term is so difficult. Although blindly jumping onto new things for the sake of their newness is dangerous, not keeping up to date is also dangerous. As technology advances, it collects more and more interfaces and patterns. It absorbs them from other fields, and it holds on to historic elements that no longer make sense. It builds assumptions around the most deeply buried characteristics. Keep your systems the way they are for too long, and you get caught trying to migrate decades of assumptions.

 

In psychology, the term for this is the mere-exposure effect. Simply being exposed to a concept makes it easier for the brain to process that concept and, therefore, feels easier to understand for the user.

 

If you want proof that adoption is influenced by shared knowledge among networks of people and not strictly merit, consider this: the organizations that are trying to replace their old COBOL applications today are not migrating them to what would be the first choice for data processing among modern programming languages, which is Python, but to the language that has inherited COBOL’s market of a common language for businesses, which is Java.

 

Fred Brooks coined the term second system syndrome in 1975 to explain the tendency of such full rewrites to produce bloated, inefficient, and often nonfunctioning software. But he attributed such problems not to the rewrites themselves, but to the experience of the architects overseeing the rewrite. The second system in second system syndrome was not the second version of an existing system, it was the second system the architect had produced. Brooks’s feeling was that architects are stricter with their first systems because they have never built software before, but for their second systems, they become overconfident and tack on all kinds of flourishes and features that ultimately overcomplicate things. By their third systems, they have learned their lesson.

 

In all likelihood, you’re dealing with one or more of the following issues: technical debt, poor performance, or instability.

Note: Or political reasons.

 

Another useful exercise to run when dealing with technical debt is to compare the technology available when the system was originally built to the technology we would use for those same requirements today. I employ this technique a lot when dealing with systems written in COBOL. For all that people talk about COBOL dying off, it is good at certain tasks. The problem with most old COBOL systems is that they were designed at a time when COBOL was the only option. If the goal is to get rid of COBOL, I start by sorting which parts of the system are in COBOL because COBOL is good at performing that task, and which parts are in COBOL because there were no other tools available. Once we have that mapping, we start by pulling the latter off into separate services that are written and designed using the technology we would choose for that task today.

 

No changes made to existing systems are free. Changes that improve one characteristic of a system often make something else harder. Teams that are good at legacy modernization know how to identify the trade-offs and negotiate the best possible deal. You have to pick a goal or a characteristic to optimize on and set budgets for all other characteristics so you know how much you’re willing to give up before you start losing value.

 

Large problems are always tackled by breaking them down into smaller problems. Solve enough small problems, and eventually the large problem collapses and can be resolved.

 

The longer the new system takes to get up and running, the longer users and the business side of the organization have to wait for new features. Neglecting business needs breaks trust with engineering, making it more difficult for engineering to secure resources in the future.

 

In poker, people call it resulting. It’s the habit of confusing the quality of the outcome with the quality of the decision. In psychology, people call it a self-serving bias. When things go well, we overestimate the roles of skill and ability and underestimate the role of luck. When things go poorly, on the other hand, it’s all bad luck or external forces.

 

One of the main reasons legacy modernization projects are hard is because people overvalue the hindsight an existing system offers them. They assume that the existing system’s success was a matter of skill and that they discovered all the potential problems and resolved them the best possible way in the process of building it initially. They look at the results and don’t pay any attention to the quality of the decisions or the elements of luck that produced those results.

 

By challenging my team to design a system with the same requirements of our legacy system using only technology available at the time the legacy system was built, we’re forced to recover some context. Many of the “stupid” technical choices from the legacy system seem very different. Once forced to look directly at the context, we realize how innovative some of those systems really were. This gives us a little insight into which decisions were skill and foresight and which were luck.

Note: This is a neat trick to also avoid a bit of the Chesterton's Fence issue.

 

In Moravec’s own words, “It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.”

 

We assume that successful systems solved their core problems well, but we also assume things that just work without any thought or effort are simple when they may in fact bear the complexity of years of iteration we’ve forgotten about.

Note: Similar to the anecdote about Picasso saying it took him a lifetime to learn to draw that simple figure.

 

The older a system is, the more likely the platform on which it runs is itself a dependency. Most modernization projects do not think about the platform this way and, therefore, leave the issue as an unpleasant surprise to be discovered later.

 

You might be tempted to think that modern software development is improving this situation. Cross-compatibility is much better than it used to be, that’s true, but the growth of the platform as a service (PaaS) market for commercial cloud is increasing the options to program for specific platform features. For example, the more you build things with Amazon’s managed services, the more the application will conform to fit Amazon-specific characteristics, and the more overgrowth there will be to contend with if the organization later wants to migrate away.

 

The funny thing about big legacy modernization projects is that technologists suddenly seem drawn to strategies that they know do not work in other contexts. Few modern software engineers would forgo Agile development to spend months planning exactly what an architecture should look like and try to build a complete product all at once. And yet, when asked to modernize an old system, suddenly everyone is breaking things down into sequential phases that are completely dependent on one another.

 

Assuming you fully understand the requirements because an existing system is operational is a critical mistake. One of the advantages of building a new system is that the team is more aware of the unknowns.

 

Measurable problems create clearly articulated goals. Having a goal means you can define what kind of value you expect the project to add and whom that value will benefit most. Will modernization make things faster for customers? Will it improve scaling so you can sign bigger clients? Will it save people’s lives? Or, will it just mean that someone gets to give a conference talk or write an article about switching from technology A to technology B?

 

Good modernization work needs to suppress that impulse to create elegant comprehensive architectures up front. You can have your neat and orderly system, but you won’t get it from designing it that way in the beginning. Instead, you’ll build it through iteration.

 

But how does one identify a good measurable problem? The easiest candidates are ones that reflect the business or mission goals of the organization. If you’re thinking about rearchitecting a system and cannot tie the effort back to some kind of business goal, you probably shouldn’t be doing it at all.

 

the number-one killer of big efforts is not technical failure. It’s loss of momentum. To be successful at those long-term rearchitecting challenges, the team needs to establish a feedback loop that continuously builds on and promotes their track record of success.

 

I usually start meetings by listing the desired outcomes, the outcomes I would be satisfied with, and what’s out of scope for this decision. I may even write this information on a whiteboard or put it in a PowerPoint slide for reference.

 

A quick trick when two capable engineers cannot seem to agree on a decision is to ask yourself what each one is optimizing for with their suggested approach.

 

When there’s a history of failure, that first step has to provide enough value to build the momentum necessary to be successful.

 

For example, if migrating from a monolith to services, you might want to use the new feature to identify the first service to peel off.
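Note: A minimal routing sketch of what peeling off that first service might look like, in the style of a strangler-fig facade (my framing, not the book’s). The service names and URLs are hypothetical.

```python
# The new "invoices" feature is served by the freshly peeled-off service;
# every other path still hits the monolith.
NEW_SERVICE_PREFIXES = ("/invoices",)

def route(path: str) -> str:
    if path.startswith(NEW_SERVICE_PREFIXES):
        return "http://invoices-service.internal" + path
    return "http://legacy-monolith.internal" + path

assert route("/invoices/42") == "http://invoices-service.internal/invoices/42"
assert route("/users/7") == "http://legacy-monolith.internal/users/7"
```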

 

If a project is failing, you need to earn both the trust and respect of the team already at work to course-correct. The best way to do that is by finding a compounding problem and halting its cycle.

 

In general, the level of abstraction your design has should be inversely proportional to the number of untested assumptions you’re making.

 

A silver lining in fixing something that is not broken can be found in treating the fix as an opportunity to experiment with and improve engineering practices.

 

Now, imagine that we started the conversation by telling the team we would give them points for coming up with solutions that used a specific piece of technology.

 

A good rule of thumb is questions that begin with why produce more abstract statements, while questions that begin with how generate answers that are more specific and actionable. Think about how your answer would be different if the follow-up were “What are the best tools for the job?” versus “How do you know these tools are the best for the job?” You might list a bunch of common solutions in the answer to the first question, convinced that they are good because they are popular. You are more likely to describe your various experiences with the tools you actually use when asked the second question.

 

This exercise asks team members to map out how much they can do on their own to move the project toward achieving its goals. What are they empowered to do? What blockers do they foresee, and when do they think they become relevant? How far can they go without approval, and who needs to grant that approval when the time comes?

 

Conway’s observations are more important in the maintaining of existing systems than they are in the building of new systems. Organizations and products both change, but they do not always change at the same pace. Figuring out whether to change the organization or change the design of the technology is just another scaling challenge.

 

It stopped when the organization hired engineering managers who developed a career ladder. By defining what the expectations were for every experience level of engineering and hiring managers who would coach and advocate for their engineers, engineers could earn promotions and opportunities without the need to show off.

Note: Not sure if a ladder with well-defined roles helps unless the roles acknowledge and reward “dirty work” and not just new, innovative stuff.

 

Therefore, when an organization provides no pathway to promotion for software engineers, they are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.

Note: If the pathway doesn’t recognize simplifying / refactoring as promotion worthy then the problem still remains.

 

“It is an article of faith among experienced system designers that given any system design, someone someday will find a better one to do the same job. In other words, it is misleading and incorrect to speak of the design for a specific job, unless this is understood in the context of space, time, knowledge, and technology.”

Note: Melvin Conway in 1968!

 

Our perception of risk cues up another cognitive bias that makes rewrites more appealing than incremental improvements on a working system: whether we are trying to ensure success or avoid failure. When success seems certain, we gravitate toward more conservative, risk-averse solutions. When failure seems more likely, we switch mentalities completely. We go bold, take more risks.

 

If you want to deter crime, increase the perception that the police are effective, and criminals will be caught. If you want to incentivize behavior, pay attention to what behaviors get noticed within an organization.

 

What colleagues pay attention to are the real values of an organization.

 

Given a choice between a monetary incentive and a social one, people will almost always choose the behavior that gets them the social boost.

 

Italian researchers Cristiano Castelfranchi and Rino Falcone have been advancing a general model of trust in which trust degrades over time, regardless of whether any action has been taken to violate that trust.9 People take systems that are too reliable for granted. Under Castelfranchi and Falcone’s model, maintaining trust doesn’t mean establishing a perfect record; it means continuing to rack up observations of resilience. If a piece of technology is so reliable it has been completely forgotten, it is not creating those regular observations. Through no fault of the technology, the user’s trust in it will slowly deteriorate.

Note: Reminds me a bit of another essay. Not sure if it’s “problem detection” https://www.dropbox.com/s/5lze3y03d5t3xrk/Problem_detection.pdf?dl=0

The author talks about risk boundaries and our awareness of our proximity to them: we might not be aware until we are very close to a boundary or have already crossed it, while being far from it causes complacency to set in.

 

Google has repeatedly promoted the notion that when services are overperforming their SLOs, teams are encouraged to create outages to bring the performance level down.10
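Note: Back-of-the-envelope error-budget arithmetic (my illustration, not from the book) showing what “overperforming” an SLO means: a 99.9% availability target leaves a small budget of allowed downtime, and a service that never spends any of it is running above target.

```python
slo = 0.999                                # availability target
window_minutes = 30 * 24 * 60              # 43,200 minutes in a 30-day window
budget_minutes = (1 - slo) * window_minutes
print(f"{budget_minutes:.1f} minutes of allowed downtime")  # -> 43.2 minutes
```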

 

The idea that something is more likely to go wrong only because there’s been a long gap when nothing has gone wrong is a version of the gambler’s fallacy. It’s not the lack of failure that makes a system more likely to fail, it’s the inattention in the maintenance schedule or the failure to test appropriately or other cut corners.

 

How to do something should be the decision of the people actually entrusted to do it. For that reason, the diagnosis-policy-actions approach is too detail oriented to help a team manage up. If the set of actions needs to be changed later, the team might be reluctant to do it and seem inconsistent in front of senior leadership.

 

My favorite way of marking time is bullet journaling. I have a book where every day I write down five things I am going to work on and how long I think they will take. Throughout the day, I check off those tasks as I complete them and jot down little notes with significant details. During slow periods, I often doodle in the margins or decorate pages with stickers I’ve gotten from vendors.

 

Just as humans are terrible judges of probability, we’re also terrible judges of time. What feels like ages might only be a few days. By marking time, we can realign our emotional perception of how things are going. Find some way to record what you worked on and when so the team can easily go back and get a full picture of how far they’ve come.

 

Marking time is more effective the more complete a picture it paints of one specific point in people’s lives. Knowing that such-and-such ticket was closed on a certain day doesn’t necessarily take me back to that moment. Sometimes a date is just a date. When you mark time, do so in a way that evokes memory, that highlights the feeling of distance between that moment and where the team is now.

 

Remember that we tend to think of failure as bad luck and success as skill. We do postmortems on failure because we’re likely to see them as complex scenarios with a variety of contributing factors. We assume that success happens for simple, straightforward reasons. In reality, success is no more or less complex than failure. You should use the same methodology to learn from success that you use to learn from failure.

 

The value of the postmortem is not its level of detail, but the role it plays in knowledge sharing. Postmortems are about helping people not involved in the incident avoid the same mistakes. The best postmortems are also distributed outside the organization to cultivate trust through transparency.

 

If there’s an organization whose success you want to copy, spend a couple weeks interviewing people about their strategy using the postmortem’s key questions. What went well? What could have gone better? Where did you get lucky?

 

Our war room was successful because it shortened the distance those conversations had to travel.

 

Because changes to the system’s usage are hard to anticipate, they are hard to normalize. This is an advantage. When we normalize something, we stop thinking about it, stop factoring it into our decisions, and sometimes even forget it exists.

 

Programs in financial institutions that must calculate out interest payments 20 or 30 years into the future act like early warning detection systems for these types of errors.

 

managing deteriorations comes down to these two practices: If you’re introducing something that will deteriorate, build it to fail gracefully. Shorten the time between upgrades so that people have plenty of practice doing them.
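Note: A small sketch of the “fail gracefully” half, assuming a hypothetical exchange-rate lookup: when the deteriorating dependency breaks, the code degrades to a cached value and logs a warning instead of failing outright.

```python
import logging

logger = logging.getLogger(__name__)

FALLBACK_RATE = 1.0  # last known-good value, refreshed on every success

def fetch_live_rate() -> float:
    raise TimeoutError("upstream rate service unreachable")  # simulated failure

def exchange_rate() -> float:
    try:
        return fetch_live_rate()
    except TimeoutError:
        # Degrade loudly but keep working instead of taking the flow down.
        logger.warning("live rate unavailable; using cached fallback")
        return FALLBACK_RATE

print(exchange_rate())  # -> 1.0, with a warning instead of an outage
```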

 

Automation is more problematic when it obscures or otherwise encourages engineers to forget what the system is actually doing under the hood.

 

The secret to building technology “wrong” but in the correct way is to understand that successful complex systems are made up of stable simple systems. Before your system can be scaled with practical approaches, like load balancers, mesh networks, queues, and other trappings of distributed computing, the simple systems have to be stable and performant. Disaster comes from trying to build complex systems right away and neglecting the foundation with which all system behavior—planned and unexpected—will be determined.

 

Bad software is unmaintained software.

 

People do not mute feedback loops because they do not care. They mute feedback loops because human beings can hold only so much information in their minds at one point.

 

Throughout this book, I have emphasized thinking about modernization projects not in terms of technical correctness but in terms of value add because it re-establishes the most important feedback loop: Is the technology serving the needs of its users?

 

The organizations that accomplish this ultimately understand that the organization’s scale is the upper bound of system complexity. Systems that are more complex than the team responsible for them can maintain are neglected and eventually fall apart.