The value of the methodical approach

It is probably fair to say that we have all made a 'quick change' during code development at some point in our careers. And, at some point, that quick change probably had an immediate, negative effect, somewhere between minor (the project would not compile) and expensive (MOSFETs desoldered themselves and the PCB was destroyed). At least you had the chance to detect the failure immediately!

Worse are the changes that go undetected for so long that your customers discover them: the changes that make it into your release build. Such problems result in outcomes ranging from having to issue a workaround through to a product recall. Or perhaps even the destruction of a space orbiter on a $330 million mission.

The quick, last-minute change before we go home is not a new source of failure restricted to the world of software engineering. In fact, the clever but misguided 'improvement' that causes disaster has been around as long as the discipline of engineering itself. Back in the 17th century, Galileo took the trouble to examine and document various engineering failures so that others might learn from them and avoid the same mistakes.

In his book 'Discorsi e Dimostrazioni Matematiche Intorno a Due Nuove Scienze' (Discourses and Mathematical Demonstrations Relating to Two New Sciences), Galileo recounted several stories of structural failure. One example that fascinated me concerns the temporary dismantling and storage of a marble column so that other construction work could take place in a town. Those responsible for the work were very aware of the complexity of the task that lay before them. The difficulty of dismantling a large stone structure without damaging it had been carefully considered, as had the effects that weather and damp ground could have on the column during the period of storage. As a result, it was decided to set the column upon two wooden beams so that it did not touch the ground during storage.

It would then seem that someone on the team raised a concern about resting the column on just two wooden beams. Might the column not sag under its own weight and crack in the middle? After some thought and discussion, it was decided to slide a third wooden beam under the middle of the column 'just to make certain'…

Several months later, while preparing to raise the column and return it to its original home, the team discovered, to their dismay, a crack in the middle of the column, precisely over the middle beam inserted at the last minute. The beam that was supposed to help guarantee that no damage occurred had done exactly the opposite.

Further analysis showed that one of the two end beams had sunk into the ground over time. The result: the weight of the column rested on just the remaining end beam and the middle beam, leaving the other half of the column unsupported.

In his book covering this topic, Henry Petroski notes that Galileo had identified a common flaw in the design process, namely:

"...starting to analyse a problem in the middle and forgetting to go back to the beginning." [1]

Bearing in mind that these engineers worked mostly from empirical evidence and rules of thumb, without recourse to the simulation, modelling and measurement resources we have today, they didn't do too badly; at least no one was hurt. Yet despite centuries of experience to draw upon and countless standards to ensure safety in systems, safety-related product recalls still occur today, due to design errors in both hardware and software. And most times, it is the human element that is to blame. As David Blockley stated in his final analysis:

"…all error is human error, because it is people who have to decide what to do; it is people who decide how it should be done; and it is people who have to do it." [2]

The point is this: there is enough experience out there to draw upon, and there are plenty of well-founded methodologies to follow, grounded in years, if not decades, of practice. So if you aren't already using one, now is the time to start.

Happy BugHunting!

References:
1 - Petroski, H. (1994). Design Paradigms. Cambridge, England: Cambridge University Press, p. 52.
2 - Petroski, H. (1994). Design Paradigms. Cambridge, England: Cambridge University Press, p. 6.