Input, state and their relationship to bugs
Wednesday, November 14th, 2012
Why are there so many programmers who don’t know what state is, much less the impact it has on programming? Recently I was having a discussion online, and some programmers kept misinterpreting everything that I and other programmers said because they had little or no grasp of the concept of state. This is not the first time I’ve noticed that, for such an important concept, a surprising number of programmers don’t understand state and its implications.
The meaning of state
From Wikipedia:
the state of a digital logic circuit or computer program is a technical term for all the stored information, at a given point in time, which the circuit or program has access to. The output of a digital circuit or computer program at any time is completely determined by its current inputs and its state.
Just to be clear, “all the stored information” includes the instruction pointer, CPU registers/caches and so forth. So if at any given point in time your program is in a while loop or conditional branch, that is also part of the state.
“The output of a digital circuit or computer program at any time is completely determined by its current inputs and its state.” Think about that for a moment, because it is important. Everything that happens from that moment on in a computer program is determined by the current inputs and its state. That means that, as long as there is no other input, the output of the program is completely predictable at any time. That’s a lot (or rather, exactly) like the pseudo-random number generator in most programming languages. If you know the seed, you can predict every random number that will be produced. Each number the pseudo-RNG returns is the result of its input and its current state.
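To see just how literal that is, here is a quick sketch (my own, using Python's standard random module) showing that identical starting state plus identical input yields identical output:

    import random

    random.seed(42)         # fix the generator's starting state
    first = [random.random() for _ in range(3)]

    random.seed(42)         # restore the exact same starting state
    second = [random.random() for _ in range(3)]

    assert first == second  # same state, same input: same output, every time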
If programs are that predictable, how come there are still so many bugs in them? Well, for one thing, programs are not that predictable at all.
On the Origin of Bugs
Look at the following example:
    a = 5
    b = 10
    c = a + b
Tell me, how many bugs does the program have? If we assume that the syntax is correct and that each variable can hold an integer number, this program contains zero bugs. Now tell me, how many bugs does this program have?
    a = int(file('in', 'r').readline())
    b = 10
    c = a + b
How many bugs? One? Three? Do you know? Because I sure don't. Here are a few I can come up with off the top of my head:
- The file “in” may not exist
- The file “in” may not be readable
- The file “in” may disappear between opening and reading
- The file “in” may be empty
- The file “in” may not contain something that can be cast to an integer
- The system may not be able to allocate more file handles
- The system may not be able to allocate enough memory to read a line from the file
- …
I could go on, but the point should be clear. We changed only one line in our program, and now there are who knows how many bugs in it. Why is that? Because in our first example, the state of the program could never be influenced from outside the program at any time, since there was no input! (Yes, the memory might get corrupted by a random flipped bit, or the universe might decide to suspend the laws of physics, but again: let's not get pedantic.)
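To make that concrete, here is a sketch of what handling just a few of those failure modes looks like. (This is my own illustration; I'm using the open() builtin rather than the old file() one, and the error messages are made up.)

    import sys

    b = 10
    try:
        with open('in', 'r') as f:  # the file may not exist or be readable
            line = f.readline()     # the file may disappear mid-read
    except IOError as e:
        sys.exit("cannot read 'in': %s" % e)

    try:
        a = int(line)               # the line may be empty or non-numeric
    except ValueError:
        sys.exit("'in' does not contain an integer: %r" % line)

    c = a + b

One line of input turned into a dozen lines of damage control, and every except clause exists only to bring the program back into a predictable state.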
When we write a piece of code, we write that code with certain assumptions. If I write:
    while (i < 10) { /* ... */ }
I'm assuming variable 'i' contains some kind of number, which at the start of the loop is smaller than 10, and which during the loop will not suddenly change into, say, an open file pointer. Now if I'm not a total idiot, and I write my code properly, that last thing will never happen. Given a piece of code, it's generally not difficult to make correct assumptions about that code. But making assumptions about input into that code (in this case the contents of variable 'i'), regardless of where it came from (interactive input, the network, the filesystem, passed in as an argument to a function, whatever) is much more difficult and dangerous.
I'm going to make a sweeping generalization. Given that "the output of a digital circuit or computer program at any time is completely determined by its current inputs and its state", and given that a program's current state is itself determined by its starting state plus all the input it has ever received, we can fold the "and its state" part into the input. So I'm going to claim that:
All bugs in a program are a result of the input into that program.
Of course, like I said, that statement is sweeping, and therefore false. But for now, assume it's true and think about its implications for a moment, while I correct my previous statement.
Predictable state
Of course, not all bugs in a program are a result of input into that program. Sometimes we programmers screw up and make mistakes; this is another source of bugs in our programs. Sometimes we're tired, and we make mistakes in the syntax of our programs. Sometimes we don't pay attention, and a certain library works differently than we thought. Such bugs are easily solved. In fact, many can be detected automatically by compilers or asserts.
Another class of bugs is the one where we make assumptions about the current state of a part of our program, and write new code in accordance with those assumptions. Those assumptions could be correct, resulting in working code, or they may be incorrect, resulting in bugs.
A coworker of mine once had to write some code for an ATM. Since ATMs are reasonably sensitive, he wasn't allowed to dynamically allocate memory. That is, he could not use malloc and was only allowed to use what memory was statically allocated by the compiler. Not allowing dynamically allocated memory gets rid of an entire range of problems, all of which are directly related to unpredictable state. Buffer overflows, out-of-bounds array references, bugs in string copying, reading input... all those problems are greatly mitigated by not having dynamically allocated memory.
You always want the state of your program to be predictable. Input makes the state unpredictable. As soon as we get input into our program, we can only hope that that input adheres to the assumptions the code makes. The current state may be valid (though unpredictable), however we cannot make any claims about future states. Only after we have sufficiently validated and sanitized any input in such a way that it adheres to the code's assumptions can we speak about a predictable state again.
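As a small sketch of that idea (the function and its rules are my own invention, purely illustrative): once a value has passed through a validation routine, every downstream assumption about it is guaranteed to hold.

    def parse_age(raw):
        # Validate and sanitize raw input. Afterwards, the caller may
        # safely assume: age is an int and 0 <= age <= 150.
        try:
            age = int(raw.strip())
        except (ValueError, AttributeError):
            raise ValueError("age must be a whole number, got %r" % (raw,))
        if not 0 <= age <= 150:
            raise ValueError("age out of range: %d" % age)
        return age

    age = parse_age(raw_input("Your age? "))  # raw_input() is Python 2; use input() on 3
    # From here on, the state is predictable again: the assumptions hold.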
This is why unclear code is such a problem. Unclear code makes it hard to reason about code, thus hard to determine whether code might lead to unpredictable states. Basically, we want to make the assumptions of code as transparent as possible.
I hope everything I've said so far makes absolutely basic sense. I hope that at this point in the article you're thinking "Well, duh", because from what I've seen, many programmers don't have a clue about the relation between input, state and bugs.
"So what?"
It is important to grasp this basic concept of state because it helps you to reason about why and when certain programming techniques are a good or bad idea.
- Functions are a good idea because they make it easier for the programmer to reason about input and its effect on state within that function. Easier reasoning about state means better assumptions about code, resulting in fewer bugs.
- Object-oriented programming is a good idea because it groups together data and the logic that operates on that data, and therefore allows the programmer to encapsulate state in such a way that it is harder to disrupt that state with outside influences (see the first sketch after this list).
- Using global variables is a bad idea because they screw up all the assumptions code makes about them. Wherever you use a global variable, you cannot reason about the state of your program, since the value of the global variable can change at any time.
- Allowing publicly settable properties in objects (that includes setter methods, for you Java developers out there) is a bad idea because any assumptions methods in that object make about the state of those properties can be invalidated from outside. Conversely, making the constructor the only part of the object that is allowed to receive input is a good idea (even if it's not always practical).
- Goto is considered evil because it allows arbitrary changes to the state of a program, and can therefore make it incredibly hard for developers to reason about the current state of the program at any point.
- Functional programming is a good idea because it avoids mutable data as much as possible, and therefore makes it hard to bring the program into an undefined state.
- Multi-threaded programs in which threads write to the same memory are a bad idea because the order of writes is non-deterministic, which causes unpredictable state. This can be avoided with locking, which is itself incredibly hard to do properly, and which leads to state that is difficult to reason about (see the second sketch below).
- Smaller functions are a good idea because, again, they make it easier to reason about the state, and the changes to that state, within the function.
- ...
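Two of those points deserve a concrete illustration. First, encapsulation. The following sketch (the class and its invariants are my own invention) takes input only through the constructor, so the object's internal state stays predictable:

    class Account(object):
        def __init__(self, balance):
            # The constructor is the only entry point for input, so it is
            # also the only place that needs to validate that input.
            if not isinstance(balance, int) or balance < 0:
                raise ValueError("balance must be a non-negative integer")
            # No public setter: outside code cannot invalidate our assumptions.
            self._balance = balance

        def withdraw(self, amount):
            # Every method may now safely assume _balance is a valid,
            # non-negative integer, because nothing else can touch it.
            if not isinstance(amount, int) or not 0 < amount <= self._balance:
                raise ValueError("invalid withdrawal: %r" % (amount,))
            self._balance -= amount
            return self._balance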
I could go on, but I hope you get the point. Please note that these points have to be considered in context. Don't take them as absolutes, because that would go entirely against what I'm trying to say. Goto can most certainly be used to create state that is more predictable and easier to reason about; in general, however, it is considered detrimental to predictable state.
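And second, the threading point made concrete. In the sketch below (again mine, purely illustrative), counter += 1 is a read-modify-write: without the lock, two threads can interleave those steps and lose updates, so the final count becomes unpredictable.

    import threading

    counter = 0
    lock = threading.Lock()

    def increment(n):
        global counter
        for _ in range(n):
            with lock:        # remove this lock and updates get lost:
                counter += 1  # the final count becomes unpredictable

    threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter)  # with the lock: always 400000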
Back to input
Let's jump back to input for a moment. The age-old adage goes: "Validate your input!" What does that mean? A quick question: given this function:
    def foo(a, b):
        c = a / b
        return c
How many bugs are in that function? We can immediately spot a potential division-by-zero bug and an overflow bug. There are others, but the point is that you don't know! The input into this function is not validated. We're making assumptions about the calling code which may or may not be true. The key here is that the concept of "input" doesn't just apply to what you read from the keyboard, a file or the network. It applies to every line of code. Programmers who do not realize this are what I like to call Pavlovian Programmers: sure, you can teach them a trick ("goto is bad!"), but they'll never understand the reasoning behind that trick, and they won't be able to apply it to new concepts. This is why programmers are jumping on the event-driven programming bandwagon, despite it being an insanely bad idea. Basically, event-driven programming (the kind that binds anonymous functions as callbacks to events; yes, JavaScript et al.) is a kind of goto, except harder to debug.
Ideally, every function and every constructor would validate its input arguments. (Contracts, anyone?) The more strictly you validate input arguments, the smaller the chance of winding up in an unpredictable state. Of course I say "ideally", because it is not practical to validate each and every bit of input. A decent compromise would be to validate the input of all constructors that are part of library code. Then again, 99% of your code should be library code. But that's a discussion for another time.
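As a sketch of what that looks like for the foo() above (the checks are mine; what counts as "valid" depends on the contract you want to enforce):

    def foo(a, b):
        # Validate the input arguments so the body can rely on its assumptions.
        if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
            raise TypeError("foo() expects numbers, got %r and %r" % (a, b))
        if b == 0:
            raise ValueError("foo() requires a non-zero divisor")
        c = a / b
        return c

The division itself is unchanged; all the added lines do is turn an unpredictable state (who knows what a and b are?) into a predictable one before the real work starts.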
Conclusions
The conclusions of this article are simple:
- Prevent bugs
- Write clear code
I'm not trying to be funny here. The conclusions are the same as what you've already heard a thousand times. I just hope that you now view them in a bit of a different light. I believe we can accomplish the above and create much better code if everybody would just realize a few basic concepts:
- It is vitally important to protect the state at all times.
- Minimize "outside" influences on the state. This includes input from files, the network, other threads that are running, global variables, etc.
- The state of a program is unpredictable whenever input arrives: whenever we read input, whenever a function returns or throws an error, our program is in an unpredictable state. The system should, at all times, move from unpredictable and potentially invalid states to predictable and valid ones. This is accomplished by validating and sanitizing any input into the program.
I leave you with this quote, which I think I once read somewhere else, but cannot find again, so I will attribute to myself:
Programming is the messy art of constantly bringing a program back to a predictable state after receiving a tiny bit of input.