I am currently working on LLM-based BDD-style tests and have been interested in testing for a long time. Here I want to lay some theoretical groundwork, the absolute minimum, to refer back to later when I discuss other testing matters.

NES Test Station & SNES Counter Tester [commons]
All testing of functional requirements in software engineering comes down to one principle: testing a function. Different apparent forms of testing exist merely for practical reasons. Everything else on this topic simply means spelling out the implications.
Testing whether a man-made artefact “works” generally means that there is an expectation of how it should behave under certain circumstances. Testing ensures that a thing behaves according to its specification—that the observed outcome matches the predicted outcome, where the prediction is based on knowledge of the artefact’s internal construction. Testing and specification are two sides of the same coin (i.e. dual to each other). Functional requirements are descriptions of behaviour that needs to be built. Functional requirements tests verify that behaviour.
We limit the analysis to system functionality by ignoring memory footprint or execution speed characteristics, as well as security or usability considerations.
In software engineering, we can easily test so-called ‘pure’ functions. Pure functions take inputs and return corresponding outputs. Testing pure functions means comparing outputs for selected inputs to see whether they match expected outputs. We will refer to pure functions simply as functions, in distinction from functions with ‘side-effects’, which we will call procedures and which we’ll talk about later.
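As a minimal sketch of this idea (in Python rather than the LISP syntax used below; the function and the cases are my own illustration), testing a pure function reduces to a table of input/output pairs:

```python
# A pure function: same input, same output, no side effects.
def square(x):
    return x * x

# The whole test is a table of (input, expected output) pairs.
test_cases = [(0, 0), (1, 1), (4, 16), (-3, 9)]

for given, expected in test_cases:
    assert square(given) == expected
```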
Mathematically speaking, a function maps elements from one set to another set. It is stricter than a relation in that it cannot relate items from two sets arbitrarily. Instead, there is a distinction between a domain set and a co-domain set. A function must map all elements of the domain set into the co-domain set. The subset of elements that are actually mapped onto is the range set. So multiple elements from the domain might map onto the same element, but not every element in the co-domain is guaranteed to be mapped onto. Importantly, a single element from the domain only ever maps onto the same target element.
This means that everything about a function is fully deterministic, which is precisely what makes so-called pure functions so easy to test: as mentioned, we only need to look at inputs and outputs. In mathematics, there exist only pure functions. Additionally, with mathematical functions there is no notion of time.
However, in the realm of computation, a function can only calculate an output by executing a sequence of steps, which taken together can be said to form an algorithm. With algorithms, what might happen is that a given sequence of steps never comes to an end, even if infinite time were available. In computer science, this is known as the ‘halting problem.’ For the present analysis, we’ll ignore this issue. It is the engineer’s task to select algorithms that terminate. My aim here is to say something meaningful about functional requirements testing.
In computer programming, as in mathematics, there exist functions that take a single parameter, like the square function, which maps 4 onto 16. Often in computer programming, however—and this is possible in mathematics as well—we encounter functions taking multiple parameters.
For example, we might encounter (add 1 2) (I will use LISP syntax). From a theoretical standpoint, however, there is no difference between that and saying (add (1 2)). That is, we pass a complex argument instead of two separate ones. We always map one value onto another value. Here we map (1 2) onto 3. And it will always only map onto 3. There is no circumstance in which that input value will ever yield anything else. We are still dealing with a function, after all—only one that happens to take a complex argument.
The complexity of input arguments is arbitrary, but we should explain something. If we take (1 2) to carry ordering semantics, then (1 2) is different from (2 1). If we take it to express sets, those two things are the same. If we consider maps (dictionaries, key-value pairs), we can compare things like {:a 1 :b 2} and {:b 2 :a 1} and say they are the same. We can allow data structures to be nested, i.e., have something like {:a (1 2)}. Given that we specify the necessary equality rules, any such argument can be seen as an element of a domain set, different from all other elements in that set.
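The same equality distinctions can be observed directly in, say, Python (a sketch of mine, standing in for the LISP-style notation above):

```python
# Ordered sequences: order carries meaning.
assert (1, 2) != (2, 1)

# Sets: order does not matter.
assert {1, 2} == {2, 1}

# Maps (dictionaries): key order does not matter.
assert {"a": 1, "b": 2} == {"b": 2, "a": 1}

# Nested structures compare element by element.
assert {"a": (1, 2)} == {"a": (1, 2)}
```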
What constitutes the domain of any particular function is that there are rules for allowed values. A function may specify that it only takes strictly increasing numbers, so it might take (1 2 3) but not (1 2 2 3), the latter of which would not be included in the function’s domain. In programming, should a user attempt to pass such an argument, they should expect a failure. Ideally, the programmer of that function has precise knowledge of the domain and puts a guard in place to signal the error and prevent any further computation steps. If there is no guard in place, the computation may either fail in unexpected ways (a classic example would be ‘division by zero’), or calculate a value for an input that it should not compute according to specification.
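A sketch of such a guard in Python (the function names and the placeholder computation are my own, purely for illustration):

```python
def strictly_increasing(xs):
    # Domain check: every element must be larger than its predecessor.
    return all(a < b for a, b in zip(xs, xs[1:]))

def process(xs):
    # Guard: signal the error before any further computation happens.
    if not strictly_increasing(xs):
        raise ValueError(f"{xs} is outside the function's domain")
    return sum(xs)  # placeholder for the actual computation

process((1, 2, 3))     # inside the domain: fine
# process((1, 2, 2, 3)) would raise ValueError
```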
Now, outside of pure functions—and I think predating them in computer programming—there are procedures, which are sequences of steps that modify some state in the computer’s memory. Here is a Pascal program:
program Example;

var
  count: Integer; { Global variable }

procedure Increment;
begin
  count := count + 1; { Modifies global variable }
end;

begin
  count := 0;
  WriteLn('Before: ', count); { Output: 0 }
  Increment;
  WriteLn('After: ', count); { Output: 1 }
end.
Note that we neither pass arguments to Increment nor return a value from it. Instead, we modify state defined external to the procedure itself. This is what we call a side-effect.
However, we can reinterpret what we are seeing here. We can say that Increment takes count as an argument and returns the new count.
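In Python (my own rendering of the Pascal example), the reinterpretation looks like this:

```python
# Procedure-style: modifies state external to the procedure (a side effect).
count = 0

def increment_proc():
    global count
    count += 1

# Function-style reinterpretation: count goes in, the new count comes out.
def increment(count):
    return count + 1

# Now it is testable by inputs and outputs alone.
assert increment(0) == 1
```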
We can extend this argument to say that any amount of external state can be incorporated into a function call—when passed into the function as a complex argument, exactly in the way I have shown above; this may be many more variables like count, complete databases, or the computer’s entire memory. We call this full state “the world,” and at every step we compute the next state based on the last state of the world with the invocation of a function.[1]
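A toy sketch of this "world-passing" view (the shape of the world map and the function names are assumptions of mine):

```python
# The "world" as one complex argument: a map of all relevant state.
def step(world, key_pressed):
    # Compute the next state of the world from the last one,
    # leaving the old world untouched.
    new_world = dict(world)
    if key_pressed == "+":
        new_world["count"] = world["count"] + 1
    return new_world

world0 = {"count": 0, "log": []}
world1 = step(world0, "+")
assert world1["count"] == 1
assert world0["count"] == 0  # the previous state of the world is preserved
```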
When you see the whole computer as a function, the next computed step is completely determined. Putting the whole computer in a test harness and controlling all its inputs and outputs at the same time is an ideal situation, in which pressing a button on the keyboard yields a specific next state of the world.
However, we don’t construct the whole computer, including its software. What we construct is a system sharing a computer with other processes running on it, and our computer may be connected to other computers. All of this leads to modifications of its state in ways we cannot control.
So, from the perspective of a single procedure that we want to test—in contrast to pure functions—it is not possible to strictly equate what is happening in the Pascal example to a function invocation. For the count may have been changed in the meanwhile by some other process—a phenomenon we call ‘concurrency.’
Now, while we cannot equate functions and procedures, testing the functional requirements of modules of a certain complexity requires us to treat procedures as functions, simply because there is no other way.
And we want to test modules of a “certain complexity” because ultimately we are only interested in the behavior of the system under construction as a whole. And this system is simply the largest-scoped module we have full control over. A module is a grouping of procedures into larger procedures. A system, therefore, is modules “all the way down,” to put it succinctly.
The way we go about this is the same way that physical experiments work: by “holding everything else constant” (ceteris paribus - everything else being equal) and assuming a definite state of affairs. Returning to our example, this means assuming either that the counter was modified in the meanwhile (and to what value), or that it wasn’t. We need to decide this either way for a given test case, in order to relate the inputs we are interested in to the outputs we are interested in. If there are multiple things that could happen (multiple values the counter could have taken on in our example), we need to put different test cases in place which capture these possibilities—where they make a difference. These cases have to be identified by the engineer, in the same way that they need to identify the different relevant cases of inputs which partition the function’s outputs into distinguishable groups (we will come back to the issue of which test cases to write in a follow-up article).
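Sticking with the counter, the "one test case per assumed state of affairs" idea can be sketched like this (Python, my own illustration):

```python
# The procedure under test, converted to a function of the counter's state.
def increment(count):
    return count + 1

# One test case per assumed state of affairs (ceteris paribus):

# Case 1: nothing touched the counter in the meanwhile.
assert increment(0) == 1

# Case 2: some other process set the counter to 41 in the meanwhile.
assert increment(41) == 42
```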
At any rate, to reiterate, for practical reasons, testing needs to focus on the system under construction. And the value of the system lies precisely in its having side effects—be these changing the state of the user interface (showing a different screen), modifying the database state, or writing log files. Holding things constant thus entails cutting off the procedure’s interaction with the “external world” (as seen from the procedure under test) and controlling those interactions directly—thereby making it a function.
We do this in various ways, for example, by setting up a certain database state (which under realistic conditions could have been modified in the meanwhile by other processes, just like our counter could have been modified), or by simulating a certain response for an outgoing network call. Again, all of this serves to convert a procedure to a function—the only thing one can actually test the functionality of in functional requirements tests, as the ‘functional’ adjective interestingly suggests.
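As a sketch of simulating such an outgoing call (the procedure, the stub, and all names here are hypothetical, chosen only to illustrate the conversion):

```python
# A procedure that depends on an external interaction:
# fetch_user_name stands in for a network or database call.
def greet_user(fetch_user_name, user_id):
    return "Hello, " + fetch_user_name(user_id) + "!"

# In the test, we control that interaction directly with a stub,
# thereby turning the procedure into a function of its inputs.
def fake_fetch(user_id):
    return {7: "Ada"}[user_id]

assert greet_user(fake_fetch, 7) == "Hello, Ada!"
```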
A note on concurrency is in order here. A computation consisting of steps takes time—time in which a procedure might modify external state, and pull in external state at various points. We have already said that our counter or database might have been modified in the meanwhile. While we can account for different ways in which such state might have been modified which affect us, we can’t model any possible interdependence between us writing state and how that possibly might affect state we read in a subsequent execution step. Were we to do that, we would immediately extend the scope of the system under test and construction to the external systems and processes concurrently affecting our state—which is a boundless endeavour. Our earlier definition of a system as the biggest module we have full control over was precisely about that.
Therefore we need to disregard any relationship between our writing and reading and treat everything in between as a non-deterministic black box. We limit our interest to what external state we write (if it is not returned as a result) and what external state we ask for (if it is not passed in as an argument). Writing to external state can be modelled in the function context by returning instructions to write that state, instead of actually performing the writes. Pulling in external state can be modelled as additional argument parts of the single argument, in the form: “when asked the first time for x, return 1; when asked the second time for x, return 3,” etc.[2]
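Both halves of this modelling trick can be sketched in a few lines of Python (all names here are mine; the scripted reader mirrors the "first time return 1, second time return 3" idea):

```python
# Reads are scripted: each request for external state is answered
# from a prepared sequence of responses.
def make_scripted_reader(responses):
    it = iter(responses)
    return lambda _key: next(it)

# Writes are returned as instructions instead of being performed.
def sync_counter(read):
    first = read("x")
    second = read("x")
    # Return a description of the write, not the write itself.
    return [("write", "x", first + second)]

read = make_scripted_reader([1, 3])
assert sync_counter(read) == [("write", "x", 4)]
```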
In the ‘Functional Programming’ paradigm, great emphasis is placed on dealing with functions wherever we can and on composing bigger functions from smaller ones. Dealing with functions takes concurrent modifications out of the picture. They are easily testable and easily understandable. In general, it has been pointed out that good testability reflects good architecture. This is certainly true. It also makes everyone’s life easier if what we earlier called “what we are interested in” is explicitly captured in a (pure) function’s parameter signature, in contrast to procedures, which “reach out” into the world external to them. It is simply easier to understand[3], and therefore easier to construct and test.
That said, dealing with state in functional requirements tests is not something we can avoid. Nothing changes the fact that at some point we need to apply our conversion trick when we test something smaller than a whole computer and something bigger than simple functions[4]—which we need to do when we want to test side effects, which is what functional requirements testing of meaningfully large chunks of the systems we construct amounts to.
In conclusion, as far as functional requirements are concerned, our only means to verify whether those are met consists in testing functions. Therefore, when a system under test is a procedure, we need to arrange for it to behave like a function.
Footnotes
1. This isn’t overly complicated, but it’s important enough that I want to credit Programming in Haskell (second edition) by Graham Hutton, which really drove the point home for me. See the opening of Chapter 10 in particular.
2. We can formulate this as data, for example {:when f-called :return [1 3]}.
3. We say that pure functions allow for ‘local reasoning’, which keeps the cognitive load light.
4. Better said, the biggest functions we manage to construct before we necessarily need to incorporate side effects. See the ‘functional core, imperative shell’ pattern.