
In this article by Justin Bozonier, the author of the book Test Driven Machine Learning, we will see how to develop complex software (sometimes rooted in randomness) in small, controlled steps. It will also guide you on how to begin developing solutions to machine learning problems using test-driven development (from here on, written as TDD). Mastering TDD is not something this book will achieve. Instead, the book will help you begin your journey and expose you to guiding principles, which you can use to creatively solve challenges as you encounter them.

We will answer the following three questions in this article:

  • What are TDD and behavior-driven development (BDD)?
  • How do we apply these concepts to machine learning, and to making inferences and predictions?
  • How does this work in practice?


After answering these questions, we will be ready to move on to tackling real problems. The book is about applying these concepts to solve machine learning problems. This article is the largest theoretical explanation that we will have, with the remainder of the theory being described by example.

Due to the focus on application, the theory of TDD and BDD will get less attention here than their practice. To read more about the theory and ideals, search the internet for articles written by the following people:

  • Kent Beck—The father of TDD
  • Dan North—The father of BDD
  • Martin Fowler—The father of refactoring; he has also created a large knowledge base on these topics
  • James Shore—One of the authors of The Art of Agile Development; he has a deep theoretical understanding of TDD and explains its practical value quite well

These concepts are incredibly simple and yet can take a lifetime to master. When applied to machine learning, we must find new ways to control and/or measure the random processes inherent in our algorithms. This will come up in this article as well as in later ones. In the next section, we will develop a foundation for TDD and begin to explore its application.

Test-driven development

Kent Beck wrote in his seminal book on the topic that TDD consists of only two specific rules, which are as follows:

  • Don’t write a line of new code unless you first have a failing automated test
  • Eliminate duplication

This, as he noted, fairly quickly leads us to a mantra, really the mantra of TDD: Red, Green, Refactor.

If this is a bit abstract, let me restate it: TDD is a software development process that enables a programmer to write code that specifies the intended behavior before writing any software that actually implements the behavior. The key value of TDD is that at each step of the way, you have working software as well as an itemized set of specifications.

TDD is a software development process that requires the following:

  • Writing code that detects the intended behavioral change.
  • A rapid iteration cycle that produces working software after each iteration.
  • A clear definition of what a bug is: if no test is failing but a problem is found, it is not a bug; it is a new feature.

Another point that Kent makes is that ultimately, this technique is meant to reduce fear in the development process. Each test is a checkpoint along the way to your goal. If you stray too far from the path and wind up in trouble, you can simply delete any tests that shouldn’t apply, and then work your code back to a state where the rest of your tests pass. There’s a lot of trial and error inherent in TDD, and the same applies to machine learning. The software that you design using TDD will also be modular enough to have different components swapped in and out of your pipeline.

You might be thinking that just thinking through test cases is equivalent to TDD. If you are like most people, what you write is different from what you might verbally say, and very different from what you think. By writing the intent of our code before we write the code itself, we apply a pressure to our software design that prevents us from writing “just in case” code; by this I mean code that we write just because we aren’t sure whether there will be a problem. Using TDD, we think of a test case, prove that it isn’t supported currently, and then fix it. If we can’t think of a test case, we don’t add code.

TDD can and does operate at many different levels of the software under development. Tests can be written against functions and methods, entire classes, programs, web services, neural networks, random forests, and whole machine learning pipelines. At each level, the tests are written from the perspective of the prospective client. How does this relate to machine learning? Let’s take a step back and reframe what I just said.

In the context of machine learning, tests can be written against functions, methods, classes, mathematical implementations, and entire machine learning algorithms. TDD can even be used to explore techniques and methods in a very directed and focused manner, much like you might use a REPL (an interactive shell where you can try out snippets of code) or an interactive (I)Python session.

The TDD cycle

The TDD cycle consists of writing a small function in the code that attempts to do something that we haven’t programmed yet. These small test methods have three main sections: the first section is where we set up our objects or test data; the second is where we invoke the code that we’re testing; and the last is where we validate that what happened is what we thought would happen. You will write all sorts of lazy code to get your tests to pass. If you are doing it right, then someone who is watching you should be appalled at your laziness and tiny steps. After the test goes green, you have an opportunity to refactor your code to your heart’s content. In this context, refactor refers to changing how your code is written, but not changing how it behaves.

Let’s examine the three steps of TDD more deeply: Red, Green, and Refactor.

Red

First, create a failing test. Of course, this implies that you know what failure looks like in order to write the test. At the highest level in machine learning, this might be a baseline test, where the baseline is simply “better than random.” It might even be “predicts random things,” or, even simpler, “always predicts the same thing.” Is this terrible? Perhaps, to someone who is enamored with the elegance and artistic beauty of their code. Is it a good place to start, though? Absolutely. A common issue that I have seen in machine learning is spending so much time up front implementing the one true algorithm that hardly anything ever gets done. An algorithm that merely outperforms pure randomness, though, is a useful change that can start making your business money as soon as it’s deployed.
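To make the “baseline first” idea concrete, here is a hedged sketch of what such a starting point can look like. The AlwaysZeroClassifier class and the toy dataset are invented for illustration, not taken from the book:

```python
import random

class AlwaysZeroClassifier:
    """A deliberately naive baseline: it always predicts the same thing."""
    def predict(self, features):
        return 0

def given_an_unbalanced_dataset_when_scored_then_baseline_beats_coin_flips_test():
    # given: a toy dataset where the label 0 occurs about 70% of the time
    random.seed(0)
    dataset = [(i, 0 if random.random() < 0.7 else 1) for i in range(1000)]

    # when: we score the naive baseline and a pure coin-flip guesser
    baseline = AlwaysZeroClassifier()
    baseline_hits = sum(1 for x, y in dataset if baseline.predict(x) == y)
    coin_hits = sum(1 for x, y in dataset if random.randint(0, 1) == y)

    # then: even "always predict the majority class" outperforms randomness
    assert baseline_hits > coin_hits, "Then the baseline should beat chance."
```

Ugly as it is, a baseline like this gives every later model a concrete score to beat.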

Green

After you have established a failing test, you can start working to get it green. If you start with a very high-level test, you may find that it helps to conceptually break that test up into multiple failing tests that address lower-level concerns. I’ll dive deeper into this later in the article, but for now, just know that you want to get your test passing as soon as possible; lie, cheat, and steal to get there. I promise that cheating actually makes your software’s test suite that much stronger. Resist the urge to write the software in an ideal fashion. Just slap something together. You will be able to fix the issues in the next step.
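As a sketch of what “lie, cheat, and steal” can mean in practice (the add function is a stand-in example, not from the book), a hard-coded return value is a perfectly legitimate way to get to green; the next failing test is what forces the real implementation:

```python
# The laziest code that turns the test below green: a hard-coded "lie".
# A second test case, say one expecting add(1, 2) == 3, would force a
# real implementation.
def add(a, b):
    return 4

def given_two_and_two_when_asked_to_add_test():
    assert add(2, 2) == 4, "Then it should return 4."
```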

Refactor

You got your test to pass through all manner of hackery. Now, you get to refactor your code. Note that the term is not to be interpreted loosely: refactoring specifically means changing your software without affecting its behavior. If you add if clauses or any other special handling, you are no longer refactoring; you are writing software without tests. One way you will know for sure that you are no longer refactoring is that you’ve broken previously passing tests. If this happens, back up your changes until your tests pass again. It may not be obvious, but passing tests alone are not all it takes to know that you haven’t changed behavior. Read Refactoring: Improving the Design of Existing Code by Martin Fowler to understand how much care refactoring really deserves. Through his illustrations in that book, refactoring code becomes a set of forms and movements, not unlike karate katas.
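As a small, hedged illustration of the distinction (guess_before and guess_after are invented names): the two functions below behave identically, so moving from the first to the second is refactoring. Adding a new branch that changed any output would not be.

```python
# Before refactoring: the behavior we want to preserve.
def guess_before(history):
    if len(history) == 0:
        return None
    return sum(history) / len(history)

# After refactoring: a cleaner expression of the very same behavior.
def guess_after(history):
    if not history:
        return None
    return sum(history) / len(history)

# Previously passing tests are what certify that the refactoring was safe:
assert guess_before([]) is None and guess_after([]) is None
assert guess_before([1, 2, 3]) == guess_after([1, 2, 3]) == 2.0
```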

This is a lot of general theory, but what does a test actually look like? How does this process flow in a real problem?

Behavior-driven development

BDD is the addition of business concerns to the technical concerns more typical of TDD. This came about as people became more experienced with TDD. They started noticing some patterns in the challenges that they were facing. One especially influential person, Dan North, proposed some specific language and structure to ease some of these issues. Some issues he noticed were the following:

  • People had a hard time understanding what they should test next.
  • Deciding what to name a test could be difficult.
  • How much to test in a single test always seemed arbitrary.

Now that we have some context, we can define what exactly BDD is. Simply put, it’s about writing our tests in such a way that they tell us the kind of behavioral change they effect. A good litmus test is asking yourself whether the test you are writing would be worth explaining to a business stakeholder. How this solves the issues listed previously may not be completely obvious, so it may help to illustrate what this looks like in practice. It follows a structure of given, when, then. Committing to this style completely can require specific frameworks or a lot of testing ceremony, so I follow it loosely in my tests, as you will see soon. Here’s a concrete example of a test description written in this style: “Given an empty dataset, when the classifier is trained, it should throw an invalid operation exception.”

This sentence probably seems like a small enough unit of work to tackle, but notice that it’s also a piece of work that any business user, who is familiar with the domain that you’re working in, would understand and have an opinion on.
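As a hedged sketch, that sentence might translate into a test like the following. The Classifier class and InvalidOperationError are hypothetical names invented here, not the book’s actual API:

```python
class InvalidOperationError(Exception):
    """Raised when a classifier is asked to do something nonsensical."""
    pass

class Classifier:
    def train(self, dataset):
        if not dataset:
            raise InvalidOperationError("Cannot train on an empty dataset.")
        # ... actual training would happen here ...

def given_an_empty_dataset_when_the_classifier_is_trained_test():
    # given
    classifier = Classifier()
    empty_dataset = []
    # when / then
    try:
        classifier.train(empty_dataset)
        assert False, "Then it should throw an invalid operation exception."
    except InvalidOperationError:
        pass
```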

You can read more about Dan North’s point of view in this article on his website at dannorth.net/introducing-bdd/.

BDD adherents tend to use specialized tools to make the language and test result reports as accessible to business stakeholders as possible. In my experience, and from my discussions with others, this extra elegance is typically used so little that it doesn’t seem worthwhile. The approach you will learn in the book puts simplicity first, to make it as easy as possible for someone with zero background to get up to speed.

With this in mind, let’s work through an example.

Our first test

Let’s start with an example of what a test looks like in Python. We will use the nose testing library. The main reason for using it is that, while installing a library is a bit of a pain, this one in particular will make everything that we do much simpler. The default unit test solution in Python requires a heavier setup. On top of this, by using nose, we can always mix in tests that use the built-in solution where we find that we need the extra features.

First, install it like this:

pip install nose

If you have never used pip before, know that it is a very simple way to install new Python libraries.

Now, as a hello-world-style example, let’s pretend that we’re building a class that will guess a number, using previous guesses to inform it. This is the simplest example to get us writing some code. We will use the TDD cycle that we discussed previously, and write our first test in painstaking detail. After we get through our first test and have something concrete to discuss, we will talk about the anatomy of the test that we wrote.

First, we must write a failing test. The simplest failing test that I can think of is the following:

def given_no_information_when_asked_to_guess_test():
    number_guesser = NumberGuesser()
    result = number_guesser.guess()
    assert result is None, "Then it should provide no result."

The context for the assert is in the test name. Reading the test name and then the assert message should do a pretty good job of describing what is being tested. Notice that in my test, I instantiate a NumberGuesser object. You’re not missing any steps; this class doesn’t exist yet. This seems roughly like how I’d want to use it, so it’s a great place to start. Since it doesn’t exist, wouldn’t you expect this test to fail? Let’s test that hypothesis.

To run the test, first make sure your test file is saved so that it ends in _tests.py. From the directory with the previous code, just run the following:

nosetests

When I do this, I get the following result:

There’s a lot going on here, but the most informative part is near the end. The message says that NumberGuesser does not exist yet, which is exactly what I expected, since we haven’t actually written the code. Throughout the book, we’ll reduce the detail of the stack traces that we show. For now, we’ll keep things detailed to make sure that we’re on the same page. At this point, we’re in a red state in the TDD cycle. Use the following steps to create our first successful test:

  1. Now, create the following class in a file named NumberGuesser.py:
    class NumberGuesser:
        """Guesses numbers based on the history of your input"""
  2. Import the new class at the top of the test file with a simple import NumberGuesser statement.
  3. I rerun nosetests, and get the following:
    TypeError: 'module' object is not callable

    Oh whoops! I guess that’s not the right way to import the class. This is another very tiny step, but what is important is that we are making forward progress through constant communication with our tests. We are going through extreme detail because I can’t stress this point enough; bear with me for the time being.

  4. Change the import statement to the following:
    from NumberGuesser import NumberGuesser
  5. Rerun nosetests and you will see the following:
    AttributeError: NumberGuesser instance has no attribute 'guess'
  6. The error message has changed, and is leading to the next thing that needs to be changed. From here, just implement what we think we need for the test to pass:
    class NumberGuesser:
        """Guesses numbers based on the history of your input"""
        def guess(self):
            return None
  7. On rerunning the nosetests, we’ll get the following result:

That’s it! Our first successful test! Some of these steps may seem so tiny as to not be worthwhile. Indeed, over time, you may decide that you prefer to work at a different level of detail. For the sake of argument, we’ll be keeping our steps pretty small, if only to illustrate just how much TDD keeps us on track and guides us on what to do next. We all know how to write code in very large, uncontrolled steps. Learning to code surgically requires intentional practice, and is worth doing explicitly. Let’s take a step back and look at what this first round of testing took.

Anatomy of a test

Starting from a higher level, notice how I had a dialog with Python. I just wrote the test, and Python complained that the class that I was testing didn’t exist. Next, I created the class, but then Python complained that I didn’t import it correctly. So then, I imported it correctly, and Python complained that my guess method didn’t exist. In response, I implemented the method the way that my test expected, and Python stopped complaining.

This is the spirit of TDD. You have a conversation between you and your system. You can work in steps as small or as large as you’re comfortable with. What I did previously could’ve been entirely skipped over, and the Python class could have been written and imported correctly the first time. But the longer you go without talking to the system, the more likely you are to stray from the path to getting things working as simply as possible.

Let’s zoom in a little deeper and dissect this simple test to see what makes it tick. Here is the same test, but I’ve commented it and broken it into sections that you will see recurring in every test that you write:

def given_no_information_when_asked_to_guess_test():
    # given
    number_guesser = NumberGuesser()
    # when
    guessed_number = number_guesser.guess()
    # then
    assert guessed_number is None, 'there should be no guess.'

Given

This section sets up the context for the test. In the previous test, you may have noticed that I didn’t provide any prior information to the object. In many of our machine learning tests, this will be the most complex portion of the test. We will be importing certain sets of data, sometimes introducing a few specific issues into the data, and testing that our software handles the details the way we would expect. When you think about this section of your tests, try to frame it as “Given this scenario…” In our test, we might say “Given no prior information for NumberGuesser…”

When

This should be one of the simplest aspects of our test. Once you’ve set up the context, there should be a simple action that triggers the behavior that you want to test. When you think about this section of your tests, try to frame it as “When this happens…” In our test, we might say “When NumberGuesser guesses a number…”

Then

This section of our test checks the state of our variables and any returned result, if applicable. Again, this section should be fairly straightforward, as there should be only a single action that causes a change in your object under test. The reason for this is that if it takes two actions to form a test, then it is very likely that we will want to combine the two into a single action that we can describe in terms that are meaningful in our domain. A key example may be loading the training data from a file and training a classifier. If we find ourselves doing this a lot, then why not just create a method that loads the data and trains the classifier in one step?
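For instance, here is a hedged sketch of folding “load the data” and “train” into one domain-meaningful step. The Classifier class and its train_from_csv method are invented for illustration:

```python
import csv
import io

class Classifier:
    def __init__(self):
        self.examples = []

    def train(self, examples):
        self.examples = list(examples)

    # One domain-meaningful action instead of two mechanical ones,
    # so tests can stay at a single "when" step.
    def train_from_csv(self, file_like):
        reader = csv.reader(file_like)
        self.train((row[:-1], row[-1]) for row in reader)

def given_a_csv_of_examples_when_trained_from_a_file_test():
    # given: an in-memory stand-in for a real training file
    data = io.StringIO("1.0,2.0,spam\n3.0,4.0,ham\n")
    classifier = Classifier()
    # when
    classifier.train_from_csv(data)
    # then
    assert len(classifier.examples) == 2, "Then both examples are loaded."
```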

In the book, you will find examples where we’ll have helper functions that help us determine whether our results have changed in certain ways. Typically, we should view these helper functions as code smells. Remember that our tests are the first applications of our software. Anything that we have to build in addition to our code, in order to understand the results, is something that we should probably (there are exceptions to every rule) just include in the code we are testing.

Given, When, Then is not a strict requirement of TDD; our earlier definition of TDD consisted of only two things (all that it requires is a failing test first and the elimination of duplication).

It’s a small thing to be passionate about, and if it doesn’t speak to you, just translate it back into Arrange, Act, Assert in your head. At the very least, consider the structure, as well as why these specific, very deliberate words are used.

Applied to machine learning

At this point, you may be wondering how TDD will be used in machine learning, and whether we use it on regression or classification problems. In every machine learning algorithm, there exists a way to quantify the quality of what you’re doing. In linear regression, it’s your adjusted R² value; in classification problems, it’s an ROC curve (and the area beneath it), a confusion matrix, and more. All of these are testable quantities. Of course, none of these quantities comes with a built-in way of saying that the algorithm is good enough.

We can get around this by starting our work on every problem by first building a completely naïve and ignorant algorithm. The scores that we get for it will basically represent plain old random chance. Once we have built an algorithm that can beat our random-chance scores, we just start iterating, attempting to beat the next highest score that we achieve. Benchmarking algorithms is an entire field in its own right that can be delved into more deeply.
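A hedged sketch of what “pit the model against random chance” can look like as a test (the accuracy helper and both toy models are invented here, not taken from the book):

```python
import random

def accuracy(model, dataset):
    """The fraction of examples that the model labels correctly."""
    return sum(1 for x, y in dataset if model(x) == y) / len(dataset)

def given_a_candidate_model_when_scored_then_it_beats_the_naive_score_test():
    # given: a toy dataset whose label is simply "is the feature positive?"
    random.seed(1)
    dataset = [(x, int(x > 0)) for x in
               (random.uniform(-1, 1) for _ in range(500))]
    naive_model = lambda x: 0          # ignores its input entirely
    candidate_model = lambda x: int(x > 0)

    # when: both models are scored on the same data
    naive_score = accuracy(naive_model, dataset)
    candidate_score = accuracy(candidate_model, dataset)

    # then: the score to beat is whatever the naive model achieved
    assert candidate_score > naive_score, "Then it should beat the baseline."
```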

In the book, we will implement a naïve algorithm to get a random chance score, and we will build up a small test suite that we can then use to pit this model against another. This will allow us to have a conversation with our machine learning models in the same manner as we had with Python earlier.

For a professional machine learning developer, it’s quite likely that an ideal metric to test is a profitability model that compares risk (monetary exposure) to expected value (profit). This can help us keep a balanced view of how much error and what kind of error we can tolerate. In machine learning, we will never have a perfect model, and we can search for the rest of our lives for the best model. By finding a way to work your financial assumptions into the model, we will have an improved ability to decide between the competing models.
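A minimal sketch of what such a financial comparison could look like, with all the numbers invented purely for illustration:

```python
def expected_profit(hit_rate, profit_per_hit, loss_per_miss, n_decisions):
    """Expected value of deploying a model: expected wins minus losses."""
    hits = hit_rate * n_decisions
    misses = (1.0 - hit_rate) * n_decisions
    return hits * profit_per_hit - misses * loss_per_miss

# Model B is more accurate, but suppose it can only score half as many
# decisions per day; the financial framing can still favor model A.
model_a = expected_profit(0.70, 10.0, 5.0, 1000)
model_b = expected_profit(0.80, 10.0, 5.0, 500)
assert model_a > model_b, "Then the financially better model wins."
```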

Summary

In this article, you were introduced to TDD as well as BDD. With these concepts introduced, you have a basic foundation with which to approach machine learning. We saw that specifying behavior in the form of sentences makes for an easier-to-read set of specifications for your software.

Building off of that foundation, we started to delve into testing at a higher level. We did this by establishing concepts that we can use to quantify classifiers: the ROC curve and the AUC metric. Now that we’ve seen that different models can be quantified, it follows that they can be compared.

Putting all of this together, we have everything we need to explore machine learning with a test-driven methodology.
