Writing tests for our code and applications is critical. It gives us a repeatable, reliable way to ensure the code we write does what it should do and doesn’t do what it shouldn’t. Often, code coverage is used as a way to measure how much of a codebase is tested, but it only tells part of the story. Rather than chasing the fabled 100% code coverage metric, we should be concentrating on building valuable tests.

What Is Code Coverage?

Code coverage in unit tests is a measurement of which lines of code are executed during a test run. If you’ve got Xdebug installed and run your test suite, you have the option of outputting code coverage data in a number of formats, including XML, HTML, or even PHP arrays, with each line of each file listed along with whether or not it was executed. In the HTML report, if a line was run, it is colored green; if it was not executed, it has a red/pink background. If your tests run every line of code in your application, you’ll see 100% code coverage. The coverage report will give you numbers and percentages by file, class, and line.

There’s a very important distinction, however, between ensuring that 100% of the lines of code are executed and ensuring that 100% of the lines of code are tested. Relatively speaking, it’s “easy” to get to 100% code coverage, but it’s very difficult to ensure 100% of the lines of code are tested. In this article, we’ll be talking about ensuring that the lines of code that are covered are covered because they’ve been effectively tested.

What Makes a Valuable Test?

In my opinion, there are a couple of aspects that make tests valuable. If the test helps ensure that the code does what it should do and doesn’t do what it shouldn’t, then it is valuable. If the test provides a safety net that ensures refactoring doesn’t break things, then it’s valuable. The benefits we’re looking for, then, are twofold: changes to our code that don’t break things should be easier to make, since the tests will tell us if we’ve goofed, and changes that do break things should cause test failures, so we know when important behavior we rely on is no longer happening.

Unfortunately, not all tests are valuable. Some tests make it hard to change code for no good reason. Typically these are tests that get too deep into the implementation of the code under test. Ideally, most of our tests will send input into a method and ensure the output or return value is what is expected. For simple “pure function” tests, this isn’t too hard. But in the real world, we’re going to be testing things that have side effects or dependencies, and walking the line between providing valuable assurances and tying ourselves to a specific implementation can be very difficult without making the test fragile or stripping out the assertions that cause the test to actually test something.

Another aspect of a valuable test is consistency: if you run the test against the same code over and over, the result should always be the same. If the test fails, it should fail every time it is run until we either change the code or change the test. Tests that inconsistently pass or fail on subsequent runs are less valuable because they introduce doubt into the process of running the tests, and where there is doubt, developers learn to mistrust the tests or ignore build failures. This isn’t to say that all your tests should necessarily use static input. I’m a big fan of using appropriate random inputs in some tests. I know not everyone agrees, so let’s take a look at some examples and reasons.

Random Input For Tests

When following the standard TDD approach to building software, the normal loop is to write a test, run it, see it fail, then write the minimum code needed to make all the tests pass. What that means is that once you’ve been writing code like this for a bit, you end up with a first test that is trivial and code that is equally trivial. For instance, take the following snippet from a PHPUnit test:

public function testAddCanAddTwoAndTwo()
{
    $this->assertEquals(4, $this->calculator->add(2, 2));
}

If we’re building a simple calculator using TDD, and this is our first test, then the simplest code that makes it work is something like this:

class Calculator
{
    public function add($x, $y)
    {
        return 4;
    }
}

At this point, we have a calculator that returns 100% correct results for adding any two numbers, as long as the two values we add total four. It fails pretty badly on every other set of inputs. In order to be able to add values besides those that total four, we add another test:

public function testAddCanAddTwoAndThree()
{
    $this->assertEquals(5, $this->calculator->add(3, 2));
}

With this test in place, the simple return 4; no longer suffices. It still doesn’t guarantee that the code will be correct, though. In fact, the following code passes both tests but is an even worse calculator adding function:

public function add($x, $y)
{
    if ($x == 3) {
        return 5;
    }
    return 4;
}

Now, clearly, going from what we had to this is ridiculous, but it illustrates why I do like to use random inputs on occasion, when I know the inputs will fall within an acceptable range. Let’s take a look at the test with random inputs:

public function testTheCalculatorCanAddInputs()
{
    $x = rand(1, 10000000);
    $y = rand(1, 10000000);
    $this->assertEquals($x + $y, $this->calculator->add($x, $y));
}

With this code in place, in all likelihood our calculator’s add method will end up being return $x + $y;. If we know the inputs are going to be integers or numeric values, then this method is likely all we’ll need. Of course, if we can’t be sure of the incoming values, then we’ll need more tests to ensure the behavior is as expected. Perhaps the calculator needs to throw an exception if it is passed non-numeric strings, objects, or arrays. Or maybe we need to ensure some expected behavior if the two values add up to something larger than PHP_INT_MAX. Each of these situations would require additional tests.
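As a sketch of what one of those additional tests might drive out, suppose we decide non-numeric input should raise an InvalidArgumentException. The guard clause below is one illustrative way to implement that decision, not the only one:

```php
<?php

class Calculator
{
    /**
     * Adds two numeric values, rejecting anything that is not numeric.
     * Throwing on bad input is one of the behaviors extra tests could pin down.
     */
    public function add($x, $y)
    {
        if (!is_numeric($x) || !is_numeric($y)) {
            throw new InvalidArgumentException('add() requires numeric arguments');
        }
        return $x + $y;
    }
}
```

A matching PHPUnit test would call $this->expectException(InvalidArgumentException::class) (or setExpectedException() in older versions) before invoking add() with a bad argument.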

Now clearly this add method is trivial and silly, and we’re really duplicating all the work the method does right there in the test. But randomly generated values can also be a good way to ensure that your parameters are being used appropriately. If you’ve built a class that performs a SQL query, perhaps one of your incoming parameters is an id. By randomizing that id and then asserting that the mock database object sees that same random value, you’ve come pretty close to guaranteeing that the incoming id is being sent into the query.
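Here’s a sketch of that idea using a hand-rolled spy in place of a PHPUnit mock, so it stands alone. All of the names here (QuerySpy, UserFinder, fetchById) are illustrative, not from any real library:

```php
<?php

// A hand-rolled spy standing in for a mocked database object.
class QuerySpy
{
    public $lastId = null;

    public function fetchById($id)
    {
        $this->lastId = $id; // record what the code under test passed in
        return ['id' => $id];
    }
}

// The code under test: it should forward the id, untouched, to the database.
class UserFinder
{
    private $db;

    public function __construct($db)
    {
        $this->db = $db;
    }

    public function find($id)
    {
        return $this->db->fetchById($id);
    }
}

// The test: a random id, asserted to arrive at the database layer unchanged.
$spy = new QuerySpy();
$finder = new UserFinder($spy);
$id = rand(1, 1000000);
$finder->find($id);
assert($spy->lastId === $id);
```

With a real PHPUnit mock, the same assertion is expressed through expects() and with(), but the principle is identical: the random id makes it vanishingly unlikely that the query would “accidentally” see the right value.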

Let’s explore another situation with a more limited input scope for what would be considered valid and invalid. Suppose we want to build a validator that ensures some incoming value is a string containing between 12 and 16 digits. To test it thoroughly, we’d want to verify that the validator returns an invalid response for inputs that are empty or up to eleven characters long, as well as for inputs of 17 characters or more. Next, we want to ensure that inputs that are between 12 and 16 characters long but are not made up entirely of digits are also invalid.
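For reference, such a validator can be sketched in a few lines; the class name here is my own, chosen for illustration:

```php
<?php

// A sketch of the validator described above: valid means a string
// of 12 to 16 digits and nothing else.
class DigitStringValidator
{
    public function isValid($value)
    {
        return is_string($value)
            && preg_match('/^[0-9]{12,16}$/', $value) === 1;
    }
}
```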

The following test is an example of one with random inputs that will pass most of the time, but not always, and unfortunately it closely mirrors some real tests I have seen in real project code. These are the types of tests we want to avoid because they cannot be trusted.

public function testStringsWithLettersAreInvalid()
{
    $length = rand(12, 16);
    $string = substr(md5(uniqid()), 0, $length);
    
    $this->assertFalse($this->validator->isValid($string));
}

On the surface, it looks decent. We’re testing a string within the valid length range of 12 to 16 characters, and we’re building the random string from the result of md5 on a unique id. The output of md5 is a 32 character string made up of lowercase hexadecimal digits (essentially, 0-9 and a-f). Most of the time, this test will work. However, a problem happens because sometimes md5 will return a hash that is entirely digits, or, even more commonly, a hash whose first 12-16 characters consist of just digits. Whenever this happens, the test will fail, because the resulting string will be counted as valid when the test expects an invalid response. The test always expects the generated value to be invalid, but the code that generates the input will sometimes produce a valid one, crossing the boundary between invalid and valid. Tests whose random values do not always fall within the correct or expected domain are bad and should be avoided.
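One way to keep the generated input inside the invalid domain is to force at least one letter into the string, so it can never be all digits. Here’s a sketch of that fix (the helper name is mine):

```php
<?php

// Generate a random 12-16 character string that is guaranteed to be
// invalid for a digits-only validator, by forcing in at least one letter.
function makeInvalidDigitString()
{
    $length = rand(12, 16);
    $string = substr(md5(uniqid()), 0, $length);

    // Overwrite one random position with a hex letter so the string
    // can never consist entirely of digits.
    $letters = 'abcdef';
    $position = rand(0, $length - 1);
    $string[$position] = $letters[rand(0, strlen($letters) - 1)];

    return $string;
}
```

The test can now use makeInvalidDigitString() and assert an invalid result with confidence: every possible output truly belongs to the invalid domain.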

Ensuring Tests Are Effective

At this point, I’m ready to start talking about the tool and concept that I hinted at last month. The concept is known as mutation testing. If our tests are supposed to let us know when the code is broken, then making certain small logic changes to the code should result in a test failure. If it doesn’t, then we can be reasonably certain that our test suite doesn’t adequately test some parts of our code.

Let’s take a look at a simple setter/getter and a test.

public function setFoo($foo)
{
    $this->foo = $foo;
    return $this;
}

public function getFoo()
{
    return $this->foo;
}

The test for this could look like this:

public function testClassCanStoreAFooValue()
{
    $value = uniqid('foo_value_');
    $this->thingToTest->setFoo($value);
    $this->assertEquals($value, $this->thingToTest->getFoo());
}

If we were to run this, we’d see that we’ve managed to get 100% code coverage on the setFoo and getFoo methods. However, not all the lines of code are actually tested. Nothing in that test ensures that the setFoo method provides a fluent interface, and there may be code that relies on it. If we removed the return $this; line from the setter, the test would continue to pass.
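To see why that matters, imagine calling code (hypothetical, for illustration) that chains off the setter:

```php
<?php

class Thing
{
    private $foo;

    public function setFoo($foo)
    {
        $this->foo = $foo;
        return $this; // the line our test never actually verifies
    }

    public function getFoo()
    {
        return $this->foo;
    }
}

// Calling code that relies on the fluent interface. If setFoo() were
// changed to return null, this chain would blow up with a fatal error,
// yet the set/get test alone would still pass.
$value = (new Thing())->setFoo('bar')->getFoo();
```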

Enter Humbug

In late 2014, Pádraic Brady made the first commits to his mutation testing tool known as Humbug. The tool inspects the tests and the code under test and determines a number of changes that can be made. For instance, Humbug will find the return $this; line and change it to return null;. It will then run the unit tests, and if nothing fails, it counts that run as an ‘escaped mutant’. If the tests fail, then the mutant has been killed and the test was effective.

Installation instructions for Humbug can be found on the Humbug GitHub page. In order to execute Humbug, you must first have a working (read: passing) unit test suite. Humbug will execute it, figuring out what code is covered. It will then run through and determine what mutations are possible. Then, one at a time, it will apply each mutation to the covered lines of code and run the test suite. If running the test suite results in a test failure, a fatal error, or a timeout, it considers the mutation to have been detected by the test suite. If the test suite passes, then it considers the mutant as having escaped.

Let’s take a look at how it works. First of all, you need a project with some sort of PHPUnit test suite. After you install Humbug, you’ll need to generate a configuration file for it. This is done with humbug configure, which will prompt you for a few values: the source directories you want to include, any you want to exclude, how long Humbug should wait before declaring that a test timed out, and where to store the logs, both text and JSON.
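The answers end up in a humbug.json file in your project root. A minimal one might look something like the following; treat the exact keys as a sketch reconstructed from the configure prompts rather than a reference, and check the file humbug configure actually writes for your version:

```json
{
    "source": {
        "directories": ["src"]
    },
    "timeout": 10,
    "logs": {
        "text": "humbuglog.txt",
        "json": "humbuglog.json"
    }
}
```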

Once that’s in place, you run Humbug with humbug run. Let’s take a look at some code samples to go along with all of this.

Listing 1 is the ValueObject class. For right now, it’s just a simple class that holds some value in a variable named foo. There are a setter and a getter for foo, which will be the subject of our tests.

<?php

namespace LevelUp;

class ValueObject
{
    private $foo;

    /**
     * @return mixed|null
     */
    public function getFoo()
    {
        return $this->foo;
    }

    /**
     * Sets the foo value for the object
     *
     * @param mixed $foo
     *
     * @return $this
     */
    public function setFoo($foo)
    {
        $this->foo = $foo;
        return $this;
    }
}

Listing 2 is the test for the ValueObject class. It’s testing that whatever is passed into the setter for foo is what we get back out from the getter. With these two files in place, along with the appropriate autoloading and phpunit.xml config, we have a passing unit test suite with a single test. We have to have a passing unit test suite for Humbug to work.

<?php
namespace LevelUpTest;

use LevelUp\ValueObject;

class ValueObjectTest extends \PHPUnit_Framework_TestCase
{
    /** @var ValueObject */
    private $valueObject;

    public function setUp()
    {
        $this->valueObject = new ValueObject();
    }

    public function testItCanStoreAFoo()
    {
        $foo = uniqid('something_');
        $this->valueObject->setFoo($foo);
        $this->assertEquals($foo, $this->valueObject->getFoo());
    }
}

Listing 3 shows the result of humbug run. Initially, Humbug executes your test suite and looks at all the code mentioned in the coverage report. If you’ve enabled PHPUnit to process all uncovered files from your whitelist, you’ll potentially see a whole lot of work being done. Humbug examines the listed lines of code and looks for potential mutations. We’ll talk about what those mutations are shortly. Once it has identified all the mutations, it will run only the ones that involve code that is covered. For this example, that’s all of them, because there is so little code. The M in the output indicates that the mutation that was generated was not caught by our tests. Humbug also includes a few metrics to let us know how things went. The Mutation Score Indicator (MSI) shows how many of the mutations were caught, or killed. In this example, it’s 0%, since they all escaped. Mutation Code Coverage indicates how much of the codebase is covered by the mutations that are executed. Finally, the Covered Code MSI indicates how many of the executed mutants were killed or detected.

 _  _            _
| || |_  _ _ __ | |__ _  _ __ _
| __ | || | '  \| '_ \ || / _` |
|_||_|\_,_|_|_|_|_.__/\_,_\__, |
                          |___/
Humbug version 1.0.0-alpha1-18-gd102496

Humbug running test suite to generate logs and code coverage data...

    1 [==========================================================]  1 sec

Humbug has completed the initial test run successfully.
Tests: 1 Line Coverage: 100.00%

Humbug is analysing source files...

Mutation Testing is commencing on 2 files...
(.: killed, M: escaped, S: uncovered, E: fatal error, T: timed out)

M

1 mutations were generated:
       0 mutants were killed
       0 mutants were not covered by tests
       1 covered mutants were not detected
       0 fatal errors were encountered
       0 time outs were encountered

Metrics:
    Mutation Score Indicator (MSI): 0%
    Mutation Code Coverage: 100%
    Covered Code MSI: 0%

Remember that some mutants will inevitably be harmless (i.e. false positives).

Time: 232 milliseconds Memory: 7.00MB
Humbug results are being logged as JSON to: humbuglog.json
Humbug results are being logged as TEXT to: humbuglog.txt

The log files that are generated are very useful as well. While they don’t show every potential mutation found or run, they do show each mutation that escaped and what the change was. Listing 4 shows part of the log for our first run with the sample code. The text from Listing 3 showed what appears on the command line when Humbug is run; that same output also appears in the Humbug log, but I’ve removed it for inclusion here. Looking at Listing 4, we can see that the mutation that was not detected or killed transformed the return $this; in our setter into return null;. Since the test didn’t verify that the method is fluent, changing the setter to be non-fluent did not induce a test failure, and the mutation was not caught.

------
Escapes
------


1) \Humbug\Mutator\ReturnValue\This
Diff on \LevelUp\ValueObject::setFoo() in /Users/davidstockton/Projects/humbug_demo/src/LevelUp/ValueObject.php:
--- Original
+++ New
@@ @@
         $this->foo = $foo;
-        return $this;
+        return null;
     }
 }

Unlike a PHPUnit run, where dots are typically the only good thing and all you want to see, a Humbug run is different. Dots are still the best, but E and T may be acceptable as well. E indicates that the mutation resulted in a fatal error when the tests were run. A T indicates that the test timed out, which can happen because a mutation could introduce an infinite loop. There are two other letters that may appear. An S indicates that none of the tests covered that line of code (or, specifically, the mutation). An M means that a mutation was found and the code was changed, but the tests still passed. We want to avoid escaped mutants. For our simple value object, we need to add another test or assertion to ensure that the setter is fluent.

We can update the setter test to this:

    public function testItCanStoreAFoo()
    {
        $foo = uniqid('something_');
        $result = $this->valueObject->setFoo($foo);
        $this->assertEquals($foo, $this->valueObject->getFoo());
        $this->assertSame($this->valueObject, $result);
    }

Now if we run Humbug again, we get a single dot. We also achieve 100% in all of the metrics, indicating that we’ve got 100% traditional code coverage as well as coverage that ensures all mutations were caught and killed.

Humbug Mutations

There are a number of code patterns that Humbug looks for to introduce mutations. We’ve already talked about changing a return $this; to return null;. Additionally, Humbug will change + signs to - and vice versa. It will exchange * and /. The modulus operator (%) will become multiplication. Raising to a power with ** will turn into a division operation. It will change += to -= and *= to /=. In bitwise operators, it will exchange & and | and swap the direction of the >> and << bitshift operators.

For logic changes, true and false, && and || are never safe when Humbug is around. It will swap them for the other where they are found. The same goes for and and or. If it finds the not (!) operator, it will simply remove it and see if your tests notice. For inequalities, a strict greater than (>) is transformed to a >=, while a < turns into a <=. The opposite directions also apply. It will change conditionals like == to their negated != versions. The increment (++) and decrement (--) operators will be swapped too.
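Those inequality swaps are particularly good at exposing missing boundary tests. As a sketch (the function and values are mine, not Humbug’s), consider a test that only probes values far from the boundary: the > to >= mutant escapes, and only an assertion at the boundary itself can kill it:

```php
<?php

// Code under test: only values strictly greater than 100 qualify.
function isLarge($n)
{
    return $n > 100;
}

// A weak test: both inputs sit far from the boundary, so swapping
// the > for >= (as Humbug would) changes nothing this test can see.
assert(isLarge(500) === true);
assert(isLarge(3) === false);

// The assertion that kills that mutant probes the boundary itself;
// under the >= mutation, isLarge(100) would return true and fail here.
assert(isLarge(100) === false);
```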

It will mutate return values, as we saw before: it will swap return true and return false, as well as changing things like return 0 to return 1. In addition, any return (anything); will be changed to (anything); return;. Some literal numbers, like 0 and 1, will be changed wherever they’re found. And there are more mutations in these categories that I haven’t listed. In short, Humbug will identify and potentially make a lot of changes to the code you’re testing to see if your tests really verify that the code does what it should. This means that the more code you’re testing, especially code that falls into any of the above categories, the more mutations can be identified and executed. While this isn’t unbearably slow, it’s definitely not fast. If your test suite normally takes 5 or 10 seconds to run, running the mutation tests could take 5 or 10 times longer.

Using Humbug to Improve Your Tests

The code I work with most often is a Zend Framework 2 application. Each module has its own test suite, separated from all the other test suites. One thing I found helpful when working with Humbug was to pick a module, run Humbug, and then use the output to improve the tests. For a module with no actual tests, I might see an entire block of S results, indicating there’s no code coverage to start with. So I pick a class and set about writing tests for its methods. Then I run Humbug again and see how the output changes. Whatever lines of code I’ve managed to cover with PHPUnit will now change from an S in Humbug to something else. In some cases, I may get lucky and get a dot (.) from the start. In other cases, I may see some escaped mutants, the M results. There may be some E or T results as well.

I then concentrate on killing as many mutants as I can. Ideally, I don’t want to stop while there are any M results left. The log file is very useful here, because it shows a diff of the bit of code that changed for each escaped mutant, and I can use that to devise a test that will catch and kill the mutant. In some cases, the best I can do is an E or a T. Some of the time, I may make an effort to restructure the original code in such a way that I can convert those to a .. Once I’ve had enough for the day, I remove my Humbug configs and logs and commit the test and code changes into source control.

I haven’t been running Humbug automatically on the continuous integration servers, because I feel it would dramatically increase the run time of the builds, and we’re not always going to be paying attention to the Humbug output. Instead, I’d rather focus on improving the tests a bit at a time and then ensuring that the results of those exercises are run automatically by the CI server. That seems to bring the biggest bang for the buck with this type of testing.

Conclusion

Mutation testing with Humbug is a powerful way to ensure that your tests are worthwhile and are actually testing your code in a meaningful way. If your tests are not able to detect bugs when they are introduced, they may be giving you a false sense of security. Even though Humbug is still technically alpha, I recommend trying it out on your own code to see how well your tests actually test it. If you’re not running PHPUnit but have tests in another framework like PHPSpec, unfortunately, at least for right now, you’re out of luck. There is an effort to expand Humbug beyond PHPUnit, but for now that’s what’s there. If you’re interested in reading more on the topic, please check out Pádraic Brady’s blog. See you next month.