Wednesday, June 17, 2009

Driving unit test takeup with code coverage

For anyone who has not fully gotten organised with unit testing and code coverage, this is for you! :) My project involved a large inherited codebase which has good black-box testing, but little unit testing and no coverage metrics. Tackling unit testing always seemed an impossible task to me -- how do you take 96,000 lines of Python and 1,000,000 lines of C, and build unit tests? (NB: the lines-of-C count seems high, but it's what 'wc' said.)

The general advice is not to try -- but to get traction by just writing one unit test for the next bit of new code you write. Then, the next bit, etc. Eventually, you will have figured out unit testing and will then make an appropriate judgment on what to do about the body of untested code.

I have typically found this to be quite a large hill to climb. I work on only a subset of the code, which practically requires me to invoke most of the application just to reach my part. Most of my methods require so much setup that tackling unit testing seemed infeasible without first thinking through how to do it in a sane way. Setting up and tearing down my application for every test was just not going to fly if I wanted to put a lot of unit tests in place -- I reckon the setup would have cost between 1 and 7 minutes per test!

This got relegated to the too-hard basket, and set aside until now. Here's how I found a way to get traction.

What turns out to be pretty straightforward is integrating coverage testing. You more-or-less just switch it on, and it will record to a file across multiple program invocations. This can be used to accumulate coverage across test scripts, or user testing, or development mode use, or indeed in operations.
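To make the idea concrete, here's a minimal stdlib-only sketch of what such a tool does under the hood (the file name and function names are my own inventions, not any particular tool's API): a line tracer that merges the lines it saw into a data file, so hits accumulate across separate program runs:

```python
import json
import os
import sys

DATA_FILE = "coverage_lines.json"  # hypothetical persistent data file

def _load_hits():
    """Read previously recorded line hits, if any."""
    if os.path.exists(DATA_FILE):
        with open(DATA_FILE) as f:
            return {name: set(lines) for name, lines in json.load(f).items()}
    return {}

def run_traced(func, *args, **kwargs):
    """Run func under a line tracer, merging executed lines into DATA_FILE."""
    hits = _load_hits()

    def tracer(frame, event, arg):
        if event == "line":
            name = frame.f_code.co_filename
            hits.setdefault(name, set()).add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)  # always uninstall the tracer
    with open(DATA_FILE, "w") as f:
        json.dump({name: sorted(lines) for name, lines in hits.items()}, f)
    return result
```

Because `run_traced` merges rather than overwrites, runs from test scripts, user testing, and development sessions all contribute to the same totals -- which is exactly the property that makes this approach so easy to switch on. Real coverage tools do the same thing, just much faster.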

I ran this through about half my black-box tests, and found I was sitting at around 62% code coverage. That's not too shabby, I reckon! I know for a fact that there are quite large chunks of code which contain functionality which is not part of any operational code path, but is preserved for possible future needs. I estimate 25% of our code falls into that category; excluding it from the denominator lifts the effective coverage to around 83% (62/75) for the sub-area of code which I work on. Now I've got numbers that look like a challenge, rather than an insoluble problem!

I think that's the key... make lifting code metrics an achievable challenge, and it will seem more attractive. It's probably important not to target a particular number. I know '100% or bust' may be what some advocates would put forward, but for anyone new to the concept, I personally feel that simply measuring the current coverage, then understanding where that number comes from and what it means, is the more important achievement.

What is clear is that I'm not going to easily lift my coverage metrics beyond a certain point simply through tactical black-box testing. I'm going to have to write tests which go through the code paths which aren't part of the operational configuration. I'm going to have to write tests with very specific setup conditions in order to get to lines of code which are designed to handle very specific conditions. All of a sudden, I've got achievable and well-defined goals for unit testing.
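A targeted test for one of those non-operational code paths might look something like this (`parse_rate` is a made-up toy, standing in for real code with a rarely exercised branch):

```python
import unittest

def parse_rate(text):
    """Toy stand-in for real code with a rarely exercised branch."""
    if not text.strip():
        raise ValueError("empty rate")
    return float(text)

class TestParseRateEdgeCases(unittest.TestCase):
    def test_empty_input_raises(self):
        # This branch never runs in the operational configuration,
        # so only a targeted test like this pulls it into coverage.
        with self.assertRaises(ValueError):
            parse_rate("   ")

    def test_plain_number(self):
        self.assertEqual(parse_rate("2.5"), 2.5)
```

The point isn't the toy function; it's that each uncovered branch tells you exactly which setup conditions the next test needs to construct.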

I call that a win!



  1. Nice blog post.

    That technique sounds good.

    Another one is to generate test stubs for code. That is, you generate a whole bunch of failing tests for each of your functions/classes/modules.

    Make them fail in a special TODO way, so you can easily skip the TODO tests.

    This makes it much easier to fill in the tests as people go, and also gives you another metric.

    It doesn't address your case though - where you need big parts of the app running to test new bits.

    Anything to make test writing easier, and to also give you something to measure is good I think.

  2. Be warned that driving a metric up can be way wrong. For instance, some people can/will/do write huge integration tests with no asserts that exercise a large chunk of code but prove nothing.

    Also note that when you are working with TDD in legacy code, code coverage is really only telling you the % of code you've touched v. the total code base: 10% may mean 100% coverage of the 10% you've TDDed.

    Otherwise, more power to you. Just make sure the measurement doesn't become the goal.

  3. Thanks for the comments, all.

    Firstly, a huge test which exercises a large body of code does prove something, even though I fully understand why it's bad. It's still better to prove that those lines of code can run successfully than not to do so. However, the test fails to target any particular aspect of functionality. It's just an 'I didn't crash' test, not an 'I work as expected' test.

    However, I would say it's still better to do that than to do nothing. It kind of depends on the context. For example, I would be horrified if someone developed tests and an application from the ground up in that manner. However, I think it's a pretty fair starting point to take a legacy application, write some monolithic tests just to get your rig going, then attempt to transition towards something more meaningful.

    It's nothing like 'best practise' or useful at targeting functionality, but as a means of introducing the concept it's valid. For some people it's probably a real surprise (especially for non-IT managers or people who haven't come across it before) to even see these kinds of statistics, let alone be able to understand the merits of targeted testing straight away.

    For example, in my app, the tests are at least input -> expected output, rather than run -> please don't crash. However, they're not testing particular methods, just whole-system behaviour. But without starting at the very beginning, we'll never make it to a more targeted test system.

    Your comment, though, fills me with horror at the idea of outsourcing development and relying on code metrics alone to track progress. I think you really need to understand what the numbers mean rather than just assuming 'more is better'. It's also a way to transition towards Agile (if that's what you want).

    As you point out, metrics like code coverage are fractal. (i.e. the % of methods which are 100% covered will be different to the % coverage of the whole codebase). I think there is a lot more room for good metrics to aid in the understanding of code robustness.

    I like the idea that you could get 200% coverage if every line was touched twice so long as at least one used variable was different (i.e. you are in a different code path).

    Anyway, thanks all for your responses!!
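A follow-up on the first comment's stub idea, since it's worth sketching: you could introspect a class and emit one skipped TODO test per public method, so the gaps are visible and countable. (`Billing` and `make_todo_case` are invented names for illustration, and I'm using unittest's skip support as the 'special TODO way'.)

```python
import inspect
import unittest

class Billing:
    """Toy stand-in for a real class that lacks tests."""
    def charge(self, amount): ...
    def refund(self, amount): ...

def make_todo_case(cls):
    """Build a TestCase with one skipped TODO test per public method."""
    attrs = {}
    for name, _ in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private/dunder methods
        def todo(self, _name=name):
            self.skipTest("TODO: write test for %s" % _name)
        attrs["test_" + name] = todo
    return type("Test%sTodo" % cls.__name__, (unittest.TestCase,), attrs)

TestBillingTodo = make_todo_case(Billing)
```

Run normally, these show up as skips rather than failures, so the count of remaining TODOs becomes another metric you can drive down over time.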