Writing Unit Tests That Actually Catch Bugs


A lot of unit tests are written primarily to satisfy a coverage metric. The test runs, it passes, the coverage number goes up, and the developer moves on. This is not malicious - it’s usually just what happens when tests are added after the fact, when the behavior is already known to be correct.

The problem is that these tests don’t catch bugs. They pin down whatever the code currently does, so they pass whether that behavior is correct or not, and they add noise when you refactor.

A test that catches bugs is different in a specific way: it specifies what the behavior should be, not what the behavior currently is. When the behavior diverges from the specification, the test fails. When a developer makes a change that accidentally breaks something, the test catches it.

This distinction sounds obvious and is easy to forget in practice.

What to Test

The clearest unit test targets are functions with these properties:

  • They take inputs and return outputs
  • The behavior is deterministic
  • The interesting cases include edge conditions, boundary values, and error handling

Take a function that calculates order totals with discounts:

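from decimal import Decimal

# Item and Coupon are the application's own types (not shown here).
def calculate_total(items: list[Item], coupon: Coupon | None = None) -> Decimal:
    # Handle the empty order before doing any arithmetic.
    if not items:
        return Decimal("0")
    # Start the sum at Decimal("0") so the result is always a Decimal.
    subtotal = sum((item.price * item.quantity for item in items), Decimal("0"))
    if coupon and coupon.is_valid():
        discount = subtotal * coupon.discount_rate
        return subtotal - discount
    return subtotal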

The interesting test cases aren’t just “it returns the right number for a normal order.” They’re:

  • Empty item list
  • Coupon applied correctly
  • Expired coupon not applied
  • Coupon with zero discount
  • Items with zero quantity

Tests for the first three:

def test_calculate_total_empty_items():
    assert calculate_total([]) == Decimal("0")

def test_calculate_total_applies_valid_coupon():
    items = [Item(price=Decimal("100"), quantity=2)]
    coupon = Coupon(discount_rate=Decimal("0.1"), expires=future_date())
    result = calculate_total(items, coupon)
    assert result == Decimal("180")  # 200 - 10% = 180

def test_calculate_total_ignores_expired_coupon():
    items = [Item(price=Decimal("100"), quantity=1)]
    coupon = Coupon(discount_rate=Decimal("0.5"), expires=past_date())
    result = calculate_total(items, coupon)
    assert result == Decimal("100")  # No discount applied

Each test specifies an expected behavior. If someone accidentally changes is_valid() to always return True, or changes the discount calculation, one of these tests will fail. That’s the job.
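The two remaining cases from the list (zero-discount coupon, zero-quantity items) could be covered along these lines. This is a sketch: the Item and Coupon dataclasses, the is_valid rule, and future_date are hypothetical stand-ins written only so the example runs on its own, and calculate_total is restated with the same logic as above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal
from typing import Optional


# Hypothetical stand-ins for the article's Item and Coupon types.
@dataclass
class Item:
    price: Decimal
    quantity: int


@dataclass
class Coupon:
    discount_rate: Decimal
    expires: date

    def is_valid(self) -> bool:
        # Assumed rule: a coupon is valid through its expiry date.
        return self.expires >= date.today()


def future_date() -> date:
    return date.today() + timedelta(days=30)


def calculate_total(items: list[Item], coupon: Optional[Coupon] = None) -> Decimal:
    if not items:
        return Decimal("0")
    subtotal = sum((item.price * item.quantity for item in items), Decimal("0"))
    if coupon and coupon.is_valid():
        return subtotal - subtotal * coupon.discount_rate
    return subtotal


def test_calculate_total_zero_discount_coupon_leaves_total_unchanged():
    items = [Item(price=Decimal("100"), quantity=2)]
    coupon = Coupon(discount_rate=Decimal("0"), expires=future_date())
    assert calculate_total(items, coupon) == Decimal("200")


def test_calculate_total_zero_quantity_item_contributes_nothing():
    items = [
        Item(price=Decimal("100"), quantity=0),
        Item(price=Decimal("25"), quantity=2),
    ]
    assert calculate_total(items) == Decimal("50")
```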

Test Doubles: What They’re For and Where They Go Wrong

A test double is anything that replaces a real dependency in a test: mocks, stubs, fakes, spies. They exist for a specific reason: to isolate the unit under test from dependencies that make tests slow (databases, network calls) or non-deterministic (current time, random numbers).

The mistake is using test doubles to avoid the difficulty of setting up a real test environment, then ending up with tests that verify the mock rather than the code.

# This test is mostly testing that you know how to use Mock
def test_get_user():
    mock_repo = Mock()
    mock_repo.find_by_id.return_value = User(id=1, name="Alice")
    service = UserService(mock_repo)

    user = service.get_user(1)

    mock_repo.find_by_id.assert_called_once_with(1)
    assert user.name == "Alice"

This test will pass even if UserService.get_user does nothing but forward the call to the repository’s find_by_id, with no logic at all. If there is business logic in get_user - authorization checks, data transformation, caching - you’re not testing any of it unless you’ve specifically set up the mock to exercise it.

The better question to ask before reaching for a mock: what would fail if the behavior I’m testing is wrong? If the answer is “nothing, because the mock always does what I tell it to,” the test probably isn’t testing anything useful.

Use real objects wherever the test is fast and deterministic. Use in-memory implementations (fakes) where a real implementation is too expensive. Use mocks where you genuinely need to verify that an interaction happened (an event was emitted, an email was sent).
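As a rough illustration of that split, here is a sketch with an in-memory fake for the repository and a mock only for the interaction that needs verifying. The UserService logic (hiding deactivated users, recording an audit event), the InMemoryUserRepo class, and the audit.record call are all assumptions invented for this example.

```python
from dataclasses import dataclass
from unittest.mock import Mock


@dataclass
class User:
    id: int
    name: str
    active: bool = True


class InMemoryUserRepo:
    """Fake: a working repository that keeps users in a dict."""

    def __init__(self, users=()):
        self._users = {u.id: u for u in users}

    def find_by_id(self, user_id):
        return self._users.get(user_id)


class UserService:
    # Hypothetical service with logic worth testing: deactivated users
    # are hidden, and each successful lookup records an audit event.
    def __init__(self, repo, audit):
        self._repo = repo
        self._audit = audit

    def get_user(self, user_id):
        user = self._repo.find_by_id(user_id)
        if user is None or not user.active:
            return None
        self._audit.record("user_viewed", user_id)
        return user


def test_get_user_hides_deactivated_users():
    repo = InMemoryUserRepo([User(id=1, name="Alice", active=False)])
    service = UserService(repo, audit=Mock())
    assert service.get_user(1) is None


def test_get_user_records_audit_event_for_active_user():
    repo = InMemoryUserRepo([User(id=2, name="Bob")])
    audit = Mock()  # a mock fits here: the interaction itself is the behavior
    service = UserService(repo, audit)
    assert service.get_user(2).name == "Bob"
    audit.record.assert_called_once_with("user_viewed", 2)
```

Both tests fail if the corresponding logic is removed from get_user, which is the property the earlier mock-only test lacked.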

Arrange, Act, Assert

Every test has three parts: setting up the state (Arrange), performing the action (Act), and checking the result (Assert). Making this structure explicit in your tests keeps them readable and makes failures clear.

def test_user_cannot_post_when_suspended():
    # Arrange
    user = User(id=1, status="suspended")
    forum = Forum()

    # Act
    result = forum.post(user, content="Hello")

    # Assert
    assert result.success is False
    assert result.error == "Account suspended"

When a test fails, you know: “the result of posting with a suspended user was not what we expected.” You don’t have to trace through five levels of setup to understand what the test was checking.

Keep each test focused on one behavior. A test that checks five things at once is harder to name, harder to read, and when it fails, harder to interpret. If you find yourself using “and” in a test name - “test_calculate_total_with_coupon_and_empty_items” - consider whether it should be two tests.
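A sketch of that split, written as two focused tests rather than one test_post_rejects_suspended_user_and_empty_content. The Forum, User, and PostResult definitions here are hypothetical, and the empty-content rule is an assumption added only to have a second behavior to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: int
    status: str = "active"


@dataclass
class PostResult:
    success: bool
    error: Optional[str] = None


class Forum:
    # Hypothetical implementation, just enough to make the tests runnable.
    def post(self, user: User, content: str) -> PostResult:
        if user.status == "suspended":
            return PostResult(success=False, error="Account suspended")
        if not content.strip():
            return PostResult(success=False, error="Empty post")
        return PostResult(success=True)


def test_post_rejects_suspended_user():
    # Arrange
    user = User(id=1, status="suspended")
    # Act
    result = Forum().post(user, content="Hello")
    # Assert
    assert result.error == "Account suspended"


def test_post_rejects_empty_content():
    # Arrange
    user = User(id=2)
    # Act
    result = Forum().post(user, content="   ")
    # Assert
    assert result.error == "Empty post"
```

If the suspension check breaks, only the first test fails, and its name says what went wrong.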

What Not to Test

There’s a class of tests that make code harder to change without providing meaningful coverage:

Private implementation details: if a public method produces the right result across the cases you test, the private methods it calls are covered by implication. Testing private methods directly couples your tests to implementation structure. When you refactor - splitting a method, renaming an internal helper - these tests break even though the behavior didn’t change.

Framework behavior: if you’re using a well-tested serialization library, you don’t need a test that it serializes correctly. Test that your code passes the right data to the library.

Trivial getters and setters: user.get_name() returning self.name doesn’t need a test.

Exact logging output: log messages change. Tests that assert exact log message strings will fail every time you improve a message, adding maintenance work with no reliability benefit.

The filter: would this test catch a real bug that could ship to production? If you can’t construct a plausible scenario where this test would fail while the software is actually wrong, the test probably isn’t pulling its weight.
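For the private-implementation point above, a small sketch of testing through the public method only. PriceFormatter and its _split_cents helper are hypothetical names made up for this example.

```python
class PriceFormatter:
    # Hypothetical class: format_price is the public contract,
    # _split_cents is an internal helper that may be renamed or inlined.
    def format_price(self, cents: int) -> str:
        dollars, remainder = self._split_cents(cents)
        return f"${dollars}.{remainder:02d}"

    def _split_cents(self, cents: int) -> tuple[int, int]:
        return divmod(cents, 100)


def test_format_price_pads_single_digit_cents():
    assert PriceFormatter().format_price(1205) == "$12.05"


def test_format_price_handles_amounts_under_one_dollar():
    assert PriceFormatter().format_price(7) == "$0.07"
```

Neither test mentions _split_cents, so inlining or renaming the helper leaves them green, while a real formatting bug still turns them red.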

Test Names as Specifications

A test name that describes the expected behavior reads like a specification:

test_calculate_discount_returns_zero_for_empty_cart
test_calculate_discount_applies_percentage_to_subtotal
test_calculate_discount_ignores_expired_coupons
test_calculate_discount_does_not_apply_discount_above_maximum

When one of these fails, you immediately know what broke. When you’re reading a test suite you didn’t write, these names tell you what the code is supposed to do.

Compare with:

test_discount_1
test_discount_edge_case
test_discount_happy_path

These tell you nothing. When test_discount_edge_case fails, you have to read the test body to find out what edge case was being tested and what went wrong.

Name tests by the behavior they verify: test_[unit]_[expected_result]_[scenario], as in the first list above. The order of the last two parts matters less than having both. It’s verbose but precise.

The Test That Earns Its Place

The measure of a good test suite is this: when someone makes a mistake that would affect users, does a test fail?

Not: do we have 80% coverage? Not: do all tests pass? Those are inputs. The output is: does the suite detect regressions?

A small set of tests that fail precisely when behavior is wrong is more valuable than a large set of tests that pass no matter what. Write the tests that would have caught the last three bugs you shipped to production. Ignore coverage numbers that don’t correspond to actual risk.
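A regression test written after a shipped bug might look like the following sketch. The bug is hypothetical: assume an earlier parse_quantity called int(text) directly and raised ValueError on inputs such as "1,000".

```python
def parse_quantity(text: str) -> int:
    # Hypothetical fixed version. The assumed earlier version called
    # int(text) directly, which raised ValueError on "1,000".
    return int(text.replace(",", "").strip())


def test_parse_quantity_accepts_thousands_separator():
    # Regression test for the hypothetical bug described above.
    assert parse_quantity("1,000") == 1000


def test_parse_quantity_ignores_surrounding_whitespace():
    assert parse_quantity(" 42 ") == 42
```

The first test exists because of a specific failure users would have seen, which is a stronger justification than raising a coverage number.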


