golds-gym 🏋️

Golden testing for performance benchmarks. Save timing baselines on first run, compare against them on subsequent runs.

Key Features:

  • Architecture-specific baselines (different hardware = different golden files)
  • Hybrid tolerance (handles both fast <1ms and slow operations)
  • Robust statistics mode (outlier detection, trimmed mean)
  • Lens-based custom expectations (assert "must be faster", compare by median, etc.)

Quick Start

import Test.Hspec
import Test.Hspec.BenchGolden
import Data.List (sort)

main :: IO ()
main = hspec $ do
  describe "Performance" $ do
    -- Pure function with normal form evaluation (deep, full evaluation)
    benchGolden "list append" $
      nf (\xs -> xs ++ xs) [1..1000]

    -- Weak head normal form (shallow, outermost constructor only)
    benchGolden "replicate" $
      whnf (replicate 1000) 42

    -- Custom configuration
    benchGoldenWith defaultBenchConfig
      { iterations = 500
      , tolerancePercent = 10.0
      }
      "sorting" $
      nf sort [1000, 999..1]

Evaluation strategies (required; each specifies how values are forced):

  • nf f x - Force result of f x to normal form (deep, full evaluation)
  • whnf f x - Force result of f x to weak head normal form (shallow, outermost constructor only)
  • nfIO action - Execute IO action and force result to normal form
  • whnfIO action - Execute IO action and force result to WHNF
  • nfAppIO f x - Apply function, execute resulting IO, force result to normal form
  • whnfAppIO f x - Apply function, execute resulting IO, force result to WHNF
  • io action - Plain IO action without additional forcing

Why evaluation strategies matter: Without forcing, GHC may optimize away computations or share results across iterations, making benchmarks meaningless. Use nf for most cases unless you specifically want lazy evaluation (whnf).
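For IO-bound code, the IO variants follow the same pattern. A minimal sketch using the combinators listed above (the config path and fetchUser are illustrative placeholders, not part of the library):

-- Run the action each iteration, deeply forcing the resulting String
benchGolden "read config file" $
  nfIO (readFile "config.yaml")

-- Apply fetchUser to the key, run the resulting IO, force only to WHNF
benchGolden "fetch user" $
  whnfAppIO fetchUser "user-42"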

The first run creates .golden/<arch>/list-append.golden with baseline stats.
Subsequent runs compare against that baseline; the test fails if the mean time drifts beyond tolerance (default: ±15% OR ±0.01 ms).

Output format:

Metric  Baseline  Actual    Diff
------  --------  ------    ----
Mean    0.150 ms  0.170 ms  +13.3%

Update baselines after intentional changes:

GOLDS_GYM_ACCEPT=1 stack test

How It Works

Golden files store timing statistics per architecture (e.g., .golden/aarch64-darwin-Apple_M1/):

{
  "mean": 1.234,
  "stddev": 0.056,
  "median": 1.201,
  "architecture": "aarch64-darwin-Apple_M1",
  "timestamp": "2026-01-30T12:00:00Z"
}

Hybrid tolerance (default) prevents false failures: a benchmark passes if its mean is within ±15% OR within ±0.01 ms of the baseline. This absorbs measurement noise on fast operations (<1ms) while still catching real regressions in slower code. For example, a 0.050 ms baseline drifting to 0.058 ms (+16%) still passes because the absolute change (0.008 ms) is under 0.01 ms, while a 5 ms baseline drifting by the same 16% fails.

Configuration

Key BenchConfig options:

Field                Default    Description
-----                -------    -----------
iterations           100        Number of benchmark iterations
tolerancePercent     15.0       Allowed mean time deviation (%)
absoluteToleranceMs  Just 0.01  Absolute tolerance (ms); enables hybrid mode
useRobustStatistics  False      Use trimmed mean/MAD instead of mean/stddev
warmupIterations     5          Warm-up runs before measurement

See BenchConfig type for all options.
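A configuration combining these options might look like the following sketch (record-update syntax over the library defaults; the field values are illustrative):

noisyConfig :: BenchConfig
noisyConfig = defaultBenchConfig
  { iterations          = 200         -- more samples for tighter statistics
  , warmupIterations    = 10          -- let caches and the RTS settle first
  , tolerancePercent    = 20.0        -- looser relative bound
  , absoluteToleranceMs = Just 0.02   -- hybrid mode: also pass within ±0.02 ms
  }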

Environment variables:

  • GOLDS_GYM_ACCEPT=1 - Regenerate all golden files
  • GOLDS_GYM_SKIP=1 - Skip benchmarks entirely (useful in CI)
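For example, a CI job that runs the rest of the suite but skips timings:

GOLDS_GYM_SKIP=1 stack test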

Advanced: Robust Statistics

Standard mean/stddev are sensitive to outliers (GC pauses, OS scheduling). Robust statistics provide outlier-resistant comparisons:

benchGoldenWith defaultBenchConfig
  { useRobustStatistics = True  -- Use trimmed mean + MAD
  , trimPercent = 10.0          -- Remove top/bottom 10%
  , outlierThreshold = 3.0      -- Flag outliers >3 MADs from median
  }
  "noisy benchmark" $
  nf computation input

When to use:

  • Benchmarking in noisy environments (shared CI, development machines)
  • Operations with occasional GC pauses or system interruptions
  • Fast operations (<1ms) with high variance
  • Seeing outlier warnings in the test output

Advanced: Lens-Based Expectations

For fine-grained control, use lens-based expectations to assert custom performance requirements:

import Test.Hspec.BenchGolden.Lenses

-- Compare by median instead of mean (more robust)
benchGoldenWithExpectation "median comparison" defaultBenchConfig
  [expect _statsMedian (Percent 10.0)]
  (nf myAlgorithm input)

-- Compose multiple requirements (both must pass)
benchGoldenWithExpectation "strict requirements" defaultBenchConfig
  [ expect _statsMean (Percent 15.0) &&~
    expect _statsIQR (Absolute 0.1)     -- Low variance required
  ]
  (nf criticalFunction dataset)   -- note: `data` is a reserved keyword

Available lenses: _statsMean, _statsMedian, _statsTrimmedMean, _statsStddev, _statsMAD, _statsIQR, _statsMin, _statsMax

Tolerance types:

  • Percent 15.0 - Within ±15%
  • Absolute 0.01 - Within ±0.01ms
  • Hybrid 15.0 0.01 - Within ±15% OR ±0.01ms
  • MustImprove 10.0 - Must be ≥10% faster (for testing optimizations)
  • MustRegress 5.0 - Must be ≥5% slower (for accepting controlled regressions)

Composition: (&&~) for AND, (||~) for OR
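For example, an expectation that passes when a run is either within the usual hybrid bound or measurably faster (a sketch using the combinators above; myFunction and input are placeholders):

benchGoldenWithExpectation "steady or improved" defaultBenchConfig
  [ expect _statsMean (Hybrid 15.0 0.01) ||~
    expect _statsMean (MustImprove 5.0)   -- also fine if ≥5% faster
  ]
  (nf myFunction input)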

Documentation

Full API documentation (Haddock) is available on the package's Hackage page.

License

MIT