Changelog of @hackage/data-sketches 0.4.0.0

Changelog for data-sketches

0.4.0.0

New sketch families

  • KLL Sketch (DataSketches.Quantiles.KLL): Quantiles with uniform additive error bounds (~1.3% at k=200). C backend, faster than Java DataSketches at small-to-medium N.

  • HyperLogLog (DataSketches.Distinct.HyperLogLog): Cardinality estimation. ~1.6% error with 4KB (p=12). C backend.

  • Theta Sketch (DataSketches.Distinct.Theta): Distinct counting with set operations — union, intersection, difference. C backend.

  • Count-Min Sketch (DataSketches.Frequencies.CountMin): Frequency estimation that may overcount but never undercounts. C backend.

Bug fixes in REQ Sketch

  • Off-by-one in getCountWithCriterion: Binary search read past the active buffer region when spaceAtBottom=False (LowRanksAreAccurate mode), causing intermittent incorrect rank calculations from uninitialized memory.

  • growUntil in merge didn't loop: Only added one compactor level instead of recursing to the target. Silently dropped data from higher-level compactors when merging sketches with very different level counts.

  • Max value comparison inverted in merge: otherMax < thisMax instead of otherMax > thisMax caused the sketch to track the wrong maximum after merging.

  • compress captured compactors vector once: Replaced imapM_ over a snapshot with a loop that re-reads the vector each iteration, matching the Java implementation's dynamic size check.

Cross-validation

  • Hedgehog property-based tests comparing Haskell outputs against Java DataSketches 6.1.1 reference. Exact equality in exact mode (no compaction); independent ground-truth validation in estimation mode.

Performance

  • REQ insert: ~3x slower than Java (Haskell implementation with packed fields)
  • KLL insert/100: 13x faster than Java (C implementation)
  • KLL rank queries: 2.6x faster than Java
  • HLL insert/1k: 8x faster than Java

0.3.1.0

  • Initial public API for REQ sketch.

0.3.0.0

  • Internal refactoring.