Benchmarks

Tack is benchmarked against real open-source feature implementations. Each benchmark targets a specific feature in an existing project, with a frozen plan shape and deterministic validation.

How Benchmarks Work

  1. Target selection — a real feature from a real open-source project
  2. Frozen plan shape — the expected stream decomposition is fixed for apples-to-apples comparison (see the sketch after this list)
  3. Objective submitted — Tack runs discovery, planning, execution, review, and merge
  4. Validation — integration tests run against the merged result
  5. Report generated — full telemetry: streams, recovery events, duration, human interventions
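
To make the pipeline concrete, here is a minimal sketch of what a frozen benchmark definition could look like in Go. The type and field names are illustrative assumptions, not Tack's actual internal/benchmark/ API.

  // Hypothetical benchmark definition; the names below are assumptions,
  // not Tack's actual internal/benchmark/ types.
  package benchmark

  // Benchmark pins a target feature, a frozen plan shape, and a
  // deterministic validation command.
  type Benchmark struct {
      Name      string   // e.g. "lazygit-command-log-navigation"
      Repo      string   // upstream project the feature targets
      Objective string   // the feature Tack is asked to implement
      PlanShape []string // frozen stream decomposition, fixed up front
      Validate  []string // e.g. {"go", "test", "./pkg/gui/...", "-count=1"}
  }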

Results

lazygit: command-log navigation keybindings

Implement scroll mechanics, scoped keybindings, and regression coverage for the command log panel in lazygit.

  Metric                Value
  Streams               3 planned, 3 merged
  Recovery events       0
  Human interventions   0
  Duration              18m 22s
  Validation            go test ./pkg/gui/... -count=1 — passed

Every stream succeeded on first pass. No reviewer rejections, no quality-gate failures, no human guidance needed.

Full report

lazygit: undo basic commit/checkout

Narrow reflog undo to plain commit and checkout, add integration tests, and update user-facing docs. A harder benchmark involving core git operations.

  Metric                Value
  Streams               3 planned, 3 merged
  Recovery events       5 automatic (4 review rejections and 1 gate failure, all self-healed)
  Human interventions   0
  Duration              1h 6m 53s
  Final validation      headless integration tests — passed

The harness self-healed through 4 reviewer rejections and 1 gate failure without any human intervention. Final validation confirmed correct behavior against lazygit's canonical integration test suite.

Full report
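
As a rough illustration of the self-healing behavior described above, the sketch below shows a retry loop that treats reviewer rejections and quality-gate failures as recoverable events. Every name here (Stream, Execute, runStream) is a hypothetical stand-in; Tack's actual harness is not shown.

  // Hypothetical sketch of a self-healing retry loop; Stream and its
  // methods are illustrative stand-ins, not Tack's actual harness.
  package main

  import (
      "errors"
      "fmt"
  )

  type Stream struct {
      Name    string
      attempt int
  }

  // Execute stands in for running one work stream; it fails its first
  // two attempts to mimic reviewer rejections.
  func (s *Stream) Execute() error {
      s.attempt++
      if s.attempt <= 2 {
          return errors.New("reviewer rejected the change")
      }
      return nil
  }

  // runStream retries through review rejections and gate failures
  // until the stream passes or its retry budget is exhausted.
  func runStream(s *Stream, maxRetries int) error {
      for i := 0; i <= maxRetries; i++ {
          if err := s.Execute(); err != nil {
              fmt.Printf("recovery event %d: %v\n", i+1, err)
              continue // self-heal: revise and retry, no human input
          }
          return nil // passed; the stream is ready to merge
      }
      return fmt.Errorf("stream %q exhausted retries", s.Name)
  }

  func main() {
      s := &Stream{Name: "undo-basic-commit"}
      if err := runStream(s, 5); err != nil {
          fmt.Println(err)
      }
  }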

What This Proves

  • First-pass success is achievable — the command-log benchmark had zero recovery events
  • Self-healing works — the undo benchmark recovered from 5 failures automatically
  • Zero human intervention across both benchmarks — the harness handled everything
  • Real validation, not vibes — integration tests confirm merged code actually works

Methodology

  • Benchmarks use frozen plan shapes to prevent the planner from finding easier decompositions
  • Quality gates and validation commands are defined per benchmark (see the validation sketch after this list)
  • Benchmark reports are generated artifacts, not manually written summaries
  • The benchmark runner is part of Tack's codebase (internal/benchmark/)
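
As an example of what per-benchmark validation can look like, the sketch below runs a pinned command and reduces the outcome to pass/fail via exit status. The validate helper and the local checkout path are assumptions for illustration, not the actual runner in internal/benchmark/.

  // Hypothetical validation runner; validate and the paths used in
  // main are assumptions, not Tack's internal/benchmark/ API.
  package main

  import (
      "fmt"
      "os/exec"
  )

  // validate runs a benchmark's pinned command inside the merged
  // checkout and treats a nonzero exit status as failure.
  func validate(dir string, args ...string) error {
      cmd := exec.Command(args[0], args[1:]...)
      cmd.Dir = dir
      out, err := cmd.CombinedOutput()
      if err != nil {
          return fmt.Errorf("validation failed: %v\n%s", err, out)
      }
      return nil
  }

  func main() {
      // -count=1 bypasses Go's test cache so the run is deterministic.
      err := validate("./lazygit", "go", "test", "./pkg/gui/...", "-count=1")
      fmt.Println("passed:", err == nil)
  }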
