# Benchmarks
Tack is benchmarked against real open-source feature implementations. Each benchmark targets a specific feature in a real project, with a frozen plan shape and deterministic validation.
## How Benchmarks Work
- Target selection — a real feature from a real open-source project
- Plan freezing — the expected stream decomposition is fixed for apples-to-apples comparison
- Objective submission — Tack runs discovery, planning, execution, review, and merge
- Validation — integration tests run against the merged result
- Report generation — full telemetry: streams, recovery events, duration, human interventions
## Results
### lazygit: command-log navigation keybindings
Implement scroll mechanics, scoped keybindings, and regression coverage for the command log panel in lazygit.
| Metric | Value |
|---|---|
| Streams | 3 planned, 3 merged |
| Recovery events | 0 |
| Human interventions | 0 |
| Duration | 18m 22s |
| Validation | `go test ./pkg/gui/... -count=1` — passed |
Every stream succeeded on first pass. No reviewer rejections, no quality-gate failures, no human guidance needed.
### lazygit: undo basic commit/checkout
Narrow reflog undo to plain commit and checkout, add integration tests, and update user-facing docs. A harder benchmark involving core git operations.
| Metric | Value |
|---|---|
| Streams | 3 planned, 3 merged |
| Recovery events | 5 automatic (4 review rejections, 1 quality-gate failure) |
| Human interventions | 0 |
| Duration | 1h 6m 53s |
| Validation | headless integration tests — passed |
The harness self-healed through 4 reviewer rejections and 1 gate failure without any human intervention. Final validation confirmed correct behavior against lazygit's canonical integration test suite.
## What This Proves
- First-pass success is achievable — the command-log benchmark had zero recovery events
- Self-healing works — the undo benchmark recovered from 5 failures automatically
- Zero human intervention across both benchmarks — the harness handled everything
- Real validation, not vibes — integration tests confirm merged code actually works
## Methodology
- Benchmarks use frozen plan shapes to prevent the planner from finding easier decompositions
- Quality gates and validation commands are defined per benchmark
- Benchmark reports are generated artifacts, not manually written summaries
- The benchmark runner is part of Tack's codebase (`internal/benchmark/`)
## Learn More
- How It Works — the full pipeline from objective to PR
- Quality Gates — how gates validate agent output
- Recovery & Retries — how the harness self-heals