# Recovery & Retries
How Tack handles failures, retries, and human-guided recovery.
Failures happen. Agents produce code that does not compile, tests fail, reviewers reject implementations, and sandboxes crash. Tack's recovery system handles all of these automatically — and when it cannot, it pauses for your input instead of silently failing or looping forever.
## How Recovery Works
When a step fails, Tack records a recovery attempt in a durable ledger. Each attempt captures what went wrong, what action to take, and whether it succeeded. This ledger survives daemon restarts.
The recovery flow:

1. A step fails and Tack categorizes the failure.
2. Tack picks a recovery action for that failure kind and records the attempt in the ledger.
3. The action runs; on success the workflow resumes, on failure the next attempt begins.
4. When the retry budget is exhausted, the run blocks or escalates according to the blueprint.
## Failure Kinds and Actions
Not all failures are treated the same. Tack categorizes failures and picks the appropriate response:
| Failure Kind | What Happened | Recovery Action |
|---|---|---|
| `quality_gate_failure` | Lint, typecheck, or tests failed | Rerun the builder with gate output attached |
| `review_rejection` | Reviewer agent found problems | Rerun the builder with reviewer feedback |
| `agent_runtime_transient` | Agent process crashed or timed out | Retry the same step |
| `sandbox_failure` | Sandbox environment issue | Retry the same step |
| `provider_rate_limit` | LLM provider rate limit hit | Retry the same step (with backoff) |
| `merge_conflict` | Git merge conflicts during integration | Retry the merge |
| `post_merge_gate_failure` | Quality gates failed after merging | Restart the stream |
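The table above amounts to a lookup from failure kind to recovery action. A minimal sketch (the action identifiers other than `rerun_previous_agent`, which appears in the `tack watch` output below, are illustrative names, not Tack's internal ones):

```python
# Maps each failure kind to its recovery action, mirroring the table
# above. This is an illustrative model, not Tack's internal dispatcher.
RECOVERY_ACTIONS = {
    "quality_gate_failure":    "rerun_previous_agent",
    "review_rejection":        "rerun_previous_agent",
    "agent_runtime_transient": "retry_step",
    "sandbox_failure":         "retry_step",
    "provider_rate_limit":     "retry_step_with_backoff",
    "merge_conflict":          "retry_merge",
    "post_merge_gate_failure": "restart_stream",
}

def recovery_action(kind: str) -> str:
    """Pick the recovery action for a categorized failure."""
    return RECOVERY_ACTIONS[kind]

print(recovery_action("merge_conflict"))  # retry_merge
```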
## Retry Profiles
Blueprints define a retry profile that controls how aggressive automatic recovery is:
| Profile | Max Attempts | Behavior |
|---|---|---|
| `strict` | 2 | Fails fast, escalates quickly |
| `balanced` | 4 | Default for top-level workflows |
| `self_healing` | 6 | Default for build-review loops, keeps trying |
The shipped build-review blueprint uses `self_healing` because gate failures and reviewer rejections are expected and fixable. The top-level standard workflow uses `balanced` because failures at that level are more serious.
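The profiles reduce to a per-profile attempt budget. A minimal sketch of the budget check (an illustrative model; `within_budget` is not a Tack API):

```python
# Max automatic attempts per retry profile, as listed in the table above.
RETRY_PROFILES = {
    "strict": 2,        # fails fast, escalates quickly
    "balanced": 4,      # default for top-level workflows
    "self_healing": 6,  # default for build-review loops
}

def within_budget(profile: str, attempts_so_far: int) -> bool:
    """True if another automatic recovery attempt is allowed."""
    return attempts_so_far < RETRY_PROFILES[profile]

print(within_budget("strict", 2))        # False: budget exhausted
print(within_budget("self_healing", 2))  # True: four attempts remain
```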
You can override the profile in custom blueprints:
```yaml
id: build-review
retry:
  profile: strict
```

Or set per-step overrides:
```yaml
- id: lint
  type: deterministic
  action: run_quality_gates
  retry:
    max_attempts: 3
    on_exhausted: ask_human
  on_fail: build
  max_fix_iterations: 5
```

## What You See
### In `tack watch`
Recovery events appear in the live stream:
```
18:16:38 [stream-1] quality_gate_failure -> rerun_previous_agent (1/4)
18:16:45 [stream-1] recovery attempt: builder rerun with gate context
18:16:52 [stream-1] lint passed ✓
```

When retries exhaust:

```
18:16:52 [stream-1] recovery blocked: retry budget exhausted; awaiting human guidance
```

### In `tack mail <agent-name>`
Blocked recovery sends an escalation to the relevant agent mailbox:
```
tack mail builder-stream-1
```

This shows the failure details, the recovery attempts that were tried, and space for your guidance.
## Sending Guidance
When a run is blocked, you can send guidance to unblock it:
```
tack mail send builder-stream-1 "Guidance" "Check the import paths — the module moved to src/lib/" --objective <id>
```

Tack attaches your guidance to the next recovery attempt and resumes the run.
## Fix Loops vs Recovery
There are two related but distinct mechanisms:
**Fix loops** are blueprint-level cycles. When a quality gate fails, the `on_fail: build` directive sends the agent back to the builder step with the failure context. The `max_fix_iterations` field caps how many times this cycle runs.

**Recovery** is the durable ledger that records every attempt and can escalate to humans when automatic fixes are not working. Recovery wraps around fix loops: if the fix loop exhausts its iterations, recovery takes over and may block for human input.
In practice:
- Gate fails → fix loop sends agent back to builder (attempt 1)
- Gate fails again → fix loop sends agent back to builder (attempt 2)
- Fix loop exhausted → recovery ledger records the failure
- Recovery may retry the whole step or block for human guidance
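The interplay above can be sketched as a fix loop nested inside a recovery loop. This is a simplified model under assumed names — `run_gates` and `rerun_builder` are hypothetical stand-ins for the blueprint steps, not Tack functions:

```python
# Simplified model: an inner fix loop (on_fail: build) wrapped by an
# outer recovery loop with a retry budget. All names are illustrative.
def run_step_with_recovery(run_gates, rerun_builder,
                           max_fix_iterations=5, max_attempts=4,
                           on_exhausted="ask_human"):
    for attempt in range(1, max_attempts + 1):
        # Inner fix loop: a gate failure sends the agent back to the
        # builder step with the failure context attached.
        for _ in range(max_fix_iterations):
            if run_gates():
                return "succeeded"
            rerun_builder()
        # Fix loop exhausted: the recovery ledger records the failure,
        # and the outer loop may retry the whole step.
    # Retry budget exhausted: apply the blueprint's exhaustion mode.
    return "blocked_for_human" if on_exhausted == "ask_human" else on_exhausted

# A gate that starts passing on its third run succeeds inside the fix loop:
results = iter([False, False, True])
print(run_step_with_recovery(lambda: next(results), lambda: None))  # succeeded
```

An always-failing gate instead burns through every fix iteration and recovery attempt, then blocks for human guidance.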
## Exhaustion Modes
When all automatic recovery is exhausted, the blueprint defines what happens next:
| Mode | Behavior |
|---|---|
| `ask_human` | Run blocks. You receive an escalation and can send guidance to resume. |
| `escalate` | Run is marked as needing attention but continues trying other streams. |
| `fail` | Stream is marked as permanently failed. |
## Recovery After Daemon Restart
Recovery state is persisted to the database, not held in memory. If the daemon restarts:
- Active runs are reconciled against their last known state
- In-flight agent executions are requeued
- Blocked runs remain blocked with their full context intact
- You can resume by sending guidance via `tack mail send`
No recovery context is lost on restart.
## Objective-Level Outcomes
When recovery plays out across all streams, the objective ends in one of these states:
| State | Meaning |
|---|---|
| `completed` | All streams succeeded and merged |
| `partial` | Some streams succeeded, some failed; merged streams are preserved |
| `failed` | All streams failed or the run could not proceed |
| `blocked` | One or more streams are blocked waiting for human guidance |
For `partial` objectives, the merged work is preserved; create a follow-up objective for the remaining work. For `blocked` objectives, send guidance to unblock them.
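The outcome table reduces to a deterministic function of per-stream results. A minimal sketch, assuming stream results are reported as plain strings (`"merged"`, `"failed"`, `"blocked"` are illustrative values, not Tack's wire format):

```python
# Derives the objective-level outcome from per-stream results, following
# the table above. Stream result strings are illustrative.
def objective_state(streams: list[str]) -> str:
    if any(s == "blocked" for s in streams):
        return "blocked"    # waiting for human guidance
    if all(s == "merged" for s in streams):
        return "completed"
    if any(s == "merged" for s in streams):
        return "partial"    # merged streams are preserved
    return "failed"

print(objective_state(["merged", "merged"]))   # completed
print(objective_state(["merged", "failed"]))   # partial
print(objective_state(["failed", "blocked"]))  # blocked
```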