
Recovery & Retries

How Tack handles failures, retries, and human-guided recovery.

Failures happen. Agents produce code that does not compile, tests fail, reviewers reject implementations, and sandboxes crash. Tack's recovery system handles all of these automatically — and when it cannot, it pauses for your input instead of silently failing or looping forever.

How Recovery Works

When a step fails, Tack records a recovery attempt in a durable ledger. Each attempt captures what went wrong, what action to take, and whether it succeeded. This ledger survives daemon restarts.

The recovery flow:

1. Step fails -> record failure in recovery ledger
2. Classify failure kind -> apply retry policy
3. Retries remain? yes -> automatic recovery: rerun step or builder
4. Recovered? yes -> continue workflow; no -> apply retry policy again
5. Retries remain? no -> block for human guidance -> operator sends guidance
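The loop can be sketched in a few lines of illustrative Python. The names here (recover, run_step, the ledger shape) are assumptions for the sketch, not Tack's actual API:

```python
# Illustrative sketch of the recovery flow; not Tack's real internals.
def recover(run_step, step, max_attempts, ledger):
    """Retry a failing step, recording each attempt in a durable ledger."""
    for attempt in range(1, max_attempts + 1):
        ok, error = run_step(step)
        if ok:
            return "recovered"
        # Record what went wrong before the next attempt.
        ledger.append({"step": step, "attempt": attempt, "error": error})
    # Retry budget exhausted: pause for guidance instead of looping forever.
    return "blocked_for_human"
```

Because each attempt is appended to the ledger before the next retry, the full history survives even if the process dies mid-recovery.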

Failure Kinds and Actions

Not all failures are treated the same. Tack categorizes failures and picks the appropriate response:

| Failure Kind | What Happened | Recovery Action |
| --- | --- | --- |
| quality_gate_failure | Lint, typecheck, or tests failed | Rerun the builder with gate output attached |
| review_rejection | Reviewer agent found problems | Rerun the builder with reviewer feedback |
| agent_runtime_transient | Agent process crashed or timed out | Retry the same step |
| sandbox_failure | Sandbox environment issue | Retry the same step |
| provider_rate_limit | LLM provider rate limit hit | Retry the same step (with backoff) |
| merge_conflict | Git merge conflicts during integration | Retry the merge |
| post_merge_gate_failure | Quality gates failed after merging | Restart the stream |
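As a sketch, the classification amounts to a lookup from failure kind to action. The action names and the ask_human fallback below are illustrative, not Tack's real identifiers:

```python
# Hypothetical lookup mirroring the table above; not Tack's data structures.
RECOVERY_ACTIONS = {
    "quality_gate_failure": "rerun_builder_with_gate_output",
    "review_rejection": "rerun_builder_with_reviewer_feedback",
    "agent_runtime_transient": "retry_step",
    "sandbox_failure": "retry_step",
    "provider_rate_limit": "retry_step_with_backoff",
    "merge_conflict": "retry_merge",
    "post_merge_gate_failure": "restart_stream",
}

def classify(kind: str) -> str:
    """Pick the recovery action for a failure kind.
    Unknown kinds fall back to asking a human (an assumption of this sketch)."""
    return RECOVERY_ACTIONS.get(kind, "ask_human")
```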

Retry Profiles

Blueprints define a retry profile that controls how aggressive automatic recovery is:

| Profile | Max Attempts | Behavior |
| --- | --- | --- |
| strict | 2 | Fails fast, escalates quickly |
| balanced | 4 | Default for top-level workflows |
| self_healing | 6 | Default for build-review loops, keeps trying |

The shipped build-review blueprint uses self_healing because gate failures and reviewer rejections are expected and fixable. The top-level standard workflow uses balanced because failures at that level are more serious.
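As a hypothetical sketch, a profile's retry budget can pair with exponential backoff for provider_rate_limit failures. The PROFILES map mirrors the table above; the delay schedule is an assumption, not Tack's actual timing:

```python
# Attempt budgets from the table above.
PROFILES = {"strict": 2, "balanced": 4, "self_healing": 6}

def backoff_delays(profile: str, base: float = 1.0) -> list[float]:
    """Exponential backoff: 1s, 2s, 4s, ... one delay per allowed attempt.
    The schedule is illustrative; Tack's real backoff may differ."""
    attempts = PROFILES[profile]
    return [base * 2 ** i for i in range(attempts)]
```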

You can override the profile in custom blueprints:

```yaml
id: build-review
retry:
  profile: strict
```

Or set per-step overrides:

```yaml
- id: lint
  type: deterministic
  action: run_quality_gates
  retry:
    max_attempts: 3
    on_exhausted: ask_human
  on_fail: build
  max_fix_iterations: 5
```

What You See

In tack watch

Recovery events appear in the live stream:

18:16:38 [stream-1] quality_gate_failure -> rerun_previous_agent (1/4)
18:16:45 [stream-1] recovery attempt: builder rerun with gate context
18:16:52 [stream-1] lint passed ✓

When retries exhaust:

18:16:52 [stream-1] recovery blocked: retry budget exhausted; awaiting human guidance

In tack mail <agent-name>

Blocked recovery sends an escalation to the relevant agent mailbox:

tack mail builder-stream-1

This shows the failure details, the recovery attempts that were tried, and space for your guidance.

Sending Guidance

When a run is blocked, you can send guidance to unblock it:

tack mail send builder-stream-1 "Guidance" "Check the import paths — the module moved to src/lib/" --objective <id>

Tack attaches your guidance to the next recovery attempt and resumes the run.

Fix Loops vs Recovery

There are two related but distinct mechanisms:

Fix loops are blueprint-level cycles. When a quality gate fails, the on_fail: build directive sends the agent back to the builder step with the failure context. The max_fix_iterations field caps how many times this cycle runs.

Recovery is the durable ledger that records every attempt and can escalate to humans when automatic fixes are not working. Recovery wraps around fix loops — if the fix loop exhausts its iterations, recovery takes over and may block for human input.

In practice:

  1. Gate fails → fix loop sends agent back to builder (attempt 1)
  2. Gate fails again → fix loop sends agent back to builder (attempt 2)
  3. Fix loop exhausted → recovery ledger records the failure
  4. Recovery may retry the whole step or block for human guidance
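The nesting can be sketched as two loops (assumed names, not Tack's internals): the inner fix loop reruns the builder until gates pass or iterations run out, and the outer recovery loop retries the whole fix loop before blocking.

```python
# Illustrative only: how recovery wraps a fix loop.
def run_fix_loop(build_and_gate, max_fix_iterations: int) -> bool:
    """Inner fix loop: rerun the builder until gates pass or the cap hits."""
    for _ in range(max_fix_iterations):
        if build_and_gate():
            return True
    return False

def run_with_recovery(build_and_gate, max_fix_iterations: int,
                      max_attempts: int, ledger: list) -> str:
    """Outer recovery: retry the whole fix loop, then block for a human."""
    for attempt in range(1, max_attempts + 1):
        if run_fix_loop(build_and_gate, max_fix_iterations):
            return "continue_workflow"
        ledger.append({"attempt": attempt, "reason": "fix_loop_exhausted"})
    return "blocked_for_human"
```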

Exhaustion Modes

When all automatic recovery is exhausted, the blueprint defines what happens next:

| Mode | Behavior |
| --- | --- |
| ask_human | Run blocks. You receive an escalation and can send guidance to resume. |
| escalate | Run is marked as needing attention but continues trying other streams. |
| fail | Stream is marked as permanently failed. |
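A minimal dispatch sketch of these modes (the status strings and stream shape are assumptions of this sketch):

```python
def on_exhausted(mode: str, stream: dict) -> dict:
    """Apply the blueprint's exhaustion mode. Illustrative statuses only."""
    if mode == "ask_human":
        stream["status"] = "blocked"          # waits for operator guidance
    elif mode == "escalate":
        stream["status"] = "needs_attention"  # flagged; other streams continue
    elif mode == "fail":
        stream["status"] = "failed"           # permanently failed
    else:
        raise ValueError(f"unknown exhaustion mode: {mode}")
    return stream
```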

Recovery After Daemon Restart

Recovery state is persisted to the database, not held in memory. If the daemon restarts:

  1. Active runs are reconciled against their last known state
  2. In-flight agent executions are requeued
  3. Blocked runs remain blocked with their full context intact
  4. You can resume by sending guidance via tack mail send

No recovery context is lost on restart.
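The reconciliation steps above can be sketched as a single pass over persisted runs (assumed status names, not Tack's schema):

```python
def reconcile(runs: list[dict]) -> list[dict]:
    """On daemon restart, rebuild working state from the persisted ledger.
    Status names here are illustrative assumptions."""
    for run in runs:
        if run["status"] == "executing":
            run["status"] = "queued"  # in-flight agent executions are requeued
        # "blocked" and terminal runs are left untouched: their context is
        # already in the database, so nothing is lost.
    return runs
```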

Objective-Level Outcomes

When recovery plays out across all streams, the objective ends in one of these states:

| State | Meaning |
| --- | --- |
| completed | All streams succeeded and merged |
| partial | Some streams succeeded, some failed. Merged streams are preserved. |
| failed | All streams failed or the run could not proceed |
| blocked | One or more streams are blocked waiting for human guidance |

For partial objectives, preserve the merged work and create a follow-up objective for the remaining work. For blocked objectives, send guidance to unblock.
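One way to picture the aggregation is a sketch with assumed stream-state names (blocked takes precedence, since a blocked stream still needs your input):

```python
def objective_state(stream_states: list[str]) -> str:
    """Derive the objective outcome from its streams' final states.
    State names and precedence are assumptions inferred from the table above."""
    if any(s == "blocked" for s in stream_states):
        return "blocked"
    if all(s == "merged" for s in stream_states):
        return "completed"
    if any(s == "merged" for s in stream_states):
        return "partial"
    return "failed"
```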
