Query Loop
The core agentic execution cycle in query.ts: how messages flow from user input through the API to tool execution and back.
- The main loop lives in query.ts (~1700 lines) and runs 7 phases every turn: context projection, compaction check, API streaming, error recovery, tool execution, attachment injection, then the continuation decision.
- Tools start running BEFORE the model finishes — StreamingToolExecutor queues tool_use blocks as they arrive from the stream, cutting total latency.
- Recovery is a 4-step cascade: drain collapses → reactive compact → escalate the output-token limit 8K→64K → inject 'continue' (max 3×). The loop never gives up easily.
- Stop hooks run even after the model appears done — they can force the loop to continue, giving external processes a chance to inject more work.
Loop State Machine
Read this first if you want to understand what the loop remembers and why retries behave differently across turns.
type LoopState = {
messages: Message[]
toolUseContext: ToolUseContext
autoCompactTracking?: AutoCompactTrackingState
maxOutputTokensRecoveryCount: number // max 3 retries
hasAttemptedReactiveCompact: boolean
maxOutputTokensOverride?: number // escalate 8K → 64K
pendingToolUseSummary?: Promise<ToolUseSummaryMessage | null>
stopHookActive?: boolean
turnCount: number
transition?: Continue // why previous iteration continued
}
The important detail is that query() carries recovery state between iterations. It doesn't just stream once and stop: it remembers whether auto-compact already fired, whether reactive compact was attempted, whether output tokens were escalated, and why the previous iteration continued.
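A minimal sketch of that shape follows; the loop body is stubbed out, and only the LoopState fields mirror the type above:
// Illustrative sketch, not the real query.ts: the phase bodies are stand-ins,
// but the state fields are the ones that survive between iterations.
type Message = { role: string; content: string };
type ExitReason = 'completed' | 'prompt_too_long' | 'max_output_tokens';

interface LoopState {
  messages: Message[];
  maxOutputTokensRecoveryCount: number; // caps 'continue' injections at 3
  hasAttemptedReactiveCompact: boolean;
  maxOutputTokensOverride?: number;     // set once when escalating 8K -> 64K
  turnCount: number;
}

async function* queryLoop(state: LoopState): AsyncGenerator<Message, ExitReason> {
  for (;;) {
    state.turnCount += 1;

    // Phases 1-4: project context, compact if needed, stream the model, and
    // recover on overflow. Recovery mutates this same `state` object, so the
    // next iteration already knows whether reactive compact was attempted and
    // whether the output limit was escalated.
    const assistant: Message = { role: 'assistant', content: '...' };
    yield assistant;
    state.messages.push(assistant);

    // Phases 5-7: execute tools, inject attachments, then decide whether to
    // loop again. The stub below stands in for "the model used tools".
    const needsFollowUp = state.turnCount < 2;
    if (!needsFollowUp) return 'completed';
  }
}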
Loop Iteration Flow
This is the operational walkthrough of a single turn, from message projection to exit or continuation.
Context Projection — trim what the model sees
Extract messages after the compact boundary. Apply tool-result budgets, history snipping, microcompact, and context collapse. Goal: fit within token limit before hitting the API.
Auto-Compaction Check — summarize if still too big
If context still exceeds threshold (model_ctx − max_output − 13K buffer), trigger async autocompact: fork a summarizer, replace messages with compact version.
API Streaming — model generates + tools start early
Stream text/tool_use/thinking blocks from the API. StreamingToolExecutor starts tools as blocks arrive — tools run in parallel with continued model streaming, cutting latency.
Error Recovery — 4 escalating strategies
On overflow: (1) Drain staged collapses. (2) Reactive compact — full summary. (3) Escalate output limit 8K → 64K. (4) Inject 'continue', max 3×. Each tried before giving up.
Tool Execution — reads parallel, writes serial
Partition tool calls by concurrency safety. Read-only: up to 10 parallel. Write tools: serial with context modifiers between batches. Results yielded as messages. A code-level sketch of this split follows the walkthrough.
Attachment Processing — inject queued context
Append memory prefetch results, skill discovery output, and queued task notifications before the next API call. Keeps the model informed without slowing the user turn.
Continuation Decision — exit or loop again
No tool use → check for natural completion. Run stop hooks (may force continuation). Check token budget. Return a terminal state or restart the loop.
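Here is the read/write split from the tool-execution phase as a minimal sketch. It assumes each tool call exposes an isConcurrencySafe flag and a run() method; the real batching in query.ts (context modifiers between write batches, result ordering) is more involved:
// Sketch of the partition only; the read/write distinction and the cap of 10
// come from the walkthrough above, the rest is illustrative.
interface ToolCall {
  name: string;
  isConcurrencySafe: boolean;          // true for read-only tools
  run: () => Promise<string>;
}

const MAX_PARALLEL_READS = 10;

async function executeTools(calls: ToolCall[]): Promise<string[]> {
  const reads = calls.filter(c => c.isConcurrencySafe);
  const writes = calls.filter(c => !c.isConcurrencySafe);
  const results: string[] = [];

  // Read-only tools run in parallel, at most 10 at a time.
  for (let i = 0; i < reads.length; i += MAX_PARALLEL_READS) {
    const batch = reads.slice(i, i + MAX_PARALLEL_READS);
    results.push(...(await Promise.all(batch.map(c => c.run()))));
  }

  // Write tools run strictly serially, so each sees the previous one's effects.
  for (const call of writes) {
    results.push(await call.run());
  }
  return results;
}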
Streaming Tool Execution
This section explains the main latency trick: tools can start before the model has fully finished speaking.
Timeline — tools start before model finishes
The StreamingToolExecutor is a key innovation — tools start executing while the model is still generating tokens. This significantly reduces end-to-end latency.
// StreamingToolExecutor.ts (226 lines)
class StreamingToolExecutor {
// Queue management
addTool(block, assistantMessage) // Enqueue when tool_use block arrives
processQueue() // Start tools respecting concurrency
getCompletedResults() // Yield finished results immediately
// Concurrency enforcement
// Non-concurrent tools: wait for exclusive access
// Concurrent-safe tools: run in parallel with other safe tools
// Fallback handling
discard() // Discard pending on streaming fallback
// Generates synthetic error results for in-flight tools
}
Error Recovery Cascade
When things go wrong, the loop tries 4 recovery strategies in order, each more aggressive than the last. Think of it as a funnel: gentle first, nuclear last. A code-level sketch follows the list.
1. Drain staged context collapses
2. Reactive compact: full conversation summary
3. Escalate the output limit: 8K → 64K, one shot
4. Inject 'continue', max 3×
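In code, the cascade is an ordered series of attempts gated by the recovery flags in LoopState. A hedged sketch with hypothetical helper names; only the flags, the 8K→64K escalation, and the cap of 3 come from the sections above:
// Sketch of the ordering only; drainStagedCollapses/reactiveCompact are
// hypothetical stand-ins for the real recovery helpers.
interface RecoveryState {
  hasAttemptedReactiveCompact: boolean;
  maxOutputTokensOverride?: number;     // undefined until escalated
  maxOutputTokensRecoveryCount: number; // 'continue' injections so far
}

async function recover(state: RecoveryState): Promise<'retry' | 'give_up'> {
  // 1. Gentle: drain context collapses that were staged but not yet applied.
  if (await drainStagedCollapses()) return 'retry';

  // 2. Reactive compact: full conversation summary, attempted at most once.
  if (!state.hasAttemptedReactiveCompact) {
    state.hasAttemptedReactiveCompact = true;
    await reactiveCompact();
    return 'retry';
  }

  // 3. Escalate the output-token limit from 8K to 64K, one shot.
  if (state.maxOutputTokensOverride === undefined) {
    state.maxOutputTokensOverride = 64_000;
    return 'retry';
  }

  // 4. Nuclear: inject a 'continue' user message, at most three times.
  if (state.maxOutputTokensRecoveryCount < 3) {
    state.maxOutputTokensRecoveryCount += 1;
    return 'retry';
  }
  return 'give_up';
}

// Hypothetical helpers so the sketch is self-contained.
async function drainStagedCollapses(): Promise<boolean> { return false; }
async function reactiveCompact(): Promise<void> {}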
Token Budget Continuation
Claude Code now reasons about turn-level budget, not only hard model limits.
Newer Claude Code versions don't only stop on model limits. They also track a per-turn token budget and can proactively inject a continuation nudge before the assistant appears done, then stop once progress shows diminishing returns.
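Before the outline of the actual file, here is a minimal sketch of how those rules might compose. The constant and field names follow the outline; the function body is a reconstruction, not the real implementation:
// Sketch only: returns true when the loop should inject a continuation nudge.
const COMPLETION_THRESHOLD = 0.9;   // stop nudging once 90% of the budget is spent
const DIMINISHING_THRESHOLD = 500;  // minimum token gain that still counts as progress

interface BudgetTracker {
  continuationCount: number;
  lastDeltaTokens: number;
  lastGlobalTurnTokens: number;
}

function checkTokenBudget(
  tracker: BudgetTracker,
  agentId: string | undefined,      // set for subagents, which are never nudged
  budget: number | undefined,
  globalTurnTokens: number,
): boolean {
  if (agentId !== undefined) return false;   // main thread only
  if (!budget || budget <= 0) return false;  // a budget must exist

  const delta = globalTurnTokens - tracker.lastGlobalTurnTokens;
  tracker.lastDeltaTokens = delta;
  tracker.lastGlobalTurnTokens = globalTurnTokens;

  const nearlyDone = globalTurnTokens >= budget * COMPLETION_THRESHOLD;
  const diminishing = tracker.continuationCount > 0 && delta < DIMINISHING_THRESHOLD;
  if (nearlyDone || diminishing) return false;

  tracker.continuationCount += 1;
  return true;
}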
// query/tokenBudget.ts
COMPLETION_THRESHOLD = 0.9
DIMINISHING_THRESHOLD = 500
checkTokenBudget(tracker, agentId, budget, globalTurnTokens)
// continue when:
// - main thread only (no subagent)
// - budget exists and > 0
// - turn is still below 90% of budget
// - token gain is still meaningful
// stop when:
// - diminishing returns detected
// - or a prior continuation already happened and the turn is now wrapping up
// tracker remembers:
continuationCount
lastDeltaTokens
lastGlobalTurnTokens
startedAt
Stop Hooks & Background Work
Use this section to understand why 'done' in the loop is not the true end of a turn.
The loop's 'done' path is not really the end. handleStopHooks() can still prevent continuation, summarize hook output, snapshot cache-safe params for fork reuse, kick off prompt suggestions, extract memories, and run auto-dream style background maintenance.
// query/stopHooks.ts
handleStopHooks(...)
→ saveCacheSafeParams(createCacheSafeParams(stopHookContext))
→ executePromptSuggestion(stopHookContext)
→ executeExtractMemories(stopHookContext)
→ executeAutoDream(stopHookContext)
→ executeStopHooks(permissionMode, signal, ...)
→ emit hook progress / attachment messages
→ optionally prevent continuation
Loop Exit Conditions
8 ways the loop can end. Only 'completed' is truly happy-path.
The other seven cover user aborts, context overflow, token limits, and hook rejection.
- completed: natural end of response (no more tool calls, no forced stop)
- prompt_too_long: unrecoverable context overflow
- max_output_tokens: output limit exhausted after recovery
- aborted_streaming: user interrupt during the model call
- aborted_tools: user interrupt during tool execution
- stop_hook_prevented: hook rejected continuation
- blocking_limit: hard context limit hit
- token_budget_completed: token budget exhausted
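These reasons map naturally onto a string-union type. A plausible shape inferred from the list above, not confirmed against query.ts:
// Inferred from the exit-condition list; the literal strings are the reasons above.
type ExitReason =
  | 'completed'               // natural end of response
  | 'prompt_too_long'         // unrecoverable context overflow
  | 'max_output_tokens'       // output limit exhausted after recovery
  | 'aborted_streaming'       // user interrupt during the model call
  | 'aborted_tools'           // user interrupt during tool execution
  | 'stop_hook_prevented'     // hook rejected continuation
  | 'blocking_limit'          // hard context limit hit
  | 'token_budget_completed'; // token budget exhausted

// This is the value returned at the end of the message flow example below,
// e.g. { reason: 'completed' }.
interface LoopResult {
  reason: ExitReason;
}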
Token Budget Math
How the 200K context window is actually divided.
Claude's 200K context window sounds vast — but compaction triggers well before you reach it. Here's the real math:
200,000 − 8,000 − 13,000 ≈ 179,000 usable tokens. Auto-compaction triggers at context_size > (model_limit − max_output − 13K buffer). With Claude 3.5 Sonnet (200K context, 8K output), that fires around 179K tokens, leaving roughly 90% of the window usable before compaction.
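The same arithmetic in code, with illustrative constant names; the values come from the rule above:
// Auto-compaction threshold: model_limit - max_output - 13K buffer.
const MODEL_CONTEXT_LIMIT = 200_000;
const MAX_OUTPUT_TOKENS = 8_000;
const COMPACT_BUFFER = 13_000;

const autoCompactThreshold =
  MODEL_CONTEXT_LIMIT - MAX_OUTPUT_TOKENS - COMPACT_BUFFER; // 179,000 tokens

function shouldAutoCompact(contextTokens: number): boolean {
  return contextTokens > autoCompactThreshold;
}

// 179,000 / 200,000 ≈ 90% of the window is usable before compaction fires.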
Message Flow Example
User: "write a hello.py file"
↓
QueryEngine.submitMessage(prompt)
↓
fetchSystemPromptParts() → [default prompt + 50 tools]
↓
processUserInput() → [user message + attachments]
↓
yield buildSystemInitMessage()
↓
query() loop iteration 1:
─ prepend user context (cwd, platform, git status)
─ call queryModelWithStreaming()
─ stream: "I'll create a Python file..."
─ stream: tool_use { name: "Write", input: { file_path, content } }
├─ addTool() to StreamingToolExecutor
└─ model continues streaming...
─ tool completes → tool_result message
─ yield tool_result
↓
─ getAttachmentMessages() → file change notification
─ yield attachment message
↓
─ needsFollowUp = false (no more tool calls)
─ stop hooks pass
─ return { reason: 'completed' }
↓
Session ends, messages persisted to transcript.jsonl