Query Loop
The core agentic execution cycle in query.ts: how messages flow from user input through the API to tool execution and back.
- The main loop lives in query.ts (~1700 lines) and runs 7 phases every turn: context projection, compaction check, API streaming, error recovery, tool execution, attachment injection, then the continuation decision.
- Tools start running BEFORE the model finishes — StreamingToolExecutor queues tool_use blocks as they arrive from the stream, cutting total latency.
- Recovery is a 4-step cascade: drain collapses → reactive compact → escalate the output-token limit 8K→64K → inject 'continue' (max 3×). The loop never gives up easily.
- Stop hooks run even after the model appears done — they can force the loop to continue, giving external processes a chance to inject more work.
Loop State Machine
Read this first if you want to understand what the loop remembers and why retries behave differently across turns.
type LoopState = {
messages: Message[]
toolUseContext: ToolUseContext
autoCompactTracking?: AutoCompactTrackingState
maxOutputTokensRecoveryCount: number // max 3 retries
hasAttemptedReactiveCompact: boolean
maxOutputTokensOverride?: number // escalate 8K → 64K
pendingToolUseSummary?: Promise<ToolUseSummaryMessage | null>
stopHookActive?: boolean
turnCount: number
transition?: Continue // why previous iteration continued
}
The important detail is that query() carries recovery state between iterations. It doesn't just stream once and stop: it remembers whether auto-compact already fired, whether reactive compact was attempted, whether output tokens were escalated, and why the previous iteration continued.
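A minimal sketch of that shape follows; the loop body is stubbed out, and only the LoopState fields mirror the type above:
// Illustrative sketch, not the real query.ts: the phase bodies are stand-ins,
// but the state fields are the ones that survive between iterations.
type Message = { role: string; content: string };
type ExitReason = 'completed' | 'prompt_too_long' | 'max_output_tokens';

interface LoopState {
  messages: Message[];
  maxOutputTokensRecoveryCount: number; // caps 'continue' injections at 3
  hasAttemptedReactiveCompact: boolean;
  maxOutputTokensOverride?: number;     // set once when escalating 8K -> 64K
  turnCount: number;
}

async function* queryLoop(state: LoopState): AsyncGenerator<Message, ExitReason> {
  for (;;) {
    state.turnCount += 1;

    // Phases 1-4: project context, compact if needed, stream the model, and
    // recover on overflow. Recovery mutates this same `state` object, so the
    // next iteration already knows whether reactive compact was attempted and
    // whether the output limit was escalated.
    const assistant: Message = { role: 'assistant', content: '...' };
    yield assistant;
    state.messages.push(assistant);

    // Phases 5-7: execute tools, inject attachments, then decide whether to
    // loop again. The stub below stands in for "the model used tools".
    const needsFollowUp = state.turnCount < 2;
    if (!needsFollowUp) return 'completed';
  }
}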
Loop Iteration Flow
This is the operational walkthrough of a single turn, from message projection to exit or continuation.
Context Projection — trim what the model sees
Extract messages after the compact boundary. Apply tool-result budgets, history snipping, microcompact, and context collapse. Goal: fit within token limit before hitting the API.
Auto-Compaction Check — summarize if still too big
If context still exceeds threshold (model_ctx − max_output − 13K buffer), trigger async autocompact: fork a summarizer, replace messages with compact version.
API Streaming — model generates + tools start early
Stream text/tool_use/thinking blocks from the API. StreamingToolExecutor starts tools as blocks arrive — tools run in parallel with continued model streaming, cutting latency.
Error Recovery — 4 escalating strategies
On overflow: (1) Drain staged collapses. (2) Reactive compact — full summary. (3) Escalate output limit 8K → 64K. (4) Inject 'continue', max 3×. Each tried before giving up.
Tool Execution — reads parallel, writes serial
Partition tool calls by concurrency safety. Read-only: up to 10 parallel. Write tools: serial with context modifiers between batches. Results yielded as messages. A code-level sketch of this split follows the walkthrough.
Attachment Processing — inject queued context
Append memory prefetch results, skill discovery output, and queued task notifications before the next API call. Keeps the model informed without slowing the user turn.
Continuation Decision — exit or loop again
No tool use → check for natural completion. Run stop hooks (may force continuation). Check token budget. Return a terminal state or restart the loop.
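Here is the read/write split from the tool-execution phase as a minimal sketch. It assumes each tool call exposes an isConcurrencySafe flag and a run() method; the real batching in query.ts (context modifiers between write batches, result ordering) is more involved:
// Sketch of the partition only; the read/write distinction and the cap of 10
// come from the walkthrough above, the rest is illustrative.
interface ToolCall {
  name: string;
  isConcurrencySafe: boolean;          // true for read-only tools
  run: () => Promise<string>;
}

const MAX_PARALLEL_READS = 10;

async function executeTools(calls: ToolCall[]): Promise<string[]> {
  const reads = calls.filter(c => c.isConcurrencySafe);
  const writes = calls.filter(c => !c.isConcurrencySafe);
  const results: string[] = [];

  // Read-only tools run in parallel, at most 10 at a time.
  for (let i = 0; i < reads.length; i += MAX_PARALLEL_READS) {
    const batch = reads.slice(i, i + MAX_PARALLEL_READS);
    results.push(...(await Promise.all(batch.map(c => c.run()))));
  }

  // Write tools run strictly serially, so each sees the previous one's effects.
  for (const call of writes) {
    results.push(await call.run());
  }
  return results;
}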
Streaming Tool Execution
This section explains the main latency trick: tools can start before the model has fully finished speaking.
Timeline — tools start before model finishes
The StreamingToolExecutor is a key innovation — tools start executing while the model is still generating tokens. This significantly reduces end-to-end latency.
// StreamingToolExecutor.ts (226 lines)
class StreamingToolExecutor {
// Queue management
addTool(block, assistantMessage) // Enqueue when tool_use block arrives
processQueue() // Start tools respecting concurrency
getCompletedResults() // Yield finished results immediately
// Concurrency enforcement
// Non-concurrent tools: wait for exclusive access
// Concurrent-safe tools: run in parallel with other safe tools
// Fallback handling
discard() // Discard pending on streaming fallback
// Generates synthetic error results for in-flight tools
}
Error Recovery Cascade
When things go wrong, the loop tries 4 recovery strategies in order, each more aggressive than the last. Think of it as a funnel: gentle first, nuclear last. A code-level sketch follows the list.
1. Drain staged context collapses
2. Reactive compact: full conversation summary
3. Escalate the output limit: 8K → 64K, one shot
4. Inject 'continue', max 3×
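In code, the cascade is an ordered series of attempts gated by the recovery flags in LoopState. A hedged sketch with hypothetical helper names; only the flags, the 8K→64K escalation, and the cap of 3 come from the sections above:
// Sketch of the ordering only; drainStagedCollapses/reactiveCompact are
// hypothetical stand-ins for the real recovery helpers.
interface RecoveryState {
  hasAttemptedReactiveCompact: boolean;
  maxOutputTokensOverride?: number;     // undefined until escalated
  maxOutputTokensRecoveryCount: number; // 'continue' injections so far
}

async function recover(state: RecoveryState): Promise<'retry' | 'give_up'> {
  // 1. Gentle: drain context collapses that were staged but not yet applied.
  if (await drainStagedCollapses()) return 'retry';

  // 2. Reactive compact: full conversation summary, attempted at most once.
  if (!state.hasAttemptedReactiveCompact) {
    state.hasAttemptedReactiveCompact = true;
    await reactiveCompact();
    return 'retry';
  }

  // 3. Escalate the output-token limit from 8K to 64K, one shot.
  if (state.maxOutputTokensOverride === undefined) {
    state.maxOutputTokensOverride = 64_000;
    return 'retry';
  }

  // 4. Nuclear: inject a 'continue' user message, at most three times.
  if (state.maxOutputTokensRecoveryCount < 3) {
    state.maxOutputTokensRecoveryCount += 1;
    return 'retry';
  }
  return 'give_up';
}

// Hypothetical helpers so the sketch is self-contained.
async function drainStagedCollapses(): Promise<boolean> { return false; }
async function reactiveCompact(): Promise<void> {}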
Token Budget Continuation
Claude Code now reasons about turn-level budget, not only hard model limits.
Newer Claude Code versions don't only stop on model limits. They also track a per-turn token budget and can proactively inject a continuation nudge before the assistant appears done, then stop once progress shows diminishing returns.
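Before the outline of the actual file, here is a minimal sketch of how those rules might compose. The constant and field names follow the outline; the function body is a reconstruction, not the real implementation:
// Sketch only: returns true when the loop should inject a continuation nudge.
const COMPLETION_THRESHOLD = 0.9;   // stop nudging once 90% of the budget is spent
const DIMINISHING_THRESHOLD = 500;  // minimum token gain that still counts as progress

interface BudgetTracker {
  continuationCount: number;
  lastDeltaTokens: number;
  lastGlobalTurnTokens: number;
}

function checkTokenBudget(
  tracker: BudgetTracker,
  agentId: string | undefined,      // set for subagents, which are never nudged
  budget: number | undefined,
  globalTurnTokens: number,
): boolean {
  if (agentId !== undefined) return false;   // main thread only
  if (!budget || budget <= 0) return false;  // a budget must exist

  const delta = globalTurnTokens - tracker.lastGlobalTurnTokens;
  tracker.lastDeltaTokens = delta;
  tracker.lastGlobalTurnTokens = globalTurnTokens;

  const nearlyDone = globalTurnTokens >= budget * COMPLETION_THRESHOLD;
  const diminishing = tracker.continuationCount > 0 && delta < DIMINISHING_THRESHOLD;
  if (nearlyDone || diminishing) return false;

  tracker.continuationCount += 1;
  return true;
}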
// query/tokenBudget.ts
COMPLETION_THRESHOLD = 0.9
DIMINISHING_THRESHOLD = 500
checkTokenBudget(tracker, agentId, budget, globalTurnTokens)
// continue when:
// - main thread only (no subagent)
// - budget exists and > 0
// - turn is still below 90% of budget
// - token gain is still meaningful
// stop when:
// - diminishing returns detected
// - or a prior continuation already happened and the turn is now wrapping up
// tracker remembers:
continuationCount
lastDeltaTokens
lastGlobalTurnTokens
startedAt
Stop Hooks & Background Work
Use this section to understand why 'done' in the loop is not the true end of a turn.
The loop's 'done' path is not really the end. handleStopHooks() can still prevent continuation, summarize hook output, snapshot cache-safe params for fork reuse, kick off prompt suggestions, extract memories, and run auto-dream style background maintenance.
// query/stopHooks.ts
handleStopHooks(...)
→ saveCacheSafeParams(createCacheSafeParams(stopHookContext))
→ executePromptSuggestion(stopHookContext)
→ executeExtractMemories(stopHookContext)
→ executeAutoDream(stopHookContext)
→ executeStopHooks(permissionMode, signal, ...)
→ emit hook progress / attachment messages
→ optionally prevent continuation
Loop Exit Conditions
8 ways the loop can end. Only 'completed' is truly happy-path.
The other seven cover user aborts, context overflow, token limits, and hook rejection.
- completed: natural end of response (no more tool calls, no forced stop)
- prompt_too_long: unrecoverable context overflow
- max_output_tokens: output limit exhausted after recovery
- aborted_streaming: user interrupt during the model call
- aborted_tools: user interrupt during tool execution
- stop_hook_prevented: hook rejected continuation
- blocking_limit: hard context limit hit
- token_budget_completed: token budget exhausted
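These reasons map naturally onto a string-union type. A plausible shape inferred from the list above, not confirmed against query.ts:
// Inferred from the exit-condition list; the literal strings are the reasons above.
type ExitReason =
  | 'completed'               // natural end of response
  | 'prompt_too_long'         // unrecoverable context overflow
  | 'max_output_tokens'       // output limit exhausted after recovery
  | 'aborted_streaming'       // user interrupt during the model call
  | 'aborted_tools'           // user interrupt during tool execution
  | 'stop_hook_prevented'     // hook rejected continuation
  | 'blocking_limit'          // hard context limit hit
  | 'token_budget_completed'; // token budget exhausted

// This is the value returned at the end of the message flow example below,
// e.g. { reason: 'completed' }.
interface LoopResult {
  reason: ExitReason;
}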
Token Budget Math
How the 200K context window is actually divided.
Claude's 200K context window sounds vast — but compaction triggers well before you reach it. Here's the real math:
200,000 − 8,000 − 13,000 ≈ 179,000 usable tokens. Auto-compaction triggers at context_size > (model_limit − max_output − 13K buffer). With Claude 3.5 Sonnet (200K context, 8K output), that fires around 179K tokens, leaving roughly 90% of the window usable before compaction.
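The same arithmetic in code, with illustrative constant names; the values come from the rule above:
// Auto-compaction threshold: model_limit - max_output - 13K buffer.
const MODEL_CONTEXT_LIMIT = 200_000;
const MAX_OUTPUT_TOKENS = 8_000;
const COMPACT_BUFFER = 13_000;

const autoCompactThreshold =
  MODEL_CONTEXT_LIMIT - MAX_OUTPUT_TOKENS - COMPACT_BUFFER; // 179,000 tokens

function shouldAutoCompact(contextTokens: number): boolean {
  return contextTokens > autoCompactThreshold;
}

// 179,000 / 200,000 ≈ 90% of the window is usable before compaction fires.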
Message Flow Example
User: "write a hello.py file"
↓
QueryEngine.submitMessage(prompt)
↓
fetchSystemPromptParts() → [default prompt + 50 tools]
↓
processUserInput() → [user message + attachments]
↓
yield buildSystemInitMessage()
↓
query() loop iteration 1:
─ prepend user context (cwd, platform, git status)
─ call queryModelWithStreaming()
─ stream: "I'll create a Python file..."
─ stream: tool_use { name: "Write", input: { file_path, content } }
├─ addTool() to StreamingToolExecutor
└─ model continues streaming...
─ tool completes → tool_result message
─ yield tool_result
↓
─ getAttachmentMessages() → file change notification
─ yield attachment message
↓
─ needsFollowUp = false (no more tool calls)
─ stop hooks pass
─ return { reason: 'completed' }
↓
Session ends, messages persisted to transcript.jsonl