Query Loop

Claude Code v2.1.88 · query.ts
The core agentic execution cycle — how messages flow from user input through the API to tool execution and back. The main loop lives in query.ts (~1700 lines).

TL;DR — Key Takeaways
  • The main loop lives in query.ts (~1700 lines) and runs 7 phases every turn: project context, check compaction, stream API, error recovery, execute tools, inject attachments, then decide to continue or exit.
  • Tools start running BEFORE the model finishes — StreamingToolExecutor queues tool_use blocks as they arrive from the stream, cutting total latency.
  • Recovery is a 4-step cascade: drain collapses → reactive compact → escalate token limit 8K→64K → inject 'continue' (max 3×). The loop never gives up easily.
  • Stop hooks run even after the model appears done — they can force the loop to continue, giving external processes a chance to inject more work.
At a glance:
  • Main file: query.ts (~1700 lines)
  • Loop phases: 7 per turn
  • Exit states: 8 terminal conditions
  • Recovery steps: 4-step cascade

Loop State Machine

Read this first if you want to understand what the loop remembers and why retries behave differently across turns.

typescript
type LoopState = {
  messages: Message[]
  toolUseContext: ToolUseContext
  autoCompactTracking?: AutoCompactTrackingState
  maxOutputTokensRecoveryCount: number   // max 3 retries
  hasAttemptedReactiveCompact: boolean
  maxOutputTokensOverride?: number       // escalate 8K → 64K
  pendingToolUseSummary?: Promise<ToolUseSummaryMessage | null>
  stopHookActive?: boolean
  turnCount: number
  transition?: Continue                  // why previous iteration continued
}

The important detail is that query() carries recovery state between iterations. It doesn't just stream once and stop: it remembers whether auto-compact already fired, whether reactive compact was attempted, whether output tokens were escalated, and why the previous loop continued.

Loop Iteration Flow

This is the operational walkthrough of a single turn, from message projection to exit or continuation.

1. Context Projection — trim what the model sees

   Extract messages after the compact boundary. Apply tool-result budgets, history snipping, microcompact, and context collapse. Goal: fit within the token limit before hitting the API.

2. Auto-Compaction Check — summarize if still too big

   If context still exceeds the threshold (model_ctx − max_output − 13K buffer), trigger async auto-compact: fork a summarizer, replace messages with the compact version.

3. API Streaming — model generates + tools start early

   Stream text/tool_use/thinking blocks from the API. StreamingToolExecutor starts tools as blocks arrive — tools run in parallel with continued model streaming, cutting latency.

4. Error Recovery — 4 escalating strategies

   On overflow: (1) drain staged collapses, (2) reactive compact — full summary, (3) escalate output limit 8K → 64K, (4) inject 'continue', max 3×. Each is tried before giving up.

5. Tool Execution — reads parallel, writes serial

   Partition tool calls by concurrency safety. Read-only tools: up to 10 in parallel. Write tools: serial, with context modifiers between batches. Results are yielded as messages.

6. Attachment Processing — inject queued context

   Append memory prefetch results, skill discovery output, and queued task notifications before the next API call. Keeps the model informed without slowing the user turn.

7. Continuation Decision — exit or loop again

   No tool use → check for natural completion. Run stop hooks (they may force continuation). Check the token budget. Return a terminal state or restart the loop.
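
Put together, one iteration has roughly the shape below. This is a condensed sketch, not the real control flow: the helper names (projectContext, maybeAutoCompact, streamWithRecovery, getAttachments, runStopHooks) are illustrative stand-ins, since the actual logic is inlined across query.ts. Note that phases 3–5 interleave: tools execute while the stream is still open.

typescript
// Condensed sketch of one turn. Helper names are hypothetical stand-ins;
// Message and LoopState are as defined in the Loop State Machine section.
declare function projectContext(msgs: Message[]): Message[]
declare function exceedsThreshold(msgs: Message[]): boolean
declare function maybeAutoCompact(msgs: Message[], s: LoopState): Promise<Message[]>
declare function streamWithRecovery(msgs: Message[], s: LoopState):
  Promise<{ messages: Message[]; usedTools: boolean; aborted?: 'streaming' | 'tools' }>
declare function getAttachments(s: LoopState): Promise<Message[]>
declare function runStopHooks(s: LoopState): Promise<{ prevent: boolean; forceContinue: boolean }>

async function* queryLoop(state: LoopState) {
  while (true) {
    state.turnCount++
    // 1. Context projection: trim to what the model should see
    let context = projectContext(state.messages)
    // 2. Auto-compaction if the projection is still over the threshold
    if (exceedsThreshold(context)) context = await maybeAutoCompact(context, state)
    // 3 + 4 + 5. Stream the model; tools start as blocks arrive;
    //            overflow errors go through the recovery cascade
    const turn = await streamWithRecovery(context, state)
    for (const msg of turn.messages) { state.messages.push(msg); yield msg }
    if (turn.aborted) {
      return { reason: turn.aborted === 'tools' ? 'aborted_tools' : 'aborted_streaming' }
    }
    // 6. Inject queued attachments before the next API call
    for (const att of await getAttachments(state)) { state.messages.push(att); yield att }
    // 7. Continuation decision: exit only if no tools ran and hooks agree
    if (!turn.usedTools) {
      const hooks = await runStopHooks(state)
      if (hooks.prevent) return { reason: 'stop_hook_prevented' }
      if (!hooks.forceContinue) return { reason: 'completed' }
    }
  }
}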

Streaming Tool Execution

This section explains the main latency trick: tools can start before the model has fully finished speaking.

Timeline — tools start before model finishes

[Timeline: the model's streaming lane overlaps the execution lanes of Tool 1 and Tool 2 — tool execution begins while streaming continues]

The StreamingToolExecutor is a key innovation — tools start executing while the model is still generating tokens. This significantly reduces end-to-end latency.

typescript
// StreamingToolExecutor.ts (226 lines)

class StreamingToolExecutor {
  // Queue management
  addTool(block, assistantMessage)    // Enqueue when tool_use block arrives
  processQueue()                      // Start tools respecting concurrency
  getCompletedResults()               // Yield finished results immediately

  // Concurrency enforcement
  // Non-concurrent tools: wait for exclusive access
  // Concurrent-safe tools: run in parallel with other safe tools

  // Fallback handling
  discard()                           // Discard pending on streaming fallback
  // Generates synthetic error results for in-flight tools
}
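
The pattern is small enough to sketch end to end. Below is a self-contained illustration of the queue-as-you-stream idea — the ToolUse/ToolResult types and class shape are hypothetical, not the real 226-line implementation, which also handles fallbacks, cancellation, and result ordering:

typescript
// Illustrative sketch of queue-as-you-stream tool execution.
type ToolUse = { id: string; name: string; input: unknown; concurrencySafe: boolean }
type ToolResult = { toolUseId: string; output: string }

class StreamingExecutorSketch {
  private pending: Promise<unknown>[] = []
  private exclusive: Promise<unknown> = Promise.resolve()
  private results: ToolResult[] = []

  constructor(private run: (t: ToolUse) => Promise<ToolResult>) {}

  // Called as each tool_use block arrives from the stream — i.e. while
  // the model is still generating the rest of its turn.
  addTool(tool: ToolUse): void {
    const exec = () => this.run(tool).then(r => { this.results.push(r) })
    if (tool.concurrencySafe) {
      // safe tools run in parallel with each other, after any earlier exclusive tool
      this.pending.push(this.exclusive.then(exec))
    } else {
      // non-concurrent tools wait for everything started so far (exclusive access)
      this.exclusive = Promise.all(this.pending).then(exec)
      this.pending.push(this.exclusive)
    }
  }

  // Awaited after the stream ends; by then most tools have already finished.
  async drain(): Promise<ToolResult[]> {
    await Promise.all(this.pending)
    return this.results
  }
}

A caller would invoke addTool() from the stream handler as each tool_use block completes, then drain() after the final message event — by which point the concurrency-safe tools have typically already run to completion.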

Error Recovery Cascade

When things go wrong, the loop tries 4 recovery strategies in order — each more aggressive. Think of it as a funnel: gentle first, nuclear last.

1. Collapse Drain — drain staged context collapses
   ↓ if still failing
2. Reactive Compact — full conversation summary
   ↓ if still failing
3. Token Escalation — 8K → 64K one-shot
   ↓ if still failing
4. Multi-turn — inject 'continue', max 3×
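
The cascade maps directly onto the recovery fields in LoopState. The sketch below shows that shape — the helper functions are hypothetical, and the real checks live inside query.ts's streaming error handling:

typescript
// Illustrative shape of the 4-step cascade; helpers are hypothetical.
declare function drainStagedCollapses(s: LoopState): boolean   // true if anything was freed
declare function reactiveCompact(s: LoopState): Promise<void>
declare function injectContinueMessage(s: LoopState): void

async function recoverFromOverflow(state: LoopState): Promise<'retry' | 'give_up'> {
  // 1. Gentle: apply context collapses that were staged but not yet drained
  if (drainStagedCollapses(state)) return 'retry'
  // 2. Reactive compact: one full-conversation summary per turn
  if (!state.hasAttemptedReactiveCompact) {
    state.hasAttemptedReactiveCompact = true
    await reactiveCompact(state)
    return 'retry'
  }
  // 3. One-shot escalation of the output-token ceiling
  if (!state.maxOutputTokensOverride) {
    state.maxOutputTokensOverride = 64_000   // up from the 8K default
    return 'retry'
  }
  // 4. Nuclear: inject a 'continue' user message, at most 3 times
  if (state.maxOutputTokensRecoveryCount < 3) {
    state.maxOutputTokensRecoveryCount++
    injectContinueMessage(state)
    return 'retry'
  }
  return 'give_up'
}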

Token Budget Continuation

Claude Code now reasons about turn-level budget, not only hard model limits.

Newer Claude Code versions don't only stop on model limits. They also track a per-turn token budget and can proactively inject a continuation nudge before the assistant appears done, then stop once progress shows diminishing returns.

typescript
// query/tokenBudget.ts
COMPLETION_THRESHOLD = 0.9     // stop nudging past 90% of the budget
DIMINISHING_THRESHOLD = 500    // minimum token gain that still counts as progress

checkTokenBudget(tracker, agentId, budget, globalTurnTokens)

// continue when:
// - main thread only (no subagent)
// - a budget exists and is > 0
// - the turn is still below 90% of the budget
// - token gain is still meaningful

// stop when:
// - diminishing returns detected
// - or a prior continuation already happened and the turn is now wrapping up

// tracker remembers:
//   continuationCount, lastDeltaTokens, lastGlobalTurnTokens, startedAt
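
Fleshed out, the decision looks roughly like this. Only the constants, parameter list, and tracker fields come from the pseudocode above; the body is a hedged reconstruction, not the real implementation:

typescript
// Hedged sketch of checkTokenBudget; logic reconstructed from the
// comments above, exact behavior in query/tokenBudget.ts may differ.
type BudgetTracker = {
  continuationCount: number
  lastDeltaTokens: number
  lastGlobalTurnTokens: number
  startedAt: number
}

const COMPLETION_THRESHOLD = 0.9   // stop nudging past 90% of the budget
const DIMINISHING_THRESHOLD = 500  // tokens gained per check to count as progress

function checkTokenBudget(
  tracker: BudgetTracker,
  agentId: string | undefined,    // set for subagents, which are excluded
  budget: number | undefined,
  globalTurnTokens: number,
): 'continue' | 'stop' {
  if (agentId || !budget || budget <= 0) return 'stop'   // main thread only

  const delta = globalTurnTokens - tracker.lastGlobalTurnTokens
  tracker.lastDeltaTokens = delta
  tracker.lastGlobalTurnTokens = globalTurnTokens

  const diminishing = delta < DIMINISHING_THRESHOLD               // progress stalled
  const nearBudget = globalTurnTokens >= budget * COMPLETION_THRESHOLD
  if (diminishing || nearBudget) return 'stop'

  tracker.continuationCount++   // a nudge was injected this turn
  return 'continue'
}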

Stop Hooks & Background Work

Use this section to understand why 'done' in the loop is not the true end of a turn.

The loop's 'done' path is not really the end. handleStopHooks() can still prevent continuation, summarize hook output, snapshot cache-safe params for fork reuse, kick off prompt suggestions, extract memories, and run auto-dream style background maintenance.

typescript
// query/stopHooks.ts
handleStopHooks(...)
  → saveCacheSafeParams(createCacheSafeParams(stopHookContext))
  → executePromptSuggestion(stopHookContext)
  → executeExtractMemories(stopHookContext)
  → executeAutoDream(stopHookContext)
  → executeStopHooks(permissionMode, signal, ...)
  → emit hook progress / attachment messages
  → optionally prevent continuation
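
As a function, that phase has roughly the following shape. Only the call order comes from the outline above; the context type, the return shape, and the 'block' decision value are assumptions in this sketch:

typescript
// Hypothetical shape of the stop-hook phase; types are illustrative.
type StopHookContext = { permissionMode: string; signal: AbortSignal }
type HookResult = { decision?: 'block'; messages?: Message[] }

declare function createCacheSafeParams(ctx: StopHookContext): unknown
declare function saveCacheSafeParams(params: unknown): void
declare function executePromptSuggestion(ctx: StopHookContext): Promise<void>
declare function executeExtractMemories(ctx: StopHookContext): Promise<void>
declare function executeAutoDream(ctx: StopHookContext): Promise<void>
declare function executeStopHooks(mode: string, signal: AbortSignal): Promise<HookResult[]>

async function handleStopHooksSketch(ctx: StopHookContext) {
  saveCacheSafeParams(createCacheSafeParams(ctx))  // snapshot params for fork reuse
  void executePromptSuggestion(ctx)                // background work, fire-and-forget
  void executeExtractMemories(ctx)
  void executeAutoDream(ctx)
  const results = await executeStopHooks(ctx.permissionMode, ctx.signal)
  return {
    preventContinuation: results.some(r => r.decision === 'block'),
    hookMessages: results.flatMap(r => r.messages ?? []),   // progress / attachments
  }
}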

Loop Exit Conditions

8 ways the loop can end. Only 'completed' is truly happy-path.

Happy path

  • completed — natural end of response: no more tool calls, no forced stop

Forced terminations (7 types) — user abort · context overflow · token limits · hook rejection

  • prompt_too_long — unrecoverable context overflow
  • max_output_tokens — output limit exhausted after recovery
  • aborted_streaming — user interrupt during the model call
  • aborted_tools — user interrupt during tool execution
  • stop_hook_prevented — hook rejected continuation
  • blocking_limit — hard context limit hit
  • token_budget_completed — token budget exhausted
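
Collected as a type, the terminal states look like this — a sketch from the list above; the real union lives in query.ts and its exact spelling may differ:

typescript
// Exit states as a union type (names from the list above).
type ExitReason =
  | 'completed'                // natural end of response
  | 'prompt_too_long'          // unrecoverable context overflow
  | 'max_output_tokens'        // output limit exhausted after recovery
  | 'aborted_streaming'        // user interrupt during the model call
  | 'aborted_tools'            // user interrupt during tool execution
  | 'stop_hook_prevented'      // hook rejected continuation
  | 'blocking_limit'           // hard context limit hit
  | 'token_budget_completed'   // token budget exhausted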

Token Budget Math

How the 200K context window is actually divided.

Claude's 200K context window sounds vast — but compaction triggers well before you reach it. Here's the real math:

200K token window breakdown

  Total context window                 200,000
  Reserved: max output tokens         −16,000
  Reserved: summary buffer            −20,000
  Effective context for conversation  ≈164,000

Auto-compaction triggers at context_size > (model_limit − max_output − 13K buffer). With Claude 3.5 Sonnet (200K window, 8K output), that fires around 179K tokens — roughly 90% of the window consumed before compaction kicks in.
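
The trigger condition is simple to check numerically. A worked sketch using the numbers above (constant names are illustrative):

typescript
// Worked example of the auto-compaction trigger from the formula above.
const MODEL_LIMIT = 200_000
const MAX_OUTPUT = 8_000
const COMPACT_BUFFER = 13_000

const compactThreshold = MODEL_LIMIT - MAX_OUTPUT - COMPACT_BUFFER  // 179,000

function shouldAutoCompact(contextTokens: number): boolean {
  return contextTokens > compactThreshold
}

shouldAutoCompact(150_000)  // false — plenty of headroom
shouldAutoCompact(180_000)  // true — ~90% of the window, compaction fires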

Message Flow Example

typescript
User: "write a hello.py file"
    ↓
QueryEngine.submitMessage(prompt)
    ↓
fetchSystemPromptParts() → [default prompt + 50 tools]
    ↓
processUserInput() → [user message + attachments]
    ↓
yield buildSystemInitMessage()
    ↓
query() loop iteration 1:
  ─ prepend user context (cwd, platform, git status)
  ─ call queryModelWithStreaming()
  ─ stream: "I'll create a Python file..."
  ─ stream: tool_use { name: "Write", input: { file_path, content } }
      ├─ addTool() to StreamingToolExecutor
      └─ model continues streaming...
  ─ tool completes → tool_result message
  ─ yield tool_result
    ↓
  ─ getAttachmentMessages() → file change notification
  ─ yield attachment message
    ↓
  ─ needsFollowUp = false (no more tool calls)
  ─ stop hooks pass
  ─ return { reason: 'completed' }
    ↓
Session ends, messages persisted to transcript.jsonl