
    When Autonomous Agents Become Infrastructure


    Theory-Practice Synthesis: February 2026 - When Autonomous Agents Become Infrastructure

    The Moment

    Three research papers published between December 2025 and February 2026. Eighty-six percent of Anthropic's corporate customers running coding agents in production. Thousands of agents deployed across Amazon's operations. Something fundamental has shifted in how organizations think about AI agents.

    We're witnessing the operationalization inflection point—the moment when agentic AI transitions from experimental tool to foundational infrastructure. This isn't hype. The convergence of mature theoretical frameworks with production-scale deployments marks a genuine phase change in enterprise capability. The decisions made now about evaluation, governance, and human-AI coordination will shape the next decade of organizational intelligence.

    This synthesis examines three theoretical advances alongside their business parallels to understand what emerges when theory meets practice at scale.


    The Theoretical Advance

    Paper 1: Code2World - Simulating Reality Through Code

    Code2World: A GUI World Model via Renderable Code Generation (arXiv:2602.09856, February 2026)

    Core Contribution: Code2World reframes world modeling as a code generation problem. Instead of predicting pixels or text, the system generates renderable HTML/CSS that represents the next visual state of a GUI. This achieves something previous approaches struggled with: simultaneously maintaining high visual fidelity and fine-grained structural controllability.

    The innovation lies in the training methodology. The team constructed AndroidCode, translating 80K+ GUI trajectories into high-fidelity HTML through a visual-feedback revision mechanism. They then applied Render-Aware Reinforcement Learning, using the rendered outcome itself as the reward signal—enforcing visual semantic fidelity and action consistency simultaneously.
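
    The render-as-reward idea can be sketched in miniature. The toy below scores a generated page against the ground-truth next state by DOM-structure overlap; the actual system renders the code and judges visual fidelity against pixels, so everything here (including the signature function) is an illustrative stand-in, not the paper's method.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (tag, sorted attribute names) pairs as a crude DOM signature."""
    def __init__(self):
        super().__init__()
        self.sig = []

    def handle_starttag(self, tag, attrs):
        self.sig.append((tag, tuple(sorted(name for name, _ in attrs))))

def dom_signature(html: str):
    parser = TagCollector()
    parser.feed(html)
    return parser.sig

def render_aware_reward(generated_html: str, target_html: str) -> float:
    """Toy proxy for a render-based reward: positional overlap between the
    generated page's structure and the ground-truth next state's structure."""
    gen, tgt = dom_signature(generated_html), dom_signature(target_html)
    if not tgt:
        return 0.0
    matched = sum(1 for a, b in zip(gen, tgt) if a == b)
    return matched / max(len(gen), len(tgt))
```

    A policy trained against such a signal is rewarded only when the code it emits actually renders into the right structure, which is the point of using the rendered outcome itself as supervision.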

    Why It Matters: Code2World-8B matches GPT-5 and Gemini-3-Pro-Image performance while boosting downstream navigation success rates by +9.5% on AndroidWorld benchmarks. More significantly, it demonstrates that world models don't need to simulate pixels—they can simulate the *structure* of reality through executable representations. This opens a path toward agents that understand interfaces not through computer vision alone, but through the generative logic that creates interfaces.

    Paper 2: Agentic Reasoning - A Unified Framework

    Agentic Reasoning for Large Language Models (arXiv:2601.12538, January 2026)

    Core Contribution: This comprehensive survey redefines LLMs as autonomous agents capable of planning, acting, and learning through continuous environmental interaction. The framework organizes agentic reasoning along three complementary dimensions:

    1. Environmental Dynamics (three layers):

    - Foundational agentic reasoning: core single-agent capabilities (planning, tool use, search)

    - Self-evolving reasoning: agents refining capabilities through feedback, memory, adaptation

    - Collective multi-agent reasoning: coordination, knowledge sharing, collaborative settings

    2. Reasoning Modes:

    - In-context reasoning: scaling test-time interaction through structured orchestration

    - Post-training reasoning: optimizing behaviors via reinforcement learning and supervised fine-tuning

    3. Application Domains: science, robotics, healthcare, autonomous research, mathematics

    Why It Matters: The survey synthesizes fragmented research into a unified roadmap that bridges "thought and action." It makes explicit what was implicit: that agentic systems require fundamentally different architectural patterns than conversational LLMs. The three-layer environmental model provides builders with a conceptual framework for thinking about agent complexity—helping distinguish between agents that execute plans, agents that learn from experience, and agents that coordinate with other agents.
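
    The foundational layer reduces to a small control loop: the agent reads what it has observed so far, picks a tool call, and feeds the result back in. The sketch below is a minimal illustration, not code from the survey; the policy and tools are hypothetical stand-ins for an LLM planner and real APIs.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument); ("stop", answer) ends the run

def agent_loop(
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> List[str]:
    """Foundational single-agent loop: the policy inspects the observation
    history, chooses a tool call, and the tool's result is appended back."""
    history: List[str] = []
    for _ in range(max_steps):
        tool, arg = policy(history)
        if tool == "stop":
            history.append(f"answer: {arg}")
            break
        history.append(f"{tool}({arg}) -> {tools[tool](arg)}")
    return history

# Hypothetical stand-ins: a one-tool "environment" and a hard-coded policy.
toy_tools = {"search": lambda q: "Paris" if "France" in q else "unknown"}

def toy_policy(history: List[str]) -> Action:
    if not history:
        return ("search", "capital of France")
    return ("stop", history[-1].split("-> ")[1])
```

    The self-evolving and collective layers build on exactly this loop: the former by letting feedback modify the policy between episodes, the latter by letting other agents appear behind the tool interface.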

    Paper 3: Adaptation of Agentic AI - Design Strategies for Evolution

    Adaptation of Agentic AI (arXiv:2512.16301, December 2025)

    Core Contribution: This paper presents a systematic framework for agent and tool adaptation, clarifying how agentic systems improve performance, reliability, and generalization over time. The framework decomposes adaptation into:

    Agent Adaptations:

    - Tool-execution-signaled: adaptation triggered by tool call results (errors, unexpected formats, authentication failures)

    - Agent-output-signaled: adaptation driven by reasoning traces, plan coherence, output quality

    Tool Adaptations:

    - Agent-agnostic: improving tools independently of which agent uses them

    - Agent-supervised: tools that adapt based on specific agent interaction patterns

    Why It Matters: The framework makes trade-offs explicit. It provides practical guidance for system designers choosing between adaptation strategies based on: data availability, computational constraints, deployment context, and acceptable failure modes. Most importantly, it acknowledges that adaptation is not optional—it's structural. Agentic systems must adapt or decay.
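
    The agent-side split can be read as a routing decision over failure signals. A minimal sketch, assuming illustrative event fields (none of these names come from the paper):

```python
def choose_adaptation(event: dict) -> str:
    """Route an observed failure to an adaptation strategy, mirroring the
    split between tool-execution signals and agent-output signals.
    The event fields checked here are illustrative placeholders."""
    if event.get("tool_error") in {"timeout", "auth_failure", "bad_format"}:
        # e.g. add a retry policy, repair the schema, refresh credentials
        return "tool-execution-signaled"
    if event.get("plan_incoherent") or event.get("low_output_quality"):
        # e.g. revise the planning prompt, insert a reflection step
        return "agent-output-signaled"
    return "no-adaptation"
```

    The value of the taxonomy is that this routing is made explicit and auditable, rather than buried in ad hoc retry logic.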


    The Practice Mirror

    Business Parallel 1: Anthropic Claude Code - Production Autonomy at Scale

    Implementation: Anthropic deployed Claude Code as a production coding agent with remarkable adoption: 86% of corporate customers now use coding agents in production environments. The system demonstrates what theoretical frameworks predict but rarely achieve at scale.

    Key Metrics (October 2025 - January 2026):

    - 99.9th-percentile turn duration roughly doubled, from under 25 minutes to over 45 minutes

    - 59% productivity gains across planning, generation, and review phases

    - New users auto-approve ~20% of sessions; experienced users approve ~40%

    - Agent self-stops for clarification 2x more often than humans interrupt on complex tasks

    Connection to Theory: Claude Code operationalizes the self-evolving reasoning layer from the Agentic Reasoning survey. The system exhibits uncertainty calibration—knowing when to pause and ask for human guidance. Anthropic's research on agent autonomy reveals a deployment overhang: agents are theoretically capable of 5-hour autonomous tasks (per METR evaluations) but operate around 45 minutes in practice. This gap represents not a capability ceiling but a trust-building phase.

    Outcomes: Internal Anthropic data shows success rates on challenging tasks doubled while human intervention decreased from 5.4 to 3.3 interventions per session. The pattern reveals something counterintuitive: as agents become more capable, experienced users grant them more autonomy *while* interrupting more strategically. Oversight becomes targeted rather than constant.

    Business Parallel 2: Amazon Agentic AI Evaluation Framework - Governance at Enterprise Scale

    Implementation: Amazon deployed a comprehensive evaluation framework across thousands of agents spanning shopping assistants, customer service, and seller operations. The system addresses a challenge theory largely ignores: how to validate agent behavior at production scale.

    Framework Components:

    1. Shopping Assistant: Onboards hundreds of tools from Amazon APIs, handles multi-turn conversations, and tracks tool selection accuracy metrics

    2. Customer Service: Intent detection via LLM simulator, routing correctness, resolution success rates

    3. Seller Assistant: Multi-agent collaboration metrics, inter-agent communication scores, subtask completion rates

    Evaluation Dimensions:

    - Agent output quality (correctness, faithfulness, helpfulness, relevance)

    - Task completion (goal success, goal accuracy)

    - Tool use (selection accuracy, parameter accuracy, error rates, multi-turn function calling)

    - Memory (context retrieval precision/recall)

    - Multi-turn (topic adherence, topic refusal)

    - Reasoning (grounding accuracy, faithfulness score, context score)

    - Responsibility and safety (hallucination, toxicity, harmfulness)
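
    The simplest of these dimensions, tool selection accuracy, is small enough to sketch end to end. The episode format below is an assumption made for illustration, not Amazon's actual schema:

```python
def tool_selection_accuracy(episodes) -> float:
    """Fraction of turns where the agent chose the expected tool.
    Assumed episode format: each episode is a list of
    (chosen_tool, expected_tool) pairs, one per turn."""
    turns = [pair for episode in episodes for pair in episode]
    if not turns:
        return 0.0
    return sum(chosen == expected for chosen, expected in turns) / len(turns)
```

    Most of the other dimensions follow the same pattern (a per-turn judgment aggregated over conversations) but require an LLM judge or ground-truth labels rather than an exact match.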

    Connection to Theory: Amazon's evaluation infrastructure exposes a critical gap in theoretical frameworks. The Adaptation of Agentic AI paper provides adaptation strategies but doesn't address *evaluation* of those adaptations at scale. Amazon built what theory assumed existed: systematic measurement of agent reasoning chains, tool selection logic, and multi-agent coordination patterns.

    Outcomes: The framework enabled systematic improvement across agent deployments. For shopping assistants, automated API-to-tool onboarding using LLMs reduced months of manual work to weeks. For customer service, intent detection accuracy improved through continuous evaluation against ground truth from historical interactions. The human-in-the-loop (HITL) audits, in which humans review evaluation results, proved critical for catching edge cases that automated metrics missed.

    Business Parallel 3: Enterprise World Models - From Language to Causality

    Implementation: Launch Consulting documented a fundamental shift in enterprise AI strategy: organizations are moving from language prediction to simulation-driven decision intelligence. Financial services firms model liquidity shocks, multi-agent trading behaviors, and cascading counterparty risk. Manufacturing operations deploy digital twins for predictive system optimization and pre-deployment scenario testing.

    Key Insight: World models aren't replacing LLMs—they're *complementing* them. Future architectures integrate:

    - Small language models for domain tasks

    - Large language models for reasoning and communication

    - Simulation-based world models for system orchestration

    Connection to Theory: Code2World demonstrates next-state prediction through renderable code. Enterprise adoption reveals what that enables at business scale: decision rehearsal. Organizations can stress-test major decisions before capital commitment, model market expansion across hundreds of potential futures, and pre-model risk rather than post-analyze it.
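
    Mechanically, decision rehearsal amounts to sampling many futures from a world model and inspecting the outcome distribution before committing. A minimal Monte Carlo sketch, with `simulate` standing in for a trained simulation model:

```python
import random

def rehearse_decision(simulate, n_futures: int = 1000, seed: int = 0) -> dict:
    """Decision rehearsal, sketched: run a world-model step function over
    many sampled futures and summarize the payoff distribution before
    committing capital. `simulate` is a stand-in taking an RNG -> payoff."""
    rng = random.Random(seed)
    outcomes = sorted(simulate(rng) for _ in range(n_futures))
    return {
        "mean": sum(outcomes) / n_futures,
        "p05": outcomes[int(0.05 * n_futures)],  # tail risk
        "p95": outcomes[int(0.95 * n_futures)],  # upside
    }
```

    The hard part is not this loop but the fidelity of `simulate`, which is exactly where the instrumentation requirements discussed below come from.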

    Outcomes: The critical shift is in data practice, from collection to observation. Unlike LLMs (which tolerate noisy datasets), world models require high-fidelity behavioral telemetry, environmental signals, and agent interaction data. Organizations that fail to instrument their systems properly struggle to train reliable simulation environments. Quality matters more than volume.


    The Synthesis

    When we view theory and practice together, several patterns emerge that neither alone reveals:

    Pattern 1: Deployment Overhang - Trust Lags Capability

    Code2World's theoretical framework enables extended autonomous operation. The Agentic Reasoning survey maps self-evolving agent capabilities. Yet Anthropic's production data shows agents operate far below their capability ceiling: 45 minutes actual runtime versus 5 hours theoretical (METR evaluation).

    What This Reveals: We're in a trust-building phase where organizations preserve human sovereignty even when theory proves capability. The gap isn't technical—it's organizational. Builders know agents *can* operate autonomously for hours. But they've architected systems that require human oversight anyway.

    This isn't caution. It's wisdom. The transition from tool to infrastructure demands evidence of reliability at scale. Organizations that deployed agents without this evidence created brittle systems that failed in production. The deployment overhang represents learning from early failures.

    Pattern 2: Self-Limiting Intelligence - Agents Govern Themselves

    The Adaptation of Agentic AI paper's theoretical framework proposes various adaptation strategies. Anthropic's production data reveals which strategy wins: agent-initiated oversight. Claude self-stops for clarification 2x more often than humans interrupt on complex tasks.

    What This Reveals: The most reliable governance mechanism might be the agent itself. When properly calibrated, agents recognize their own uncertainty more reliably than humans recognize when to intervene. This inverts the traditional safety paradigm—instead of building external guardrails, build internal uncertainty calibration.

    Amazon's evaluation framework validates this. Their metrics measure not just outcomes but reasoning chain coherence. When agents demonstrate transparent reasoning and predictable failure modes, trust becomes measurable. This makes self-limiting intelligence a technical artifact rather than a philosophical aspiration.

    Gap 1: The Evaluation Crisis - Validation Lags Innovation

    Theory provides world models (Code2World), reasoning frameworks (Agentic Reasoning survey), and adaptation strategies (Adaptation of Agentic AI). Amazon's evaluation infrastructure exposes the bottleneck: theoretical advances outpace our ability to validate them at scale.

    The Gap: No theoretical framework addresses how to evaluate collective behavior in multi-agent ecosystems. Amazon built thousands of agents requiring multi-dimensional evaluation (quality, performance, responsibility, cost). They constructed evaluation infrastructure from first principles because theory provided no guidance.

    What This Reveals: The operationalization challenge isn't building agents—it's validating them. As agents move from experimental tools to foundational infrastructure, evaluation becomes the constraining factor. Organizations need continuous production monitoring, HITL validation, and automated anomaly detection. Theory assumes this infrastructure exists. Practice proves it doesn't.

    Gap 2: From Simulation to Causality - Next-State Isn't Enough

    Code2World offers GUI world models for next-state prediction. Launch Consulting's enterprise adoption reveals what organizations actually need: causal reasoning. Predicting *what* changes next doesn't suffice—organizations need to understand *why* systems change.

    The Gap: Financial services firms modeling counterparty risk cascades need causal models, not just next-state predictors. Manufacturing operations optimizing supply chains need to understand causal mechanisms, not just observe correlated patterns.

    What This Reveals: The shift from language models to world models represents an intermediate step. The ultimate goal is causality modeling—understanding the mechanisms that generate observed patterns. Code2World provides a bridge (structure through code rather than pixels) but doesn't close the conceptual gap between correlation and causation.

    Emergence 1: The Governance Paradox - More Autonomy Requires More Oversight

    Amazon's evaluation framework and Anthropic's production metrics converge on a counterintuitive finding: as agents become more capable, governance infrastructure must become *exponentially* more sophisticated—not less.

    The Paradox: Early autonomous systems required simple oversight: human approval before critical actions. Advanced autonomous systems require nuanced oversight: measuring reasoning chains, validating tool selection logic, monitoring inter-agent coordination, tracking uncertainty calibration, ensuring alignment with business objectives.

    What Emerges: Governance scales with capability rather than remaining constant. This has profound implications for organizational design. Companies deploying advanced agents need evaluation infrastructure, continuous monitoring systems, automated anomaly detection, HITL validation loops, and cross-functional governance teams. The cost of governance grows alongside the benefit of autonomy.

    This explains why Amazon built such comprehensive evaluation infrastructure. It's not bureaucracy—it's necessity. At thousands of agents across critical business functions, governance becomes the difference between infrastructure and liability.

    Emergence 2: Trust as a Technical Artifact - Architecture Over Psychology

    Anthropic's data shows experienced users grant more autonomy (40% auto-approve) while interrupting more strategically (9% interrupt rate versus 5% for new users). The Agentic Reasoning survey's self-evolving framework provides the theoretical basis: agents adapt through feedback, memory, and learning.

    What Emerges: Trust isn't psychological—it's architectural. Trust emerges from:

    1. Agent uncertainty calibration: Systems that know when they don't know

    2. Transparent reasoning traces: Legible decision-making processes

    3. Predictable failure modes: Bounded risk profiles with known recovery patterns

    This reframes the AI alignment problem. Instead of asking "How do we make agents safe?" ask "How do we make agent reasoning legible, uncertainty calibrated, and failure modes predictable?" These are technical properties that can be designed, measured, and validated.

    Amazon's evaluation framework operationalizes this. Their metrics assess grounding accuracy (reasoning alignment with data), faithfulness score (logical consistency), and context score (each step grounded appropriately). These aren't safety theater—they're trust engineering.
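
    A toy version of one such metric shows why these are measurable properties rather than slogans. Real evaluators typically use an LLM judge; the substring check below is a deliberate simplification and not Amazon's method:

```python
def grounding_accuracy(claims, context: str) -> float:
    """Toy grounding metric: fraction of claimed facts whose key terms
    all appear in the retrieved context. Production systems would use
    an LLM judge instead of substring matching."""
    ctx = context.lower()

    def grounded(claim: str) -> bool:
        terms = [word for word in claim.lower().split() if len(word) > 3]
        return all(term in ctx for term in terms)

    if not claims:
        return 0.0
    return sum(grounded(claim) for claim in claims) / len(claims)
```

    Even this crude version makes trust auditable: a falling grounding score is an alarm you can page on, which is what turns trust from a feeling into an artifact.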


    Implications

    For Builders: Infrastructure Demands Different Design Patterns

    If agents are becoming infrastructure rather than tools, builders must adopt infrastructure-grade design patterns:

    1. Evaluation-First Architecture

    Don't build agents then figure out evaluation. Design evaluation infrastructure alongside agent capabilities. Amazon's approach—comprehensive metrics across quality, performance, responsibility, and cost—represents the minimum viable governance layer for production deployment.

    2. Uncertainty as a First-Class Citizen

    Anthropic's finding (agents self-stop 2x more than humans interrupt) proves that calibrated uncertainty beats external guardrails. Build agents that recognize and communicate their own limitations. This requires: explicit uncertainty estimation, confidence thresholding for autonomous decisions, and graceful degradation when confidence drops below thresholds.
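
    Confidence thresholding for autonomous decisions can be as small as a two-threshold gate. The values below are illustrative; a production system would calibrate them against observed failure rates:

```python
def decide(action: str, confidence: float,
           auto_threshold: float = 0.85,
           escalate_threshold: float = 0.5):
    """Gate autonomy on calibrated confidence. Thresholds here are
    illustrative placeholders, not recommended production values."""
    if confidence >= auto_threshold:
        return ("execute", action)
    if confidence >= escalate_threshold:
        return ("ask_human", action)  # agent-initiated oversight
    return ("abstain", action)        # graceful degradation
```

    The gate only works if `confidence` is calibrated, which is why uncertainty estimation has to be designed in rather than bolted on.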

    3. Trust Through Transparency

    Make reasoning chains legible. Provide tool selection justifications. Log uncertainty estimates. Expose decision pathways. Trust emerges from technical properties, not marketing claims. If you can't explain why your agent made a decision, you can't deploy it as infrastructure.

    4. Adaptation as Architectural Layer

    The Adaptation of Agentic AI framework isn't optional—it's structural. Agents that can't adapt will decay as environments shift. Choose adaptation strategies intentionally: tool-execution-signaled for production systems with clear failure signals, agent-output-signaled for systems where reasoning quality matters more than outcome metrics.

    For Decision-Makers: The Governance Investment Thesis

    The governance paradox has budget implications. Decision-makers face a choice:

    Option A: Deploy agents with minimal governance infrastructure. Achieve short-term productivity gains. Risk systematic failures when agents operate outside training distributions. Discover evaluation limitations only after production incidents.

    Option B: Invest in comprehensive evaluation infrastructure before wide deployment. Build continuous monitoring systems. Establish HITL validation loops. Accept higher upfront costs for lower long-term risk.

    Amazon chose Option B. Their evaluation framework investment enables thousands of agents across critical business functions. The ROI calculation shifts from "cost of evaluation" to "cost of systematic failure without evaluation."

    Strategic Implication: Organizations that treat agent evaluation as optional will face the same fate as organizations that treated software testing as optional in the 2000s. The debt compounds. The cost of retrofitting governance infrastructure exceeds the cost of building it correctly from the start.

    For the Field: The Causal Reasoning Frontier

    The shift from language models to world models represents progress but not completion. Launch Consulting's enterprise adoption data reveals the next frontier: causality modeling.

    Current State: Code2World predicts next states. Agents reason about sequences of actions. Organizations simulate outcomes across parameter spaces.

    Gap: None of this captures *why* systems behave as they do. Causal mechanisms remain implicit. This limits trust in novel scenarios (where correlational patterns might not hold) and constrains transfer learning across domains.

    Research Direction: The field needs theoretical frameworks that combine:

    - Structural causal models for mechanism understanding

    - World models for trajectory simulation

    - Agentic reasoning for goal-directed behavior

    This synthesis would enable agents that don't just predict "next state" but understand "why this state follows from that intervention." The business value is clear: organizations need to understand causal mechanisms to deploy agents in high-stakes domains where outcomes matter and explanations are required (regulatory compliance, medical diagnosis, financial risk management).


    Looking Forward

    February 2026 marks an inflection point, not a destination.

    Three theoretical frameworks matured simultaneously. Multiple organizations achieved production-scale deployment. The convergence suggests we've crossed from experimentation to operationalization—from tool to infrastructure.

    But the synthesis reveals work ahead:

    The evaluation crisis demands new theoretical frameworks for validating collective agent behavior at scale. Theory currently provides agent architectures but not validation methodologies.

    The causality gap requires moving beyond next-state prediction toward mechanism understanding. Organizations need agents that explain *why*, not just predict *what*.

    The governance paradox forces a reckoning: more capable agents demand more sophisticated oversight. The industry must invest in evaluation infrastructure at scale commensurate with deployment ambitions.

    The organizations that recognize this early—that build evaluation infrastructure alongside agent capabilities, that treat uncertainty calibration as a technical requirement rather than a nice-to-have, that invest in legible reasoning before wide deployment—these organizations will define what it means to operate AI infrastructure at scale.

    For those still treating agents as experimental tools: the window for that mindset is closing. The question is no longer "Should we deploy agents?" but "How do we deploy them as reliable infrastructure?"

    Theory has provided the frameworks. Practice has demonstrated viability at scale. The synthesis reveals what's required for the transition from tool to infrastructure.

    What remains is execution.


    Sources

    Academic Papers:

    - Code2World: A GUI World Model via Renderable Code Generation (February 2026)

    - Agentic Reasoning for Large Language Models (January 2026)

    - Adaptation of Agentic AI (December 2025)

    Business Sources:

    - Measuring AI agent autonomy in practice - Anthropic (February 2026)

    - Evaluating AI agents: Real-world lessons from Amazon - AWS (2026)

    - World Models: The Next Phase of Enterprise AI - Launch Consulting (2026)
