Theory-Practice Synthesis: When Agents Need Rehearsal Grounds Before Reality
The Moment: February 2026's Inflection Point
We're witnessing something peculiar this February. While AI models solve PhD-level mathematics, they stumble on riddles a child could answer. Ask a leading model "Where does Christmas come before Thanksgiving?" and it correctly says "in the dictionary." Reverse the question—"Where does Thanksgiving come before Christmas?"—and watch it confidently explain that, alphabetically, Thanksgiving precedes Christmas, which it doesn't.
Salesforce AI Research calls this "jagged intelligence"—sharp peaks of brilliance alongside unexpected valleys of weakness. What makes February 2026 significant is not that this phenomenon exists, but that enterprises can no longer ignore it. The papers from Hugging Face's February 20th digest reveal something essential: the gap between AI capability and AI reliability has become the central engineering challenge of our time.
Four papers from this week's digest illuminate different facets of the same underlying problem—and more importantly, they converge on a solution that enterprise practice is simultaneously discovering through hard-won deployment experience.
The Theoretical Advance
1. Mobile-Agent-v3.5: The Multi-Platform Orchestration Problem
Mobile-Agent-v3.5 (also known as GUI-Owl-1.5) represents a fundamental advance in how AI agents coordinate across computing environments. The paper introduces three critical innovations:
Hybrid Data Flywheel: Rather than relying solely on expensive human annotation or purely synthetic data, the researchers built a self-improving pipeline that combines simulated environments with cloud-based sandbox testing. This generates realistic trajectories at scale while maintaining quality through human-in-the-loop validation. Each generation of trajectories becomes training data that improves the next.
Unified Agent Capabilities: The model doesn't just click buttons and type text. It orchestrates tool invocation through Model Context Protocol (MCP), maintains both short and long-term memory, and adapts its behavior based on whether it's operating as a standalone agent or part of a multi-agent system. This addresses a gap that's plagued production deployments: agents that work brilliantly in isolation but fail when coordinating with other systems.
Multi-Platform Reinforcement Learning (MRPO): Perhaps most significantly, the paper introduces an RL algorithm specifically designed for the chaos of training across desktops, mobile devices, and web browsers simultaneously. Traditional RL collapses when different platforms provide conflicting feedback. MRPO solves this through alternating optimization—training on one platform type at a time while maintaining cross-device generalization.
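The digest doesn't publish MRPO's internals, but the alternating scheme it describes can be caricatured in a few lines. Everything below is illustrative: the platform names, the round-robin schedule, and the `update` function are stand-ins, not the authors' implementation.

```python
def alternating_train(policy, platforms, rounds, steps_per_round, update):
    """Alternating optimization: each round draws feedback from one platform
    only, so conflicting cross-platform reward signals never mix inside a
    single update, while round-robin cycling preserves generalization."""
    for r in range(rounds):
        platform = platforms[r % len(platforms)]   # round-robin over platforms
        for _ in range(steps_per_round):
            policy = update(policy, platform)
    return policy

# Toy stand-in: the "policy" is a per-platform skill score each update bumps.
platforms = ["desktop", "mobile", "web"]
policy = {p: 0.0 for p in platforms}
bump = lambda pol, plat: {**pol, plat: pol[plat] + 1.0}
policy = alternating_train(policy, platforms, rounds=6, steps_per_round=2, update=bump)
# Every platform ends with equal training exposure: {"desktop": 4.0, "mobile": 4.0, "web": 4.0}
```

The design choice worth noticing is isolation per round: interference between platforms is avoided by construction, not by reward reweighting.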
The results speak for themselves: 56.5% success rate on OSWorld (desktop), 71.6% on AndroidWorld (mobile), 48.4% on WebArena (web)—state-of-the-art performance across the board.
Why It Matters: Multi-platform coordination isn't an academic curiosity. Every enterprise agent must operate across heterogeneous environments—sometimes on edge devices for latency-sensitive tasks, sometimes in cloud infrastructure for compute-intensive reasoning. Mobile-Agent-v3.5 proves this can be done reliably.
2. Calibrate-Then-Act: The Economics of Uncertainty
The Calibrate-Then-Act framework asks a deceptively simple question: When should an AI agent commit to an answer, and when should it spend resources gathering more information?
Every API call costs money. Every additional retrieval increases latency. Every verification step delays the user. The framework formalizes this as a sequential decision-making problem where agents must balance cost against uncertainty.
The Core Insight: Rather than forcing agents to learn cost-awareness implicitly through end-to-end training, the researchers provide explicit priors about uncertainty. They show that even a modest 8B parameter model (Qwen3-8B) can reason optimally about exploration-exploitation tradeoffs when given two pieces of information: (1) its own confidence in direct answers, and (2) the cost of gathering additional information.
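That decision rule can be written as a one-step expected-loss comparison. The sketch below is a minimal reading of the idea, not the paper's algorithm; in particular, `post_info_confidence` (how sure the agent expects to be after gathering information) is an assumed simplification.

```python
def should_gather_info(confidence, error_cost, info_cost, post_info_confidence=0.99):
    """Decide between answering now and paying for more information.

    Expected loss of answering now:      (1 - confidence) * error_cost
    Expected loss after gathering info:  info_cost + (1 - post_info_confidence) * error_cost
    Gather info only when it lowers expected loss.
    """
    answer_now_loss = (1 - confidence) * error_cost
    gather_loss = info_cost + (1 - post_info_confidence) * error_cost
    return gather_loss < answer_now_loss

# High confidence, cheap error: answer directly.
should_gather_info(confidence=0.95, error_cost=1.0, info_cost=0.2)   # False
# Low confidence, expensive error: pay for retrieval.
should_gather_info(confidence=0.60, error_cost=10.0, info_cost=0.5)  # True
```

The point of making the priors explicit is that even a small model can evaluate this comparison reliably, instead of having to learn cost-awareness implicitly end-to-end.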
The method works across three scenarios with increasing realism:
- Pandora's Box: A controlled environment where optimal behavior is mathematically provable. With explicit priors, the model achieves 94% optimal match rate—nearly perfect decision-making.
- Knowledge QA: The model learns when to retrieve external information versus relying on parametric knowledge, calibrated through isotonic regression on validation data.
- Coding Tasks: Agents decide whether to write unit tests, run partial executions, or commit code directly—balancing verification cost against error risk.
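The calibration step mentioned in the Knowledge QA scenario rests on isotonic regression. Below is a minimal pure-Python sketch of the underlying pool-adjacent-violators algorithm; production code would typically use scikit-learn's `IsotonicRegression`, and the validation data here is invented for illustration.

```python
def isotonic_calibrate(scores, labels):
    """Fit a monotone map from raw confidence scores to calibrated accuracy
    using pool-adjacent-violators (the algorithm behind isotonic regression).
    Returns (sorted_scores, calibrated_values), aligned pairwise."""
    pairs = sorted(zip(scores, labels))
    blocks = [[float(y), 1.0] for _, y in pairs]   # (mean, weight) per block
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:        # monotonicity violated: pool
            v1, w1 = blocks[i]
            v2, w2 = blocks[i + 1]
            blocks[i] = [(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2]
            del blocks[i + 1]
            i = max(i - 1, 0)
        else:
            i += 1
    calibrated = []
    for value, weight in blocks:                   # expand pooled blocks back out
        calibrated.extend([value] * int(weight))
    return [s for s, _ in pairs], calibrated

# Validation data: raw confidences vs. whether the direct answer was correct.
scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
xs, ys = isotonic_calibrate(scores, labels)
# ys == [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]: the overconfident mid-range
# gets pulled toward observed accuracy while staying monotone in the score.
```

The calibrated values, not the raw confidences, are what feed the cost-aware decision rule.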
Why It Matters: This addresses the elephant in the room for production AI: runaway costs. Enterprises deploying agents at scale discover that naive implementations burn through API budgets. Calibrate-Then-Act provides a principled framework for resource governance.
3. "What Are You Doing?": The Trust Calibration Problem
A study of intermediate feedback in agentic in-car voice assistants reveals something counterintuitive: making agents more transparent doesn't just build trust—it actually improves task performance.
The study (N=45) used a dual-task paradigm where participants performed driving-related tasks while interacting with an agentic voice assistant. The experimental manipulation was simple: some participants received intermediate updates during multi-step processing ("I'm searching for nearby restaurants... I'm filtering by your dietary preferences..."), while others only received final responses.
The results were striking:
- Perceived speed improved despite identical actual completion times
- Trust increased significantly, particularly for complex, high-stakes tasks
- Task load decreased—users felt less cognitive burden when they understood what the agent was doing
- User experience improved across all measured dimensions
The interviews revealed nuanced preferences: users wanted high transparency initially to establish trust, but preferred progressively quieter operation once the system proved reliable—with adjustment based on task stakes and situational context.
Why It Matters: Human-AI coordination isn't just about building capable agents. It's about building agents that humans can work alongside. The transparency-efficiency tradeoff is real, but the optimal balance isn't what most engineers assume.
4. Computer-Using World Model: Simulating Before Acting
The Computer-Using World Model (CUWM) represents a fundamental architectural shift. Rather than agents learning solely through interaction with real environments, CUWM enables them to simulate consequences before taking action.
The innovation lies in its two-stage factorization:
1. Textual transition prediction: Given current UI state and a candidate action, predict what will change (in natural language)
2. Visual synthesis: Render those predicted changes into an actual screenshot
This separation of "what changes" from "how it looks" turns out to be critical for training efficiency and generalization.
The model is trained on offline UI transitions from real Microsoft Office applications, then refined with lightweight RL that aligns textual predictions with the structural requirements of computer-using environments. The training data comes from agents actually using software, not from human demonstrations—enabling scale.
At test time, a frozen agent uses CUWM to simulate and compare candidate actions before execution. "Should I click this button or that menu item?" The agent can try both in simulation, observe predicted outcomes, and choose the path most likely to succeed.
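Under those assumptions, test-time action selection reduces to a simulate-score-choose loop. The function names below (`predict_transition`, `render`, `score`) are placeholders for the two world-model stages and a task-progress estimator, not CUWM's actual API.

```python
def choose_action(state, candidate_actions, predict_transition, render, score):
    """Rehearse each candidate action in the world model before committing.

    predict_transition(state, action) -> natural-language change description
    render(description)               -> predicted observation / screenshot
    score(observation)                -> estimated task progress
    """
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        change = predict_transition(state, action)   # stage 1: "what changes"
        predicted = render(change)                   # stage 2: "how it looks"
        s = score(predicted)
        if s > best_score:
            best_action, best_score = action, s
    return best_action

# Toy table-driven stand-ins for the model components.
transitions = {"click_save": "dialog closes, file saved", "click_menu": "menu opens"}
choice = choose_action(
    state="dialog_open",
    candidate_actions=["click_save", "click_menu"],
    predict_transition=lambda s, a: transitions[a],
    render=lambda desc: desc,
    score=lambda obs: 1.0 if "saved" in obs else 0.0,
)
# choice == "click_save": the simulated outcome containing "saved" wins.
```

The crucial property is that no candidate action touches the real environment; only the winner is executed.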
Why It Matters: Real execution in desktop environments doesn't support counterfactual exploration. You can't undo database writes or roll back configuration changes. CUWM makes it possible to rehearse reality—a capability that's essential for high-stakes agent deployments.
The Practice Mirror
Business Parallel 1: UiPath's Journey from RPA to Agentic Automation
UiPath, a leader in robotic process automation, is experiencing the exact challenges Mobile-Agent-v3.5 addresses. Their 2025-2026 evolution reveals the operational complexity of multi-platform agent deployment.
The Implementation: UiPath's "agentic automation" platform now supports what they call "thinking agents"—systems that don't just execute predefined workflows but reason about optimal actions. Their FLEX cloud model enables edge-cloud collaboration: lightweight agents on local devices for latency-sensitive tasks, heavyweight reasoning in cloud infrastructure for complex decision-making.
The Challenge: Consistency across platforms. An agent that works flawlessly on Windows desktop might fail unpredictably on macOS or mobile environments. UiPath's engineering teams discovered that training agents independently for each platform creates fragmentation. Training them jointly without specialized algorithms creates interference—exactly the problem MRPO solves.
The Outcome: UiPath's approach mirrors the hybrid data flywheel concept. They use API-first architectures to generate synthetic trajectories, validate them in cloud sandboxes, and refine with selective human oversight. The shift from on-premise to cloud deployment wasn't just about hosting—it enabled the data collection infrastructure necessary for continuous improvement.
What Practice Reveals: Multi-platform orchestration requires more than model capability. It requires infrastructure that can generate training data across environments, maintain consistency despite platform differences, and support edge-cloud collaboration for both performance and privacy.
Business Parallel 2: The Token Budget Crisis
Every organization deploying LLM agents at scale hits the same wall: API costs spiraling out of control. The real-world evidence validates Calibrate-Then-Act's theoretical framework with remarkable precision.
The Implementation: Production teams have developed heuristics that map directly to cost-aware exploration principles:
- Model routing: Use GPT-5 for complex reasoning, Gemini Flash for simple queries (67% cost reduction documented by practitioners)
- Prompt caching: Store and reuse common responses (50-90% savings on repeated queries)
- Output token limits: Constrain response length for routine interactions
- Batching: Combine similar requests to reduce API call overhead
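A hedged sketch of how those heuristics compose into one routing policy; the thresholds, tier names, and cache shape are invented for illustration, not taken from any vendor's API.

```python
def route(query, complexity, confidence, cache, cheap_model, expensive_model,
          complexity_threshold=0.5, confidence_floor=0.8):
    """Compose the heuristics above: cache hits are free; simple queries the
    model is confident about go to the cheap tier; everything else goes to
    the expensive tier. Returns (answer, tier) for observability."""
    if query in cache:                                   # prompt caching
        return cache[query], "cache"
    if complexity < complexity_threshold and confidence >= confidence_floor:
        answer = cheap_model(query)                      # model routing: cheap path
        tier = "cheap"
    else:
        answer = expensive_model(query)                  # uncertain or complex query
        tier = "expensive"
    cache[query] = answer                                # populate cache for reuse
    return answer, tier

# Toy models; a real deployment would wrap actual API clients here.
cache = {}
cheap = lambda q: "cheap:" + q
expensive = lambda q: "expensive:" + q
route("2+2?", complexity=0.1, confidence=0.95, cache=cache,
      cheap_model=cheap, expensive_model=expensive)      # ("cheap:2+2?", "cheap")
route("2+2?", complexity=0.1, confidence=0.95, cache=cache,
      cheap_model=cheap, expensive_model=expensive)      # served from cache on repeat
```

Note how this is exactly the kind of ad-hoc rule stack the next paragraph criticizes: it works, but the thresholds are guesses rather than calibrated uncertainty estimates.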
The Challenge: These heuristics lack principled foundations. Teams optimize through trial and error rather than systematic reasoning about cost-uncertainty tradeoffs. A query that could be answered with 80% confidence using a cheap model gets routed to an expensive one because the routing logic is primitive.
The Outcome: Organizations report 40-60% cost reductions through aggressive optimization, but at the price of complexity. Systems accumulate layers of ad-hoc rules that are difficult to maintain and reason about. The economics work, but the engineering doesn't scale.
What Practice Reveals: Production deployments need explicit resource governance frameworks. The gap between "this works in the demo" and "this works at $50K/month in production" is the difference between research and operationalization. Calibrate-Then-Act provides the theoretical foundation that production systems are groping toward through empirical optimization.
Business Parallel 3: Enterprise Chatbots and the Transparency Paradox
Organizations deploying conversational AI face a consistent pattern: users don't trust silent agents, even when they're accurate. The intermediate feedback research validates what enterprises are learning through user complaints and abandonment metrics.
The Implementation: Leading implementations have adopted feedback loop architectures:
- Status updates during long-running operations
- Explanation of reasoning for high-stakes decisions
- Confidence indicators to signal uncertainty
- Progressive disclosure—more detail when stakes increase
The Challenge: Over-communication creates alert fatigue. Under-communication destroys trust. The optimal balance is context-dependent and user-specific. Initial implementations got this wrong—either agents that narrated every internal step (annoying) or agents that worked in silence (mysterious).
The Outcome: Mature systems implement adaptive transparency: high initial verbosity to establish trust, gradually reducing as the agent proves reliable, with dynamic adjustment based on task complexity and user preference. This mirrors the research findings almost perfectly.
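Adaptive transparency can be prototyped as a simple scalar policy. The constants below are illustrative tuning knobs, not values from the study; the point is the shape of the curve: verbosity falls with demonstrated reliability and rises with stakes.

```python
def verbosity(successful_interactions, task_stakes, base=4, trust_decay=0.3,
              stakes_weight=3, floor=1):
    """Number of intermediate updates to surface per task.
    task_stakes is in [0, 1]: 0.0 = trivial, 1.0 = critical.
    Trust earned through successful interactions reduces chatter;
    high stakes raise it back up; `floor` keeps the agent from going silent."""
    level = base - trust_decay * successful_interactions + stakes_weight * task_stakes
    return max(floor, round(level))

verbosity(0, task_stakes=0.2)    # new system, low stakes  -> chatty (5)
verbosity(10, task_stakes=0.2)   # proven system, low stakes -> quiet (2)
verbosity(10, task_stakes=1.0)   # proven system, high stakes -> verbose again (4)
```

A production version would replace the linear decay with per-user preference signals, but even this toy form captures the study's "high transparency first, quieter once trusted, louder when stakes rise" pattern.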
What Practice Reveals: Trust calibration requires more than technical feedback design. It requires organizational understanding of user psychology, cultural factors around AI acceptance, and iterative refinement based on real usage patterns. The "What Are You Doing?" study provides empirical validation for practices enterprises discovered through costly deployment experience.
Business Parallel 4: Salesforce's eVerse and the Simulation-First Paradigm
Salesforce AI Research's eVerse framework represents perhaps the most direct operationalization of world model principles in production enterprise systems. It's not theoretical research—it's live infrastructure powering Agentforce deployments.
The Implementation: eVerse creates what Salesforce calls "enterprise digital twins"—synthetic environments that mirror real business operations. The framework operates through three stages:
1. Synthesize: Generate realistic customer data, workflows, and edge cases (90% of domain experts rate the synthetic data as realistic or very realistic)
2. Measure: Stress-test agents across comprehensive scenarios including voice interactions with background noise, accents, poor connections
3. Train: Close performance gaps through reinforcement learning guided by human expertise (a 69-point jump in success rate, from 19% to 88%)
The Application: UCSF Health is using eVerse to train agents for healthcare billing—a domain where errors carry real-world consequences. The agents learn in simulation environments before touching real patient data. The training methodology is identical to CUWM's: simulate reality, observe outcomes, refine policy.
The Infrastructure Play: Launch Consulting documents the broader industry shift: enterprise AI is moving from "language prediction to reality simulation." Organizations are building proprietary simulation layers that competitors cannot easily replicate. The competitive advantage isn't in model capability—it's in the quality of rehearsal environments.
What Practice Reveals: The "jagged intelligence" problem Salesforce identified—agents brilliant one moment, baffling the next—requires simulation infrastructure to solve. You can't learn consistency through real-world deployment when inconsistency carries business risk. Simulation-first isn't a research curiosity; it's the operationalization pathway for production-grade agents.
The Synthesis: What Emerges When Theory Meets Practice
When we overlay these four theory-practice pairs, a meta-pattern crystallizes—one that neither theory alone nor practice alone fully reveals.
Pattern: Where Theory Predicts Practice
The theoretical frameworks accurately forecast real-world deployment challenges:
- Multi-platform coordination: Mobile-Agent-v3.5's MRPO algorithm predicts the exact orchestration problems UiPath encounters in production edge-cloud architectures
- Cost-awareness: The Calibrate-Then-Act framework maps directly to the token optimization strategies enterprises discover through painful trial and error
- Trust dynamics: Intermediate feedback research validates the transparency mechanisms mature chatbot deployments implement
Theory isn't catching up to practice or vice versa—they're converging on shared insights from different directions. The papers formalize patterns that production teams have been observing empirically.
Gap: Where Practice Reveals Theoretical Limitations
But business deployment also exposes what the papers don't address:
Jagged Intelligence: Salesforce's observation that agents can be simultaneously brilliant and baffling isn't captured in benchmark performance metrics. The papers report success rates (56.5% on OSWorld, 88% on enterprise tasks post-training) but not consistency variance. Practice reveals that reliability matters as much as capability.
Resource Governance Heuristics: Production cost optimization relies on pragmatic engineering (model routing, caching, batching) rather than principled uncertainty reasoning. The gap between Calibrate-Then-Act's elegant framework and messy production implementations highlights the difference between demonstration on synthetic problems and deployment at enterprise scale.
Cultural Factors: Trust calibration in practice requires organizational change management, not just technical feedback design. The intermediate feedback study provides psychological insights, but doesn't address how enterprises actually navigate the human dimensions of AI adoption—training, change resistance, governance structures.
These gaps don't invalidate the theory. They reveal that operationalization requires infrastructure beyond what papers can demonstrate.
Emergence: What the Combination Reveals That Neither Alone Shows
Here's the core synthesis, the insight that crystallizes only when viewing theory and practice together:
Operationalization requires simulation before production.
This isn't obvious from theory, which focuses on individual capability advances. It's not fully articulated in practice, where teams build simulation infrastructure ad-hoc without recognizing it as a systematic requirement.
But the pattern is unmistakable:
- World models enable testing: CUWM and eVerse both create rehearsal environments where agents can fail safely
- Cost-awareness enables resource governance: Explicit uncertainty reasoning makes token budgets tractable at scale
- Feedback enables trust: Transparency mechanisms calibrate human expectations to agent capability
- Multi-platform support enables scale: Edge-cloud orchestration balances performance, privacy, and compute economics
These aren't separable features you can implement independently. They're interdependent infrastructure requirements for production-grade agentic systems.
February 2026 marks an inflection point not because any single capability reached threshold, but because multiple requirements converged simultaneously. "Agent-ready" now means having:
- Testing environments (simulation infrastructure)
- Cost controls (resource governance frameworks)
- Transparency mechanisms (feedback loops)
- Platform orchestration (edge-cloud coordination)
Organizations that treat these as separate engineering challenges will struggle. Organizations that recognize them as facets of a unified operationalization problem will build competitive moats.
Temporal Relevance: Why This Matters Specifically in February 2026
Timing matters. These papers didn't emerge in a vacuum—they respond to pressure from production deployments.
2023-2024 was the LLM experimentation era: demos, pilots, proof-of-concepts. The question was "What's possible?"
2025-2026 is the agent production era: scale, reliability, economics. The question is "What's deployable?"
The shift from language models to agentic systems introduced new failure modes that experimentation didn't expose:
- Inconsistency at scale: What works in 100 test cases fails unpredictably in production
- Cost explosions: What costs $50 in development costs $50,000 in production
- Trust collapse: What impresses in demos confuses in daily use
- Platform fragmentation: What runs on one environment fails on another
The papers from February 20th directly address these production challenges. Theory is catching up to practice. Practice is validating theory. And the convergence creates clarity that neither stream provided alone.
Implications
For Builders: Infrastructure Before Intelligence
If you're engineering agentic systems, the synthesis points to clear priorities:
1. Build simulation environments first. Don't wait until you have a capable agent to build testing infrastructure. The simulation layer is more valuable than incremental capability improvements. Salesforce's 69-point improvement from human-guided RL in eVerse dwarfs the gains from switching to a slightly better base model.
2. Implement explicit resource governance. Token budgets aren't optional cost optimizations—they're architectural requirements. Design your agent with uncertainty-aware decision-making from the start, not as a retrofit after costs spiral.
3. Instrument for observability. The intermediate feedback research shows transparency isn't just user-facing—it's diagnostic infrastructure. If you can't explain what your agent is doing during operation, you can't debug when it fails. Build introspection mechanisms as core functionality, not afterthoughts.
4. Design for multi-platform from day one. Edge-cloud orchestration isn't about squeezing out latency gains. It's about enabling deployment architectures where:
- Privacy-sensitive operations happen on-device
- Compute-intensive reasoning happens in cloud
- Cost-sensitive queries route to smaller models
- High-stakes decisions route to verification layers
Treating this as a performance optimization problem rather than an architectural requirement leads to systems that can't scale beyond pilot deployments.
For Decision-Makers: The Simulation Moat
If you're evaluating AI investments, the synthesis reveals where competitive advantage accrues:
Proprietary simulation environments are the new moat. Model capabilities commoditize faster than infrastructure. Organizations building high-fidelity simulation layers for their specific domains create advantages that don't erode when OpenAI releases GPT-6 or Anthropic ships Claude-5.
Salesforce's CRMArena-Pro, UCSF Health's healthcare billing simulator, UiPath's multi-platform testing infrastructure—these aren't model wrappers. They're operationalization substrates that accumulate value over time as they capture domain knowledge and edge cases.
Resource governance is a strategic capability, not a cost center. The difference between $50K/month and $500K/month in API costs isn't just financial—it's the difference between sustainable experimentation and budget-constrained incrementalism. Organizations that solve cost-awareness systematically can iterate faster than those solving it through quarterly spending reviews.
Trust infrastructure compounds. Transparency mechanisms, feedback loops, and observability layers build organizational trust that enables broader deployment. The returns aren't immediate, but they're cumulative. Organizations that invest early in human-AI coordination patterns create cultures where AI adoption accelerates rather than stalls on trust concerns.
For the Field: From Benchmarks to Behavior
If you're advancing AI research, the synthesis suggests important redirections:
Consistency deserves equal attention to capability. The "jagged intelligence" phenomenon isn't a temporary artifact of current architectures—it's a fundamental challenge when deploying capable but unreliable systems. Research that improves average performance without addressing variance creates systems that are more frustrating, not more useful.
Cost-aware reasoning needs standardization. Calibrate-Then-Act provides a framework, but it's not yet a standard methodology. The field needs:
- Benchmark datasets for cost-aware decision-making
- Evaluation metrics that balance performance against resource consumption
- Training paradigms that bake in resource constraints from the start
Simulation infrastructure is under-studied relative to importance. World models for desktop software, enterprise workflows, and domain-specific operations are infrastructural advances that enable everything else. But they're less glamorous than capability benchmarks, so they receive less research attention relative to impact.
Human-AI coordination patterns need systematic treatment. The intermediate feedback study is empirical psychology as much as AI research. The field needs more work at the intersection of HCI, cognitive science, and machine learning—not just building more capable agents, but understanding how humans actually work alongside them.
Looking Forward: The Question Nobody's Asking Yet
We've established that simulation infrastructure is essential for production-grade agents. We've shown that theory and practice are converging on this insight simultaneously. We've outlined what builders should do, how decision-makers should invest, and where researchers should focus.
But here's the provocative question this synthesis raises:
What happens when simulation environments become more sophisticated than the reality they're modeling?
Right now, simulations are simplified approximations of reality—useful for testing, but obviously incomplete. Salesforce's CRMArena-Pro is realistic enough to train agents, but nobody would mistake it for actual customer service operations. CUWM predicts Office UI transitions, but can't capture every edge case.
But what happens as simulation fidelity increases? As world models incorporate more variables, capture more context, predict with greater accuracy?
At some point—perhaps sooner than we expect—simulation environments will contain richer instrumentation, better observability, more complete state representation than the real systems they model. The "digital twin" becomes more knowable than the physical twin.
When that happens, does the optimal strategy flip? Instead of using simulation to test agents before deploying to reality, do we deploy to simulation permanently and use reality as the validation layer?
This isn't abstract philosophy. It's an operational question already arriving in domains where simulation costs less than mistakes: healthcare diagnosis, financial risk assessment, infrastructure planning, legal reasoning.
February 2026's papers don't answer this question. But they make it unavoidable. The convergence of theory and practice around simulation-first infrastructure isn't the end of a research trajectory—it's the beginning of a much stranger one.
The organizations that recognize this earliest won't just operationalize AI more effectively. They'll reshape how we think about the relationship between simulation and reality itself.
Sources
Academic Papers:
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents - Tongyi Lab, Alibaba Group
- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Business Sources:
- Introducing eVerse: Enterprise Simulation Environments to Train AI Agents - Salesforce AI Research
- World Models: The Next Phase of Enterprise AI - Launch Consulting
- Managing and Reducing AI Agent Costs
- UiPath 2025 → 2026: The Year Automation Started Thinking - UiPath Community Forum