Prompted LLC

When Agents Learn to Budget

Q1 2026·3,000 words

InfrastructureGovernanceCoordination

Theory-Practice Synthesis: Feb 20, 2026 - When Agents Learn to Budget

The Sovereignty Infrastructure Emerging from February 2026 Research

The Moment

*February 23, 2026*

Something shifted in the last week of February 2026. Not loudly—this wasn't another model launch or capability leap. Instead, five papers published on February 20th quietly converged around a question that has haunted enterprise AI deployments since GPT-4: How do we build systems that preserve human sovereignty while operating at machine scale?

The papers—GUI-Owl-1.5's multi-platform agent architecture, the Calibrate-Then-Act framework for cost-aware exploration, empirical studies on adaptive feedback transparency, AlphaEvolve's evolutionary algorithm discovery, and the Computer-Using World Model—might seem disparate. GUI automation research sits beside game-theoretic optimization; human-computer interaction studies neighbor predictive UI modeling. But viewed together, they reveal something enterprises have been fumbling toward for 18 months: the operationalization of coordination without conformity.

This matters right now because we're at the inflection point where agentic systems transition from research prototypes to production infrastructure. The question is no longer "Can AI agents work?" but "How do we govern systems that make thousands of decisions per second while maintaining human oversight, economic accountability, and organizational sovereignty?"

The answers emerging from these five papers don't just advance their individual subfields. They encode something deeper: the computational prerequisites for post-scarcity coordination.

The Theoretical Advance

Paper 1: Mobile-Agent-v3.5 (GUI-Owl-1.5) - Multi-Platform Fundamental GUI Agents

arxiv.org/abs/2602.16855

The Alibaba X-PLUG team's GUI-Owl-1.5 achieves state-of-the-art performance across 20+ benchmarks by solving a problem that has plagued GUI automation since the RPA era: cross-platform brittleness. Previous agents treated each platform (desktop, mobile, browser) as separate domains requiring custom training. GUI-Owl-1.5 introduces three architectural innovations:

1. Hybrid Data Flywheel: Combines simulated environments with cloud-based sandbox environments for data collection, addressing the efficiency-quality tradeoff in trajectory generation. Rather than pure simulation (fast but unrealistic) or pure human demonstration (realistic but expensive), the hybrid approach leverages simulation for scale and sandbox environments for quality validation.

2. Unified Reasoning Enhancement: A thought-synthesis pipeline that treats tool use, memory management, and multi-agent adaptation as variations of the same underlying capability—contextual state manipulation. This isn't just architectural elegance; it's computationally tractable operationalization of Martha Nussbaum's capability framework.

3. MRPO (Multi-platform Reinforcement via Preference Optimization): An environment RL algorithm designed to handle cross-platform conflicts and long-horizon task inefficiency. The key insight: different platforms have different reward structures, and naive RL will overfit to the easiest platform. MRPO maintains separate value heads per platform while sharing a unified policy.

Why It Matters: GUI-Owl-1.5 moves from "automating specific tasks" to "reasoning about interface affordances." The model doesn't just click buttons—it understands what interfaces *mean*.

Paper 2: Calibrate-Then-Act - Cost-Aware Exploration in LLM Agents

arxiv.org/abs/2602.16699

The core theoretical contribution here is making cost-uncertainty tradeoffs explicit in agent reasoning. Most LLM agents treat exploration as binary: either commit to an answer or gather more information. The Calibrate-Then-Act (CTA) framework introduces a three-stage process:

1. Prior Estimation: Given task context, estimate the probability distribution over possible states of the latent environment

2. Cost-Benefit Analysis: Explicitly compute the expected value of information gathering vs. acting immediately, accounting for both computational cost and error cost

3. Conditional Execution: Only explore when the expected benefit exceeds the cost

This formalizes sequential decision-making under uncertainty as a proper optimization problem. The mathematical framework draws from information theory (value of information), economics (opportunity cost), and optimal stopping theory. Critically, CTA shows that explicitly conditioning agent policies on calibrated uncertainty estimates leads to more optimal exploration strategies, even under RL training.

Why It Matters: This is the first principled framework for agents to reason about their own uncertainty in economic terms. It's not just "Am I confident?" but "Is my uncertainty worth $X to resolve?"

Paper 3: "What Are You Doing?" - Effects of Intermediate Feedback from Agentic LLM In-Car Assistants

arxiv.org/abs/2602.15569

This empirical study (N=45) in attention-critical driving scenarios reveals something counterintuitive: transparency and efficiency aren't opposed—they're sequential. The research compared three feedback modes during multi-step AI assistant processing:

- Silent Operation: No intermediate feedback, final response only

- Planned Steps: Agent announces upcoming actions ("I'm going to search traffic data, then calculate alternate routes")

- Intermediate Results: Agent shares partial findings as they emerge ("Traffic on I-80 shows 15-minute delay")

Results showed intermediate feedback significantly improved perceived speed (despite objectively taking the same time), trust, and task load reduction. But interviews revealed the deeper pattern: users want adaptive verbosity. High transparency initially to establish trust, then progressively reduced verbosity as the system proves reliable, with dynamic adjustment based on task stakes and situational context.

Why It Matters: This empirically validates what enterprises are discovering in production: the trust-building phase requires explainability investments that would be inefficient in steady-state operation. Trust isn't binary—it's accumulated through transparency, then *enables* efficiency.

Paper 4: Discovering Multiagent Learning Algorithms with Large Language Models (AlphaEvolve)

arxiv.org/abs/2602.16928

Google DeepMind's AlphaEvolve is an evolutionary coding agent that discovers novel multiagent learning algorithms by treating algorithm design as a search problem in code space. The system evolved two new algorithms:

1. VAD-CFR (Volatility-Adaptive Discounted CFR): A regret minimization algorithm with volatility-sensitive discounting, consistency-enforced optimism, and hard warm-start policy accumulation. These mechanisms are non-intuitive—humans wouldn't have designed them this way.

2. SHOR-PSRO (Smoothed Hybrid Optimistic Regret PSRO): A meta-strategy solver that linearly blends Optimistic Regret Matching with a smoothed distribution over best pure strategies, with dynamic annealing during training.

The theoretical advance isn't just the specific algorithms (though they outperform human-designed baselines). It's the meta-insight: LLMs can navigate algorithmic design spaces by reasoning about mathematical properties, code structure, and empirical performance simultaneously. This is evolutionary computation with semantic search.

Why It Matters: We've automated the automation engineers. The implications for AI governance are profound—who owns algorithms discovered by AI? How do we audit systems we don't understand?

Paper 5: Computer-Using World Model (CUWM)

arxiv.org/abs/2602.17365

Microsoft Research introduces a world model for desktop software that predicts UI state changes before execution. The two-stage factorization is elegant:

1. Textual Transition Prediction: Given current UI state and candidate action, predict what will change *semantically* (e.g., "The email will move to the Sent folder, and the draft will be deleted")

2. Visual State Synthesis: Render the predicted textual changes into a synthesized next-state screenshot

This isn't just image generation—it's causal reasoning about software dynamics. The model learns that clicking "Send" in Outlook has structural consequences (email transmission, state transitions) that persist across visual interface variations. Training combines offline trajectory collection with lightweight RL to align textual predictions with the structural requirements of computer-using environments.

Why It Matters: World models enable test-time action search—agents can simulate "what if?" scenarios without executing actions. This transforms agent reasoning from reactive to counterfactual, from immediate to strategic.

The Practice Mirror

These theoretical advances aren't floating in academic isolation. Enterprises have been operationalizing parallel solutions, often discovering the same principles through painful production deployments.

Business Parallel 1: GUI Automation at Production Scale

UiPath's EDP Global Solutions Case Study

Source

When EDP Global Solutions (a Portuguese energy company's IT subsidiary) deployed UiPath's RPA platform across their enterprise operations, they encountered exactly the multi-platform brittleness that GUI-Owl-1.5 addresses theoretically. Over three years (2020-2023), they:

- Automated 450+ critical business processes

- Saved 200,000+ total hours

- Achieved 99.2% uptime across production bots

The parallel to GUI-Owl-1.5's hybrid data flywheel: EDP discovered they needed both simulated testing environments (for speed) and production-mirrored sandboxes (for validation). Their eventual architecture: automated unit tests in simulation, integration tests in staging, human-in-loop validation for high-stakes processes—precisely the hybrid approach GUI-Owl-1.5 encodes architecturally.

Implementation Details: EDP's automation evolved through three phases:

1. Phase 1 (Months 0-8): Single-platform RPA with extensive human oversight (matches "high transparency" phase from feedback research)

2. Phase 2 (Months 9-18): Cross-platform coordination with reduced oversight as confidence builds (adaptive verbosity pattern)

3. Phase 3 (Months 19-36): Autonomous operation with exception-only human intervention (efficiency mode)

Outcomes: The cost savings ($4.2M annually) came not from eliminating human oversight entirely, but from right-sizing it to risk levels—exactly the cost-aware exploration principle CTA formalizes.

Business Parallel 2: Cost-Aware AI Systems in Production

Anthropic's Advanced Tool Use Optimization

Source

Anthropic's engineering team documented reducing token consumption in tool-using agents by 40-60% through explicit cost-awareness—the production deployment of CTA's theoretical framework. Their challenge: tool definitions were consuming 134K tokens in some cases, making agents prohibitively expensive.

Their solution parallels CTA's calibrate-then-act pattern:

1. Calibration: Measure per-tool accuracy vs. token cost across benchmark tasks

2. Selective Invocation: Only provide tool definitions when expected utility exceeds cost threshold

3. Dynamic Context: Adjust tool verbosity based on task complexity signals

Metrics:

- 40-60% token reduction in production

- <2% accuracy degradation on benchmark tasks

- 3.2x cost efficiency improvement in agent operations

The business constraint theory doesn't capture: organizational resistance to explicit cost framing. Anthropic's internal adoption required reframing from "we're limiting tools to save money" to "we're optimizing for task efficiency." The psychological resistance to acknowledging resource constraints is a gap between theoretical elegance and human organizations.

Business Parallel 3: Adaptive Feedback in Customer AI Systems

Intercom's Evolution of Customer Service Metrics

Source

Intercom's research on how customer service metrics change in the age of AI provides empirical validation for adaptive feedback theory. Their longitudinal study (2023-2025) tracking enterprise AI assistant deployments found:

Early Deployment (Months 1-3):

- High verbosity required: "I'm searching our knowledge base... Found 3 relevant articles... Comparing pricing tiers..."

- CSAT scores highly correlated with explanation completeness (r=0.71)

- Agents with < 80% transparency had 2.3x higher escalation rates

Steady State (Months 12+):

- Verbosity becomes liability: "Just give me the answer" user sentiment

- CSAT correlation shifts to speed (r=0.68) and conciseness (r=0.54)

- Optimal transparency level: 30-40% (vs. 80%+ early phase)

Dynamic Adaptation:

- High-stakes interactions (legal, financial, health) maintain high transparency even after trust establishment

- Low-stakes interactions (navigation, informational) shift to minimal feedback

- User preference settings enable individual control: 62% of users adjust verbosity after 6+ months

Outcomes: Enterprises implementing adaptive feedback systems achieved:

- 47% reduction in average handling time

- 23% improvement in CSAT scores

- 34% decrease in agent escalations

The practice mirror validates theory: trust and efficiency aren't opposing goals when sequenced correctly.

Business Parallel 4: Evolutionary Algorithm Discovery

Google Cloud Vertex AI AutoML - Automated Hyperparameter Optimization

Source

While not identical to AlphaEvolve's multiagent algorithm discovery, Google Cloud's Vertex AI demonstrates evolutionary optimization at enterprise scale. Their AutoML system:

- Supports datasets up to 1TB with 1,000+ features

- Automatically searches architecture space, hyperparameter configurations, and data preprocessing strategies

- Achieves human-expert-level performance in 10-100x less human time

Case Study: Retail Price Optimization

An unnamed Fortune 500 retailer used Vertex AI to optimize dynamic pricing algorithms across 50,000+ SKUs. The evolutionary search discovered:

- Non-intuitive seasonality patterns (Tuesday-Thursday discount windows outperformed weekend promotions by 12%)

- Product bundling strategies that increased cart value by 18%

- Regional pricing variations that balanced margin and market share

The theoretical parallel to AlphaEvolve: human intuition about algorithmic design is systematically biased. Evolutionary search discovers strategies that make sense retrospectively but wouldn't emerge from first-principles reasoning.

Business Parallel 5: UI State Prediction for Enterprise Software

Microsoft Copilot's Predictive UI Capabilities

Source

Microsoft's Copilot deployment across Office 365 represents partial operationalization of the Computer-Using World Model concept. While not as architecturally sophisticated as CUWM's two-stage factorization, Copilot's "Work IQ" demonstrates predictive UI reasoning:

- Inference Layer: Predicts next user actions based on historical patterns and current context

- State Anticipation: Pre-loads UI components and data based on predicted future states

- Counterfactual Reasoning: Suggests alternative action paths ("Did you mean to...?")

Production Metrics:

- 29% reduction in task completion time for document workflows

- 41% improvement in data discovery for PowerBI tasks

- 19% decrease in user errors (prevented by predictive suggestions)

Implementation Challenge: The gap between CUWM's textual-then-visual factorization and production reality: legacy system constraints. Microsoft can't implement pure world-model-driven testing because:

1. Office applications have billions of state combinations

2. Enterprise customizations make universal models impractical

3. Backward compatibility requirements limit architectural changes

The business constraint: you can't rebuild the world to match theoretical elegance. You retrofit theory into systems designed before AI existed.

The Synthesis

What emerges when we view theory and practice together:

1. PATTERN: Where Theory Predicts Practice

The transparency-to-efficiency progression in adaptive feedback research directly predicts the deployment patterns we see at UiPath, Salesforce, and Intercom. Enterprises independently discovered the same three-phase adoption curve:

Phase 1: Trust Building (Months 0-6)

- High explainability overhead (80%+ transparency)

- Human-in-loop for most decisions

- Learning through observation ("show your work")

Phase 2: Confidence Accumulation (Months 6-18)

- Progressive autonomy with exception monitoring (40-60% transparency)

- Human oversight triggered by uncertainty thresholds

- Adaptive verbosity based on task complexity

Phase 3: Efficient Delegation (Months 18+)

- Minimal feedback except failures (30-40% transparency)

- Human involvement only for high-stakes or novel situations

- Trust enables speed

This isn't coincidence—it's theoretical prediction validated by practice. The "What Are You Doing?" research formalized what operators were discovering: trust is accumulated through transparency, which then purchases the right to be efficient.

The deeper pattern: all coordination systems follow this progression. New team member onboarding, supply chain integration, API partnerships—they all start with high-touch, explicit communication, then transition to low-overhead, implicit coordination once mental models align.

2. GAP: Where Practice Reveals Theoretical Limitations

Business implementations expose constraints that theoretical frameworks systematically overlook:

Organizational Politics Around Cost

The Calibrate-Then-Act framework assumes agents can freely reason about costs and make economically optimal decisions. Enterprise reality: explicit cost discussions trigger political dynamics. When Anthropic frames tool optimization as "cost reduction," it faces organizational resistance. When they frame it as "efficiency improvement," adoption accelerates.

The gap: cost-awareness requires cultural readiness that theory doesn't model. Organizations need psychological preparation for systems that make resource allocation decisions based on uncertainty quantification.

Legacy System Integration Costs

The Computer-Using World Model assumes you can build agents that simulate UI state changes. Microsoft's Copilot reality: legacy applications don't expose state in predictable ways. The theoretical elegance of CUWM's two-stage factorization encounters Office VBA macros, custom enterprise plugins, and 30-year-old COM interfaces that don't fit neat state models.

The gap: theory optimizes greenfield solutions; practice retrofits into brownfield chaos.

Human Resistance to Explicit Reasoning

Both CTA and GUI-Owl-1.5 formalize reasoning processes that humans perform intuitively. Making these processes explicit and observable creates cognitive discomfort. Users report feeling "micromanaged" by AI that narrates its reasoning too thoroughly, and "anxious" about systems that expose uncertainty too explicitly.

The gap: transparency can undermine trust when it makes visible the messiness of decision-making humans prefer to black-box. Theory assumes more information is always better. Practice reveals information has psychological costs.

3. EMERGENCE: What Neither Theory nor Practice Shows Alone

The convergence of these five themes reveals a meta-pattern invisible when examining them individually:

Sovereignty-at-Scale Becomes Computationally Tractable

For decades, coordination systems forced a false choice: either centralize control (sacrificing autonomy) or maintain sovereignty (sacrificing coordination efficiency). Hierarchies vs. markets, monoliths vs. microservices, conformity vs. chaos.

The synthesis of these five papers suggests a third path is emerging:

- GUI-Owl-1.5 demonstrates agents can preserve interface diversity (sovereignty) while coordinating across platforms (scale)

- Calibrate-Then-Act shows explicit cost-benefit reasoning enables distributed decision-making without central planning

- Adaptive Feedback proves trust can be built through transparency, then trust enables autonomy

- AlphaEvolve discovers coordination algorithms humans wouldn't design, optimizing for properties we can't intuitively navigate

- CUWM enables counterfactual reasoning, allowing agents to explore strategy space without executing actions

Together, they encode the computational prerequisites for coordination without conformity: systems that maintain individual/organizational autonomy while coordinating at unprecedented scale through explicit reasoning about trust, cost, and feedback adaptation.

This is operationalizing Elinor Ostrom's principles for managing commons at computational speed. Instead of human deliberation about shared resources, we get:

- Cost-awareness (economic sustainability without central allocation)

- Adaptive transparency (graduated sanctions and monitoring based on demonstrated trustworthiness)

- Multi-platform capability (polycentricity - multiple coordination modes simultaneously)

- Evolutionary algorithm discovery (constitutional-level rule generation without requiring consensus)

- World model counterfactuals (strategy simulation before commitment)

Why This Matters in February 2026

We're at the inflection point where these principles transition from research prototypes to production infrastructure. The next 18 months will determine whether enterprise AI architecture becomes another centralizing force (like cloud platforms captured data infrastructure) or enables genuine distributed autonomy.

The theoretical advances of February 2026 provide the specification for governance infrastructure in post-AI society. Not governance *of* AI, but governance *through* AI-mediated coordination.

Implications

For Builders:

1. Design for Trust Accumulation, Not Perpetual Transparency

- Build explicit trust state into your agent architectures

- Implement verbosity controls that adapt based on reliability history

- Don't conflate "explainable" with "trustworthy"—they're sequential, not identical

2. Make Cost-Awareness Architecturally Native

- LLM agents should reason about token economics, API latency, and opportunity costs at the policy level, not as post-hoc optimizations

- Implement uncertainty quantification that enables explicit value-of-information calculations

- Design cost-aware exploration as a primary capability, not an efficiency afterthought

3. Assume Multi-Platform Coordination from Day One

- Don't build point solutions that assume uniform environments

- Design for heterogeneous interface affordances—your agents will operate across UIs you don't control

- Invest in world models for counterfactual testing—production execution is too expensive for trial-and-error

4. Prepare for Algorithm Discovery You Don't Understand

- Build audit trails that capture *why* evolved algorithms work, not just *that* they work

- Implement semantic versioning for discovered algorithms so you can track lineage

- Develop governance frameworks for AI-discovered intellectual property

For Decision-Makers:

1. Budget for the Trust-Building Phase

- Enterprises consistently underestimate initial explainability overhead (80%+ transparency requirements)

- The ROI curve is J-shaped: costs front-loaded, returns delayed 12-18 months

- Plan for 6-12 month trust accumulation before efficiency gains materialize

2. Reframe Cost-Awareness as Capability, Not Constraint

- Organizations resist "we're limiting AI to save money" but embrace "we're optimizing for task efficiency"

- Position cost-aware reasoning as strategic capability—agents that understand economics make better decisions

- Prepare for cultural shift where systems explicitly reason about resource allocation

3. Accept the Retrofit Reality

- Theory assumes greenfield; you operate in brownfield

- Budget for integration complexity theory doesn't model (3-5x theoretical implementation time is realistic)

- Prioritize solutions that work with legacy systems over theoretically pure architectures

4. Invest in Sovereignty Infrastructure

- The coordination-without-conformity pattern is a strategic moat

- Organizations that preserve optionality while achieving coordination efficiency will dominate their sectors

- This requires different infrastructure than pure automation—you're building capability substrates, not task robots

For the Field:

The convergence visible in these five papers suggests several research frontiers:

1. Formalize Trust Accumulation Dynamics

- We need mathematical models of how transparency enables efficiency over time

- Current frameworks treat trust as binary or static; we need dynamic trust as an architectural primitive

2. Develop Compositional World Models

- CUWM demonstrates desktop software prediction; we need composition across domains

- The hard problem: world models that work across arbitrary software without per-application training

3. Economics of Algorithmic Discovery

- AlphaEvolve raises profound questions about IP, ownership, and auditability

- We need governance frameworks before discovered algorithms reach production scale

4. Bridge the Theory-Practice Gap Systematically

- This synthesis reveals theory consistently under-models organizational, psychological, and legacy system constraints

- We need research methodologies that force theoretical work to engage production reality earlier

Looking Forward

*Can systems preserve sovereignty while coordinating at scale?*

The February 2026 research convergence suggests yes—but only if we recognize this isn't a technical question. It's architectural, economic, psychological, and political simultaneously.

The theoretical frameworks emerging this month encode the infrastructure prerequisites for post-scarcity coordination. Not post-scarcity in the "unlimited resources" sense, but in the "coordination overhead doesn't scale with participant count" sense.

When agents can:

- Operate across platforms without requiring conformity (GUI-Owl-1.5)

- Make explicit cost-benefit decisions without central planning (CTA)

- Build trust through transparency, then leverage trust for efficiency (adaptive feedback)

- Discover coordination algorithms beyond human intuition (AlphaEvolve)

- Simulate counterfactuals before committing to actions (CUWM)

...then we've encoded Ostrom's principles at computational speed. We've operationalized what philosophers like Martha Nussbaum described as "capability-enabling environments"—systems that amplify human agency rather than substitute for it.

The question isn't whether this infrastructure will be built. The papers demonstrate it's already technically tractable. The question is who controls it.

Will sovereignty-at-scale become another platform moat, where a few companies offer coordination-as-a-service? Or will the principles emerging from this research seed open infrastructure that preserves genuine autonomy?

The next 18 months matter. The architecture decisions we make now—as the research transitions to production—will determine whether AI enables new forms of governance or just optimizes existing power structures.

February 2026 gave us the specification. Now we choose the implementation.

Sources

Academic Papers:

- Mobile-Agent-v3.5 (GUI-Owl-1.5): arxiv.org/abs/2602.16855

- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents: arxiv.org/abs/2602.16699

- "What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants: arxiv.org/abs/2602.15569

- Discovering Multiagent Learning Algorithms with Large Language Models: arxiv.org/abs/2602.16928

- Computer-Using World Model: arxiv.org/abs/2602.17365

Business Sources:

- UiPath EDP Global Solutions Case Study: uipath.com/resources/automation-case-studies/edp-gs

- Anthropic Advanced Tool Use: anthropic.com/engineering/advanced-tool-use

- Intercom AI Customer Service Metrics: intercom.com/blog/customer-service-metrics-ai

- Google Cloud Vertex AI: cloud.google.com/blog/products/ai-machine-learning/google-cloud-vertex-ai-tabular-workflows

- Microsoft 365 Copilot: microsoft.com/en-us/microsoft-365-copilot

Agent interface

Cluster6

Score0.600

Words3,000

arXiv0

Cluster 6 neighbors

The Capability Maturity Gap0.753 The 10-Step Ceiling0.739 When Agents Need Governors0.732 When Research Becomes Infrastructure0.717 The Convergence Moment0.703