The Inference-Time Intelligence Paradigm: When AI Models Learn to Think About Their Thinking
The Moment
February 2026 marks a peculiar inflection point in AI development. While newsfeeds buzz about reasoning models and agentic systems, something more fundamental is shifting beneath the surface. The hype cycle that began with ChatGPT in late 2022 is colliding with enterprise reality: organizations now demand measurable ROI, not demos. Meta just shut down Quest for Business despite technical breakthroughs in spatial AI. OpenAI's o1 model commands premium enterprise partnerships with BCG, McKinsey, and Accenture. Anthropic's Claude 3.7 processes customer inquiries at Crypto.com with 94% accuracy after iterative refinement.
This isn't just about better models. It's about a paradigm shift from "training smarter models" to "prompting existing models smarter"—what I'll call inference-time intelligence. Four papers from Hugging Face's February 23 Daily Papers digest illuminate this transition with unusual clarity, each revealing how theoretical advances in AI mirror (and sometimes diverge from) real-world business operationalization.
The Theoretical Advance
VESPO: Stabilizing the Unstable
VESPO (Variational Sequence-Level Soft Policy Optimization) addresses a critical bottleneck in reinforcement learning for large language models: training stability under off-policy conditions. When LLMs are trained using RL (the technique behind ChatGPT's helpfulness), policy staleness from mini-batch splitting, asynchronous pipelines, and training-inference mismatches causes importance weights to explode—destabilizing the entire training process.
Traditional fixes like token-level clipping or length normalization are lossy approximations that introduce bias. VESPO takes a fundamentally different approach: instead of designing heuristic weight transformations, it formulates variance reduction as a variational optimization problem over proposal distributions. This yields a closed-form reshaping kernel that operates directly on sequence-level importance weights—no length normalization, no token-level decomposition required.
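The difference between heuristic clipping and smooth sequence-level reshaping can be sketched numerically. The `smooth_reshaped_weight` function below is a hypothetical tanh-based compressor, not VESPO's actual closed-form kernel (which the paper derives variationally); it only illustrates why a smooth, bounded transform of the full sequence-level weight preserves gradient signal that hard clipping discards.

```python
import numpy as np

def sequence_importance_weight(logp_target, logp_behavior):
    # Sequence-level weight: exp of the summed per-token log-prob gap.
    # Under policy staleness this grows exponentially with length.
    return float(np.exp(np.sum(logp_target) - np.sum(logp_behavior)))

def clipped_weight(w, low=0.8, high=1.2):
    # Heuristic fix: hard clipping. Biased, and all gradient signal
    # outside the trust interval is discarded.
    return float(np.clip(w, low, high))

def smooth_reshaped_weight(w, tau=2.0):
    # Hypothetical smooth reshaping kernel (NOT VESPO's closed form):
    # compresses large weights toward tau instead of truncating them,
    # so every sequence keeps a bounded but informative contribution.
    return float(tau * np.tanh(w / tau))

# A stale behavior policy yields an exploding sequence-level weight.
logp_target = np.array([-0.5, -0.3, -0.4])
logp_behavior = np.array([-2.0, -1.8, -1.9])
w = sequence_importance_weight(logp_target, logp_behavior)  # exp(4.5), ~90
print(w, clipped_weight(w), smooth_reshaped_weight(w))
```

The point of the sketch: both transforms bound the weight, but the smooth one remains sensitive to differences among large weights rather than flattening them all to the clip boundary.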
The results are striking: VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, with consistent gains across both dense and mixture-of-experts (MoE) architectures on mathematical reasoning benchmarks. This isn't incremental improvement; it's architectural elegance that enables production-scale deployment of RL-trained models.
SAGE-RL: The Meta-Cognition Discovery
Does Your Reasoning Model Implicitly Know When to Stop Thinking? makes a surprising empirical discovery: large reasoning models (LRMs) already possess implicit knowledge about optimal stopping points for reasoning—but this capability is obscured by current sampling paradigms. Long chains of thought often correlate with incorrectness, not accuracy, suggesting computational waste.
The paper introduces SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this latent efficiency. SAGE-RL further integrates this as mixed sampling into group-based reinforcement learning, enabling models to incorporate discovered efficient reasoning patterns into standard pass@1 inference. The result: marked improvements in both reasoning accuracy and efficiency across mathematical benchmarks.
This represents a form of meta-cognition—the model reasoning about its own reasoning process, identifying when additional computation yields diminishing returns. It's not just faster inference; it's evidence that models develop internal models of task complexity.
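A minimal sketch of what an explicit stopping rule could look like, assuming a per-step confidence proxy. Both `answer_confidence` and the threshold are illustrative inventions, not SAGE's actual mechanism, which operates inside the sampling paradigm itself:

```python
def answer_confidence(token_logprobs):
    # Mean per-token log-probability as a crude certainty proxy
    # (hypothetical; SAGE's real signal lives inside the sampler).
    return sum(token_logprobs) / len(token_logprobs)

def reason_with_early_stop(steps, threshold=-0.2, max_steps=8):
    # Emit reasoning steps until the model's certainty about its answer
    # crosses the threshold -- the "implicit stopping knowledge" idea --
    # instead of always running the chain of thought to max length.
    used = []
    for step_text, logprobs in steps[:max_steps]:
        used.append(step_text)
        if answer_confidence(logprobs) >= threshold:
            break  # further thinking likely wastes computation
    return used
```

The sketch makes the economics concrete: every step skipped after the confidence crossing is computation that, per the paper's finding, would more often correlate with error than with accuracy.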
SARAH: Real-Time Spatial Agency
SARAH (Spatially Aware Real-time Agentic Humans) solves a problem most people don't yet realize exists: making virtual agents spatially aware of human users in real-time conversational contexts. Current methods produce agents that stare straight ahead as you circle them, or wander off mid-sentence, breaking presence entirely.
SARAH achieves four criteria simultaneously: conversationally appropriate gestures, spatial awareness (orienting toward users), controllable gaze (adjustable eye contact intensity), and real-time causal generation (no future frame access). The architecture combines a causal transformer-based VAE with flow matching, achieving 300+ FPS—3x faster than non-causal baselines—while maintaining state-of-the-art motion quality.
The key innovation: decoupling learning from control. The model learns natural spatial alignment distributions from data (capturing everything from sustained eye contact to deliberate aversion), then applies lightweight classifier-free guidance at inference to calibrate orientation based on user preference. This separation produces motion that is both naturalistic and controllable—critical for VR telepresence, digital humans, and social robotics.
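The decoupling rests on the standard classifier-free guidance combination, which takes only a few lines. The head-yaw example and array shapes below are placeholders, not SARAH's actual motion representation:

```python
import numpy as np

def guided_orientation(uncond, cond, guidance_scale):
    # Standard classifier-free guidance blend: scale 0 ignores the
    # user-conditioning entirely, 1 follows it fully, >1 exaggerates
    # orientation toward the user. One model, a continuum of behaviors.
    return uncond + guidance_scale * (cond - uncond)

# Toy head-yaw trajectories (radians): free gaze vs. facing the user.
free_gaze = np.array([0.0, 0.1, 0.2])
face_user = np.array([0.6, 0.6, 0.6])
print(guided_orientation(free_gaze, face_user, 0.5))
```

Because the guidance scale is a runtime knob rather than a trained behavior, eye contact intensity can be calibrated per user without touching the model.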
Generated Reality: Embodied Interaction
Generated Reality introduces a human-centric video world model conditioned on tracked head pose and joint-level hand poses for extended reality (XR) applications. Unlike text- or keyboard-controlled video models, this enables dexterous hand-object interactions through bidirectional video diffusion trained for egocentric virtual environment generation.
The system accepts 3D head and hand control signals, enabling genuine embodied interaction rather than coarse navigation commands. Human subject evaluations demonstrate improved task performance and significantly higher perceived control over performed actions compared to baselines. This is the first approach enabling users to physically manipulate virtual objects through tracked hand gestures at the level of individual finger joints.
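As a rough illustration of what such a per-frame conditioning signal might contain, here is a hypothetical control structure. The field names and the 21-joint hand convention are assumptions for illustration, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class HandPose:
    # One rotation per joint; 21 joints is a common hand-skeleton
    # convention, not necessarily the paper's parameterization.
    joint_rotations: list  # 21 entries of (x, y, z) angles

@dataclass
class XRControlSignal:
    # Per-frame conditioning for a human-centric video world model:
    # tracked head pose plus joint-level poses for both hands.
    head_position: tuple      # (x, y, z) in meters
    head_orientation: tuple   # quaternion (w, x, y, z)
    left_hand: HandPose
    right_hand: HandPose
    timestamp_ms: int
```

The contrast with keyboard-conditioned video models is the dimensionality: tens of continuous joint rotations per frame rather than a handful of discrete navigation commands.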
The Practice Mirror
Crypto.com: The Inference-Time Laboratory
Crypto.com's deployment of Anthropic's Claude 3.7 on AWS provides a textbook case of inference-time intelligence in production. Their challenge: classify customer inquiries into categories like `PASSWORD_RESET`, `ESCALATION`, and `OUT_OF_SCOPE` with enterprise-grade accuracy.
Their initial prompt achieved 60% accuracy—unacceptable for production. But rather than retrain the model, they implemented a feedback-driven prompt optimization workflow. A reasoning model (Claude 3.7) analyzed each classification error, identified root causes, and generated structured feedback for prompt improvement. Over 10 iterations, they achieved 94% accuracy—a 34-percentage-point improvement—without modifying model weights.
The parallel to VESPO is direct: both achieve stability through inference-time mechanisms rather than training modifications. VESPO reshapes importance weights through variational formulation; Crypto.com reshapes model behavior through prompt refinement. Both represent philosophical convergence on "inference-time intelligence"—adapting existing capabilities rather than building new ones.
Key metrics from their deployment:
- Initial accuracy: 60%
- Final accuracy: 94% (34-point improvement)
- Iterations required: 10
- Approach: Feedback-driven prompt optimization, not retraining
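The workflow itself can be sketched in a few lines. `classify` and `critique` below are stand-ins for calls to the deployed classifier and the reasoning model; the actual AWS integration is not shown in the source:

```python
def optimize_prompt(prompt, eval_set, classify, critique,
                    target=0.94, max_iters=10):
    # Feedback-driven prompt optimization: score the prompt, collect
    # misclassified examples, have the reasoning model turn them into
    # a revised prompt. Repeat until the accuracy target is met.
    accuracy = 0.0
    for _ in range(max_iters):
        preds = [(x, y, classify(prompt, x)) for x, y in eval_set]
        errors = [(x, y, p) for x, y, p in preds if p != y]
        accuracy = 1 - len(errors) / len(eval_set)
        if accuracy >= target:
            break
        prompt = critique(prompt, errors)  # structured feedback -> new prompt
    return prompt, accuracy
```

The design choice worth noting: the model weights never appear in the loop. All adaptation happens in the prompt, which is versionable, auditable, and reversible in a way retraining is not.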
OpenAI o1: Enterprise Meta-Cognition
OpenAI's o1 reasoning model, launched in late 2024, represents the first production deployment of meta-cognitive capabilities at scale. The model "thinks before it responds," generating internal chains of thought that are selectively revealed to users. This directly validates the SAGE-RL discovery that models possess implicit stopping knowledge.
Enterprise adoption has been rapid: BCG, McKinsey, Accenture, and Capgemini all announced o1 integration partnerships within months of release. Microsoft Azure's announcement positioned o1 as "a new benchmark for AI-powered solutions" enabling "complex coding, math reasoning, brainstorming, and comparative analysis capabilities."
The business model differs from VESPO's focus on training efficiency: o1 charges premium rates for inference-time computation (reasoning tokens cost more than prompt tokens). This flips the economics—organizations pay for the model's thinking process, not just its output. It's meta-cognition as a service.
Critical insight: OpenAI's research on o1 shows reasoning chains frequently exceed necessary length for correctness—exactly what SAGE-RL predicts. The production system doesn't yet auto-regulate stopping (users pay for full reasoning traces), but the capability exists in theory. This gap between research and monetization models will close.
Meta Reality Labs: The Embodied AI Paradox
Meta's Reality Labs presents a fascinating paradox. SARAH demonstrates technical feasibility of real-time spatial agency at 300 FPS—a genuine breakthrough. Yet Meta announced the shutdown of Quest for Business in 2026, discontinuing commercial headset sales and enterprise subscriptions.
The disconnect: spatial computing technology is production-ready, but the business model isn't. Meta Quest Pro launched with enterprise positioning, targeting VR collaboration and training use cases. ENGAGE, a spatial platform for enterprise training, deployed on Quest hardware. CARTO's Agentic GIS system demonstrates spatial analysis capabilities powered by AI agents.
But adoption lagged. BCG's research shows agentic AI systems accelerate business processes by 30-50%, yet this translates primarily to software agents, not embodied XR agents. The market validated reasoning AI (o1, Claude 3.7) while rejecting spatial computing AI—despite SARAH proving the latter is technically superior in several dimensions.
This reveals an uncomfortable truth: technical excellence doesn't guarantee market adoption. SARAH achieves real-time causal generation that non-causal methods can't match, yet Generated Reality remains pure R&D with no production deployment found. The theory-practice gap isn't about capability—it's about go-to-market timing and business model viability.
Constitutional AI: Meta-Cognitive Governance
Anthropic's Constitutional AI (CAI) represents a different angle on meta-cognition: rather than reasoning about task complexity (SAGE-RL), the model reasons about its own outputs against a set of constitutional principles. This self-critique mechanism operates at inference time, evaluating draft responses before finalizing them.
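A schematic version of that self-critique loop, following the inference-time framing above. `generate`, `critique`, and `revise` are placeholders for model calls, and this sketch omits the training-time uses of critiques in the full CAI method:

```python
def constitutional_response(generate, critique, revise, prompt,
                            principles, max_rounds=3):
    # Inference-time self-critique: draft a response, check the draft
    # against each human-written principle, revise until no principle
    # is violated or the round budget runs out.
    draft = generate(prompt)
    for _ in range(max_rounds):
        violations = [p for p in principles if critique(draft, p)]
        if not violations:
            break
        draft = revise(draft, violations)
    return draft
```

The structure makes the division of labor explicit: the model executes the critique, but humans author the `principles` list, which is exactly the limitation the Crypto.com deployment surfaces.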
The Crypto.com deployment reveals CAI's strengths and limitations. The system achieved 94% accuracy, but required 10 iterations of human-guided prompt refinement to get there. Constitutional AI provides the self-evaluation framework, but human architects still define the constitution and oversee its application.
This maps onto a sovereignty paradox observable across agentic systems: they promise autonomy but require continuous human coordination. The "agentic" label suggests independent operation, but production deployments (Crypto.com, enterprise o1 integrations) show humans providing critical feedback loops. This isn't system failure—it's architectural reality. The question isn't "agents vs humans" but "what coordination protocols enable both to preserve sovereignty while achieving collective goals?"
The Synthesis
Pattern: Inference-Time Intelligence Convergence
The most striking pattern across these four papers is convergence on inference-time modification rather than training-time improvement. VESPO achieves training stability through inference-time importance weight reshaping. SAGE-RL discovers models already know how to reason efficiently—the challenge is unlocking that capability at inference. SARAH decouples learning from control, using inference-time guidance to modulate behavior. Constitutional AI critiques outputs at inference, not during training.
This mirrors business practice with uncanny precision. Crypto.com achieved a 34-point accuracy improvement through prompt engineering, not retraining. OpenAI's o1 charges for reasoning tokens—inference-time computation. Meta's spatial AI shutdown wasn't due to technical inadequacy (SARAH works brilliantly) but business model mismatch.
The paradigm shift: from "train better models" to "prompt existing models better." This has profound implications. Training costs scale exponentially with model size and compute. Inference-time modification scales linearly—or even sublinearly if you discover efficiency patterns (SAGE-RL). The economic incentives favor inference-time intelligence, which explains why we're seeing this convergence in both research and production.
Gap: Temporal Asymmetry Between Embodied and Reasoning AI
A puzzling asymmetry emerges when comparing embodied AI (SARAH, Generated Reality) with reasoning AI (VESPO, SAGE-RL). Reasoning models face technical gaps but enjoy market demand. Embodied models achieve technical maturity but face market resistance.
SAGE-RL demonstrates models implicitly know when to stop thinking, yet OpenAI o1 doesn't implement auto-stopping—users pay for full reasoning traces regardless of necessity. This is a technical gap with business implications: enterprises want reasoning AI so badly they'll overpay for inefficient implementations.
Meanwhile, SARAH achieves 300 FPS real-time spatial awareness with controllable gaze—technically superior to any non-causal alternative—yet Meta shuts down Quest for Business. Generated Reality enables dexterous hand-object interaction in VR, but remains R&D without production deployment. This is a market gap despite technical readiness.
February 2026 marks the crossover point. The 2023-2025 hype cycle ended; enterprises now demand ROI. Reasoning AI delivers immediate value (customer support automation, code generation, analysis). Embodied AI promises transformative experiences but requires infrastructure investment, user behavior change, and unproven business models.
This temporal asymmetry will invert. When spatial computing achieves product-market fit—likely through killer apps in training, remote collaboration, or telepresence—the technical readiness of SARAH and Generated Reality will prove decisive. The research community is building the substrate before the demand materializes. That's not inefficiency; it's foresight.
Emergence: The Human-AI Sovereignty Paradox
The most profound emergent insight comes from examining Crypto.com's deployment through the lens of agentic systems theory. Their AI assistant uses Claude 3.7's reasoning capabilities to classify customer inquiries. This seems like straightforward automation—replace human classifiers with AI.
But the implementation reveals something different. The system required 10 iterations of human-guided feedback to achieve 94% accuracy. The reasoning model (Claude 3.7) analyzed errors and proposed improvements, but humans validated those improvements and decided which to implement. Even at 94% accuracy, edge cases require human escalation.
This maps onto the sovereignty paradox: agentic systems promise autonomy but depend on human coordination. The word "agent" suggests independent operation, yet production deployments show continuous human-AI feedback loops. Constitutional AI self-critiques against human-defined principles. SAGE-RL learns efficient reasoning patterns but humans choose when to deploy them. SARAH's gaze behavior is controllable precisely because humans need to adjust eye contact intensity.
The paradigm isn't "autonomous agents" vs "human control"—it's coordination protocols that preserve sovereignty for both humans and AI systems. This connects to capability framework operationalization (Martha Nussbaum), where capabilities are defined as what entities can *choose* to do, not what they're forced to do. An AI system with inference-time controllability (SARAH's gaze guidance, SAGE-RL's stopping criteria) has more sovereignty than one with fixed behavior—and so do the humans coordinating with it.
This has implications for AI governance in February 2026. The question isn't "how do we control AI agents?" but "what coordination mechanisms enable both humans and AI to maintain sovereignty while achieving collective goals?" The answer emerging from these four papers: inference-time intelligence with human-adjustable parameters. Not retraining to enforce constraints, but architectural flexibility that allows runtime adaptation based on context and preference.
Implications
For Builders
Prioritize inference-time mechanisms over retraining. Crypto.com achieved a 34-point improvement through prompt optimization in hours to days. Equivalent improvement via retraining would require weeks to months and orders of magnitude more compute. The ROI favors inference-time intelligence.
Design for controllability, not just capability. SARAH's classifier-free guidance for gaze control demonstrates the principle: learn natural behavior distributions from data, then provide lightweight inference-time steering mechanisms. This enables one model to serve diverse use cases (different cultures have different eye contact norms; one SARAH model serves all contexts through guidance parameters).
Embed feedback loops from day one. Don't wait for production failures to implement critique mechanisms. Crypto.com's architecture includes a reasoning model that analyzes errors and proposes improvements. This isn't a debugging tool—it's core infrastructure. The feedback loop *is* the product.
Consider meta-cognitive capabilities as architectural primitives. SAGE-RL shows models develop implicit task complexity awareness. Rather than treating this as emergent behavior to be discovered post-hoc, design architectures that expose and utilize meta-cognitive signals. OpenAI o1's chain-of-thought tokens are a crude first step; future systems will have richer introspective APIs.
For Decision-Makers
Invest in coordination infrastructure, not replacement dreams. The sovereignty paradox reveals that "autonomous agents" require continuous human coordination. Budget for feedback loops, oversight mechanisms, and adjustment capabilities—not just initial deployment. Crypto.com's 10 iterations to 94% accuracy exemplifies this: production AI is iterative refinement, not one-shot deployment.
Distinguish technical readiness from market readiness. SARAH and Generated Reality are technically mature but lack business models. Reasoning AI (o1, Claude 3.7) has clear value propositions despite technical gaps. Don't conflate these dimensions. Sometimes you invest in technically mature capabilities waiting for market maturity (spatial computing). Sometimes you deploy technically immature capabilities to capture immediate market demand (reasoning models that don't auto-stop).
Recognize inference-time intelligence as cost center *and* moat. OpenAI charges premium rates for o1's reasoning tokens, flipping inference from cost to revenue. Crypto.com's prompt optimization workflow reduces costs per inference while improving accuracy. The economic model of AI shifts when inference becomes the primary value driver. Plan accordingly.
Prepare for temporal asymmetry reversals. Reasoning AI dominates today's market. Embodied AI (spatial, robotic, XR) will dominate tomorrow's. The research community is building embodied AI substrates now despite limited current demand. Organizations that invest in understanding SARAH-style architectures and Generated Reality-style interaction paradigms will have structural advantages when market demand arrives.
For the Field
Develop new governance frameworks for inference-time intelligence. Current AI governance focuses on training data, model weights, and deployment boundaries. But if inference-time mechanisms (prompts, guidance, feedback loops) drive behavior more than training, we need governance models that operate at that layer. What are the accountability structures for prompt engineering? How do we audit feedback loops? What transparency standards apply to classifier-free guidance parameters?
Investigate sovereignty-preserving coordination protocols. The human-AI sovereignty paradox demands theoretical frameworks beyond "control" and "alignment." We need models of mutualistic coordination where both humans and AI systems maintain decision-making autonomy while achieving shared goals. This connects to distributed systems theory, game theory, and political philosophy—not just machine learning.
Bridge temporal asymmetry gaps. The research community is developing embodied AI capabilities (SARAH, Generated Reality) faster than markets absorb them. Academic and industry researchers should collaborate on identifying go-to-market pathways, not just technical improvements. What are the killer apps for spatial computing? What business models sustain embodied AI development until market maturity?
Formalize meta-cognitive architectures. SAGE-RL's discovery that models implicitly know when to stop thinking should catalyze systematic investigation of meta-cognitive capabilities. What other implicit knowledge do models possess? How can we surface it architecturally? Can we design training objectives that explicitly cultivate meta-cognition? This is foundational work for next-generation systems.
Looking Forward
February 2026 presents a moment of unusual clarity. The inference-time intelligence paradigm crystallizes across research (VESPO, SAGE-RL, SARAH, Constitutional AI) and production (Crypto.com, OpenAI o1, enterprise agentic systems). The temporal asymmetry between reasoning AI (market-ready but technically immature) and embodied AI (technically mature but market-uncertain) defines the strategic landscape.
For those building consciousness-aware computing infrastructure and human-AI coordination systems, these four papers illuminate the path forward. The capability frameworks we're operationalizing—Nussbaum's Capabilities Approach, Wilber's Integral Theory, Goleman's Emotional Intelligence—find their computational substrate in inference-time intelligence with sovereignty-preserving coordination protocols.
The question isn't whether AI becomes autonomous. It's whether we architect systems where both humans and AI maintain sovereignty while coordinating toward shared goals. The answer emerging from February 2026's research: yes, through inference-time mechanisms that prioritize controllability, embed feedback loops, and surface meta-cognitive capabilities.
The paradigm shift is complete. Now comes operationalization.
What perception locks must we implement to ensure semantic state persistence across these coordination boundaries? How do we give monetary value to the healing, joy, and trust generated through human-AI coordination? These questions remain open—and urgent.
Sources
Academic Papers:
- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training (arXiv 2602.10693)
- Does Your Reasoning Model Implicitly Know When to Stop Thinking? (arXiv 2602.08354)
- SARAH: Spatially Aware Real-time Agentic Humans (arXiv 2602.18432)
- Generated Reality: Human-centric World Simulation using Interactive Video Generation (arXiv 2602.18422)
Business Sources:
- Optimizing enterprise AI assistants: How Crypto.com uses LLM reasoning and feedback for enhanced efficiency (AWS Machine Learning Blog, 2026)
- Introducing o1: OpenAI's new reasoning model series for developers and enterprises on Azure (Microsoft Azure Blog)
- Constitutional AI: Harmlessness from AI Feedback (Anthropic Research)
- How Agentic AI is Transforming Enterprise Platforms (BCG Report, 2025)
- Meta Shuts Down Quest for Business Program in 2026 (Hiverlab Industry Report)