The Limits of ‘Thinking’ AI: A Cautionary Tale for Mission-Critical Military COTS
By Buck Biblehouse, The Open Group FACE Consortium Director
In a recent white paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” Apple researchers deliver a sobering assessment of the state of Large Reasoning Models (LRMs). Their findings strike at the heart of current AI hype and raise urgent red flags for defense procurement officers and systems integrators relying on commercial off-the-shelf (COTS) AI systems for mission-critical military solutions. As mobile command systems and layered drone defenses emerge, often developed at private-sector pace and with private-sector funding, a clearer understanding of AI’s fragility under complexity is not just prudent; it is indispensable.
The Current Limitations of AI: When “Reasoning” Collapses Under Stress
Apple’s experiments pitted leading LRMs (including OpenAI’s o1/o3 series, Anthropic’s Claude 3.7 Sonnet Thinking, Google’s Gemini, and others) against classic logic puzzles—Tower of Hanoi, River Crossing, Blocks World—scaled from trivial to extremely complex scenarios. The results: a complete accuracy collapse beyond modest complexity thresholds. Crucially, even when models were provided with the correct algorithm, they failed—highlighting that their “reasoning” is largely pattern-matching, not logical inference.
Further, LRMs paradoxically decrease inference effort, reducing token usage, as problem complexity rises, suggesting they effectively “quit” rather than persist toward solutions (MarketWatch).
These findings contribute to growing skepticism around the Artificial General Intelligence trajectory, with some researchers now calling for a reconsideration of expectations.
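To make the puzzle-based evaluation described above concrete, here is a minimal, hypothetical sketch in Python of how such a harness can scale complexity and grade a model’s answer deterministically. It uses Tower of Hanoi, one of the puzzles in Apple’s study, but the solver and validator below are illustrative assumptions, not Apple’s actual benchmark code.

```python
# Illustrative sketch of a puzzle-style validation harness (not Apple's code):
# generate a reference solution at a chosen complexity, then check any
# proposed move sequence by simulating it.

def hanoi_reference(n, src=0, dst=2, aux=1, moves=None):
    """Generate the optimal Tower of Hanoi move list for n disks (2^n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_reference(n - 1, src, aux, dst, moves)   # clear the way
    moves.append((src, dst))                       # move the largest remaining disk
    hanoi_reference(n - 1, aux, dst, src, moves)   # restack on top of it
    return moves

def is_valid_solution(n, moves):
    """Simulate a proposed move sequence; every move must be legal and the puzzle solved."""
    pegs = [list(range(n, 0, -1)), [], []]         # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                           # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                           # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))        # all disks on the target peg

# Complexity scales exponentially: 3 disks need 7 moves, 10 disks need 1023.
for n in (3, 7, 10):
    assert is_valid_solution(n, hanoi_reference(n))
```

Because the optimal solution length grows as 2^n − 1, a harness like this can dial difficulty precisely, which is exactly the property that exposed the accuracy collapse as problem size increased.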
Why This Matters for Mission-Critical Military Applications
In military and intelligence systems, where lives and strategic objectives hinge on system reliability, failures caused by opaque decision-making or brittleness under novel conditions cannot be tolerated:
- High-stakes environments (e.g., counter-UAS systems, swarming drones, battlefield command) demand systems that maintain reliability under unexpected or evolving threat profiles.
- Predictability and auditability are paramount for warfighter trust. If an AI system “collapses” without traceable logic, it is far less likely to be trusted in the loop.
- Human-in-the-loop fallback must be viable even when the AI’s reasoning process is flawed or fails.
Apple’s paper reminds us that no matter how large or “intelligent” an AI model appears to be, without robust logical reasoning it cannot yet serve as the backbone of autonomous, life-critical defense systems.
Approaches to AI: Narrow vs. Broad—and Where COTS Fits In
Narrow AI (Specialized Models)
- Rule-based hybrid systems (e.g., classical symbolic reasoning plus ML classifiers) emphasize reliability in defined domains.
- Purpose-built LLMs with guardrails—trained for limited tasks such as target recognition or communication parsing—can deliver value, provided scope remains tightly defined.
- Model ensembles with fallback algorithms, combining AI outputs with deterministic systems, can safeguard against collapse (a minimal sketch of this pattern follows this list).
- These are realistic for COTS deployment: focused, optimized, and bounded in capability.
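As a minimal sketch of the ensemble-with-fallback pattern noted above, the idea is simply that a learned model’s output is accepted only when it clears a confidence bar and a rule-based plausibility check; otherwise a conservative, auditable deterministic path takes over. All names, thresholds, and the stand-in classifier below are hypothetical, not any vendor’s API.

```python
# Illustrative "ML with deterministic fallback" pattern (all values hypothetical).
from dataclasses import dataclass

@dataclass
class Track:
    speed_mps: float       # measured ground speed
    altitude_m: float      # measured altitude
    rcs_dbsm: float        # radar cross-section estimate

def ml_classify(track: Track) -> tuple[str, float]:
    """Stand-in for a trained classifier; returns (label, confidence)."""
    # A real system would call a deployed model here; this is a placeholder.
    return ("small_uas", 0.62)

def rule_based_classify(track: Track) -> str:
    """Deterministic fallback: conservative, auditable thresholds."""
    if track.speed_mps < 40 and track.altitude_m < 500 and track.rcs_dbsm < 0.1:
        return "small_uas"
    return "unknown"

def plausible(label: str, track: Track) -> bool:
    """Physics-based sanity check on the ML output."""
    if label == "small_uas" and track.speed_mps > 120:
        return False            # far too fast for a small UAS
    return True

def classify(track: Track, min_confidence: float = 0.8) -> tuple[str, str]:
    label, conf = ml_classify(track)
    if conf >= min_confidence and plausible(label, track):
        return label, "ml"
    return rule_based_classify(track), "fallback"   # deterministic path, tagged for audit

print(classify(Track(speed_mps=25.0, altitude_m=120.0, rcs_dbsm=0.05)))
# -> ('small_uas', 'fallback')  because the stand-in model's confidence is below threshold
```

The design choice that matters here is that the deterministic path, not the ML model, is the default whenever confidence or plausibility is in doubt, and every fallback decision is tagged so it can be audited later.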
Broad AI (LRMs/LLMs with “Thinking”)
Attractive for their generalizability, but, as Apple’s study reveals, susceptible to collapse under novel complexity, and their apparent “thinking” can be deceptive. Scaling up model size or training data does not guarantee robustness on logic tasks. Until fundamentally more robust architectures emerge, their role in mission-critical systems should be cautious and tightly scoped.
Leveraging Advanced Processors: A Game Changer for Military-Grade AI
The hardware enabling AI inference matters. On-device or edge processors must deliver:
- Deterministic behavior, avoiding variability due to computational fluctuations.
- Low latency and high throughput under field conditions.
- Hybrid computing stacks, combining CPU, GPU, and AI accelerators, to support both classical and ML methods.
Apple’s on-device models running on its M-series chips showcase privacy and efficiency for consumer devices, but critics argue that such hardware is not suited to large-scale military model training or inference. Indeed, some experts suggest Apple’s negative findings may stem partly from its limited access to server-grade hardware.
For military purposes, COTS systems must integrate:
- Server-grade accelerators or purpose-built AI SoCs that support quantized inference, high bandwidth, and real-time decision loops.
- Redundant compute paths to mitigate single points of failure in edge AI inference (see the sketch after this list).
- Crypto-hardened firmware to prevent adversarial exploitation of AI vulnerabilities.
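A rough illustration of the redundant-compute-path idea from the list above: the same input is scored on independent inference paths and accepted only on a strict majority vote, so a single failed or corrupted path cannot silently decide the outcome. The structure below is a simplified assumption, not a description of any fielded system.

```python
# Illustrative 2-of-3 voting across independent inference paths (hypothetical setup).
from collections import Counter
from typing import Callable, Optional

def redundant_infer(paths: list[Callable[[bytes], str]], frame: bytes) -> Optional[str]:
    """Run each independent inference path; accept a result only on a strict majority."""
    results = []
    for run in paths:
        try:
            results.append(run(frame))
        except Exception:
            continue                     # a failed path simply casts no vote
    if not results:
        return None                      # no path produced output; caller falls back
    label, votes = Counter(results).most_common(1)[0]
    return label if votes * 2 > len(paths) else None   # strict majority of all paths

# Illustrative paths: in practice these would run on separate accelerators or processes.
paths = [lambda f: "uas", lambda f: "uas", lambda f: "bird"]
print(redundant_infer(paths, b"\x00"))   # -> 'uas' (2 of 3 agree)
```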
Anduril’s Perspective: Responsible and Agile AI for Defense
While no specific statements from Anduril’s CTO are publicly documented, the company’s leadership, including CEO Brian Schimpf and founder Palmer Luckey, reflects a clear philosophy on military AI:
In a public announcement, Schimpf described Anduril’s partnership with OpenAI as a way to harness cutting-edge AI to fill urgent air-defense gaps and support faster, more accurate decisions, while maintaining responsibility and oversight.
Anduril’s platform, Lattice, fuses multi-sensor data into AI-powered, real-time situational awareness, blending ML and domain-specific logic for autonomy (Army Recognition).
Meta’s CTO, whose company has partnered with Anduril, emphasizes dual-use XR technology that runs on commercial components and can be delivered rapidly to the battlefield: another example of COTS innovation applied to military contexts.
These statements reinforce a clear thesis: Build defense-grade AI using commercially mature technologies, combine them with rigorous oversight, and iterate rapidly, rather than gamble on unproven, general-purpose “reasoning” systems.
Conclusion & Recommendations
- Choose narrow-task AI for mission-critical systems. Build systems that combine deterministic logic with machine learning, where it adds value, while ensuring fallback and validation.
- Validate “reasoning” under complexity. Use controlled puzzle-style validation frameworks, akin to Apple’s, before deploying AI in high-risk environments.
- Invest in defense-grade compute infrastructure. Edge and core processing units must support robustness, redundancy, and secure, real-time execution of AI plus classical logic.
- Embrace modular, updateable COTS architectures. Anduril’s open architectures (Fury, Lattice), continuous integration workflows, and XR/Mobile components reflect a future-forward procurement strategy—commercial in innovation, resilient in deployment.
- Ensure human-in-the-loop and oversight. AI should augment operators—not replace them. Oversight, auditability, and clarity in decision logic are essential foundations.
In sum, Apple’s white paper serves as a technological reality check: AI’s apparent “reasoning” is often brittle and superficial. Military COTS integrators must heed this, building layered, modular systems where AI is a supporting actor rather than the star, underpinned by deterministic logic, hardware reliability, and transparent human oversight. Anduril’s approach, leveraging commercial innovation within rigorous defense contexts, offers a compelling model for how to proceed with AI in settings where failure is not an option.