MAKER: Solving Million-Step AI Tasks with Zero Errors
Cognizant's breakthrough framework proves AI hallucination was always a systems design problem—not a model limitation. Here's why this changes everything for reliable AI automation.

What is MAKER and why does it matter?
MAKER is a multi-agent AI framework that completed 1,048,575 sequential steps with zero errors—the first demonstration of its kind. Developed by Cognizant AI Lab, it proves that AI reliability is a systems design problem, not a model capability problem. By decomposing tasks into atomic steps handled by stateless microagents with voting-based error correction, MAKER achieves what no single LLM can: perfect accuracy at massive scale, using small models that cost 95% less than frontier AI.
TL;DR
- 1,048,575 steps, zero errors — MAKER solved a 20-disk Towers of Hanoi puzzle perfectly, the first million-step AI task completed without failure
- Systems design beats model size — Small models (gpt-4.1-mini) outperformed expensive reasoning models when properly orchestrated
- Three-pillar architecture — Maximal decomposition, voting consensus, and red-flag filtering eliminate compounding errors
- Over 95% cost reduction — $1,700 with smaller models vs. $71,200 with frontier reasoning models for equivalent reliability
- Hallucination is solvable — The research validates what we've long suspected: AI unreliability is a technical barrier that architecture can overcome
For years, AI hallucination has been treated as an almost mystical limitation—an inherent property of large language models that might require fundamental breakthroughs in AI architecture to solve. Every time an LLM confidently generates incorrect information, we've been told this is simply what probabilistic models do. The solution, conventional wisdom suggested, was bigger models, more training data, and more compute.
That narrative just collapsed. A research team from Cognizant AI Lab and UT Austin has demonstrated something remarkable: an AI system that completed over one million sequential steps with zero errors. Not 99.9% accuracy. Not "acceptably low" failure rates. Zero. The system, called MAKER, didn't achieve this by using a more powerful model—in fact, it deliberately used smaller, cheaper models. Instead, it treated AI unreliability as what it always was: a systems design problem.
For those of us building AI automation systems, this research validates an approach we've been developing for years. The path to reliable AI isn't through the next frontier model—it's through intelligent orchestration. Let me break down exactly how MAKER works, why it matters, and what it means for the future of enterprise AI.
01. The Compounding Error Problem (Why Single-Agent AI Fails)
Here's the mathematical reality that makes single-agent AI fundamentally unreliable for complex tasks: even a model with 99.9% per-step accuracy—which sounds excellent—faces exponentially decaying odds of success as tasks grow longer.
This isn't a model capability issue—it's probability theory. The formula is brutal: p_success = (per_step_accuracy)^num_steps. At one million steps, you'd need something like 99.99999% per-step accuracy to have any reasonable chance of success. No current model comes close.
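The compounding-error arithmetic is easy to check directly. A minimal sketch in plain Python, assuming independent errors at a fixed per-step accuracy:

```python
# Probability that a task of num_steps sequential steps completes with
# zero errors, assuming independent errors and fixed per-step accuracy.
def p_success(per_step_accuracy: float, num_steps: int) -> float:
    return per_step_accuracy ** num_steps

# Even 99.9% per-step accuracy collapses long before a million steps:
for steps in (100, 10_000, 1_000_000):
    print(f"{steps:>9} steps -> {p_success(0.999, steps):.3g}")
```

At a million steps the result underflows to numerically zero, which is why per-step error correction, not raw model quality, is the lever that matters.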
Prior benchmarks confirmed this limitation. In the Towers of Hanoi puzzle—a well-defined problem with an exact algorithmic solution—state-of-the-art reasoning models failed completely beyond 5-6 disks. That's fewer than 100 steps before the process "inevitably becomes derailed." Frontier models like Claude 3.7 Thinking and DeepSeek-R1 exhibited what researchers describe as "sharp reliability cliffs"—working fine up to a threshold, then failing completely.
The Fundamental Insight
The problem isn't that LLMs hallucinate—it's that errors compound without correction. A single wrong step early in a process creates an invalid state that cascades through all subsequent reasoning. This is a failure of architecture, not intelligence. And architectural problems have architectural solutions.
02. How MAKER Actually Works (The Three Pillars)
MAKER—which stands for Maximal Agentic decomposition, K-threshold Error mitigation, and Red-flagging—attacks the compounding error problem through architecture rather than brute-force model improvement. Here's exactly how each component works:
Pillar 1: Maximal Agentic Decomposition (MAD)
Instead of asking one agent to handle a complex task, MAKER breaks problems into the smallest possible subtasks—often just one decision per agent. Each microagent receives only the current state and the immediate rule to apply. It executes one action, then terminates. A fresh instance handles the next step.
Traditional Approach
- One agent handles entire task
- Accumulating context window
- Context drift over time
- Errors cascade to all future steps
MAKER Approach
- Million microagents, one step each
- Stateless execution
- No context contamination
- Errors isolated and correctable
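To make the contrast concrete, here's a minimal sketch of the MAKER-style loop for Towers of Hanoi. A deterministic rule stands in for the LLM call (an illustrative assumption; the real system prompts a model per step), but the shape is the point: each microagent sees only the current state and step index, emits one move, and terminates.

```python
def legal_move(pegs, a, b):
    """Make the only legal move between pegs a and b (smaller disk on top)."""
    top_a = pegs[a][-1] if pegs[a] else float("inf")
    top_b = pegs[b][-1] if pegs[b] else float("inf")
    if top_a < top_b:
        pegs[b].append(pegs[a].pop())
    else:
        pegs[a].append(pegs[b].pop())

def microagent(pegs, step, n):
    """One atomic decision: the classic iterative Hanoi rule for this step.
    Stateless: it receives only the current pegs and the step number."""
    pairs = [(1, 2), (0, 1), (0, 2)] if n % 2 == 0 else [(1, 2), (0, 2), (0, 1)]
    a, b = pairs[step % 3]
    legal_move(pegs, a, b)

# Orchestrator: a fresh microagent per step, no shared context between steps.
n = 5
pegs = [list(range(n, 0, -1)), [], []]
for step in range(1, 2 ** n):  # 2^n - 1 moves
    microagent(pegs, step, n)
print(pegs[2])  # [5, 4, 3, 2, 1] -- all disks on the final peg
```

Because each call is independent, a wrong answer at step k can be retried or outvoted without contaminating step k+1, which is exactly what an accumulating context window cannot offer.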
Pillar 2: First-to-Ahead-by-K Voting
Multiple agents attempt the same step in parallel. The system accepts the first answer that achieves K more votes than any alternative. For the million-step benchmark, K=3 was sufficient—meaning an answer needed three more votes than competitors to be accepted.
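The acceptance rule itself is simple to express in code. This sketch simulates first-to-ahead-by-K with a hypothetical noisy agent (the 99% accuracy figure and the answer strings are illustrative, not from the paper):

```python
import random
from collections import Counter

def first_to_ahead_by_k(sample_vote, k=3, max_votes=10_000):
    """Accept the first answer that leads every rival by at least k votes."""
    counts = Counter()
    for _ in range(max_votes):
        counts[sample_vote()] += 1
        ranked = counts.most_common(2)
        leader, lead = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if lead - runner_up >= k:
            return leader
    raise RuntimeError("no consensus reached")

# Hypothetical noisy agent: right answer 99% of the time.
random.seed(0)
agent = lambda: "move disk 1 to C" if random.random() < 0.99 else "wrong move"
print(first_to_ahead_by_k(agent, k=3))
```

Note that votes are drawn lazily: on an easy step the race usually ends after just k agreeing samples, so the scheme stays cheap in the common case.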
The mathematics here are powerful. Small gains in per-step accuracy compound exponentially across millions of steps. If voting increases accuracy from 99.7% to 99.99%, the probability of completing a million steps goes from effectively zero to achievable. The voting requirement scales logarithmically with total steps, keeping costs manageable.
```
# Required votes scale logarithmically with total steps, not linearly
k_min = ceil(ln(t^(-m/s) - 1) / ln((1-p)/p)) = O(ln s)

# For 1,048,575 steps with 99.7% base accuracy:
k_min = 3  # only need 3 votes ahead to achieve zero errors
```
Pillar 3: Red-Flag Filtering
Here's a subtle but critical insight: logic errors often manifest as formatting errors first. When an LLM is confused about what to do, it tends to ramble, exceed token limits, or produce malformed output. MAKER exploits this by treating structural issues as diagnostic signals.
If an output exceeds ~700 tokens or fails formatting checks, the system immediately discards it and retries. The research found that overly long responses had a ~90% error rate, compared with ~0.1% for properly formatted short responses. Red-flagging also reduces correlated errors—the situation where multiple votes on a step are wrong in the same way.
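A filter like this is a few lines of code. The move format below (and the whitespace-based token approximation) are assumptions for illustration; the real system checks against whatever output schema the microagent prompt specifies.

```python
import re

MAX_TOKENS = 700  # length threshold cited above

# Hypothetical expected format for one Hanoi move, e.g. "move 3 from A to C".
MOVE_RE = re.compile(r"^move \d+ from [ABC] to [ABC]$")

def red_flagged(response: str) -> bool:
    """True if the output should be discarded and the step retried."""
    if len(response.split()) > MAX_TOKENS:   # rambling -> likely confused
        return True
    if not MOVE_RE.match(response.strip()):  # malformed -> likely confused
        return True
    return False

print(red_flagged("move 3 from A to C"))           # well-formed and short
print(red_flagged("Hmm, let me think... " * 200))  # rambling: discard
```

Crucially, the filter never inspects whether the move is *correct*; it rejects on structure alone, which is what makes it cheap enough to run on every vote.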
"By smashing intelligence into a million pieces, it is possible to build AI that is efficient, safe, and reliable."
03. Why This Was Always a Systems Design Problem (Editorial)
I need to be honest here: this research confirms something I've believed for a long time. AI hallucination was never the mystical, unsolvable problem it was portrayed as. It was a technical barrier—a systems design challenge that the industry was approaching from the wrong direction.
The Industry's Wrong Turn
For years, the dominant narrative has been: "We need bigger models to reduce hallucination." Every frontier lab has been racing to train larger networks, consume more data, burn more compute. The implicit assumption? AI reliability is a capability problem that scales with model size.
But MAKER proves this was always backwards. Reliability scales with architecture, not parameters. A well-designed system using small, cheap models achieved what no frontier model could—not because it was smarter per step, but because it was smarter about how steps were organised.
Think about how we solve reliability problems in every other engineering discipline. We don't build aeroplanes that never fail—we build systems with redundancy, error correction, and graceful degradation. We don't design bridges that require perfect materials—we engineer safety factors and load distribution. The entire field of fault-tolerant computing exists because we accepted that individual components fail and designed around it.
Why did we think AI would be different? Why did we expect a single neural network to be infallible when we'd never expect that of any other system?
The answer, I suspect, is partly marketing. "We're building AGI" sounds more exciting than "We're building better orchestration." The model-scaling narrative justified massive investments and captured imagination. But it also led us down a path where we kept throwing resources at model size while ignoring architectural solutions that were always available.
What This Means for Practitioners
For those of us building real AI automation systems, MAKER provides validation and a roadmap. The principles aren't new—decomposition, redundancy, error checking—but the proof that they work at million-step scale removes any doubt about their viability.
The hallucination problem isn't waiting for GPT-5 or Claude 4. It's solvable today with careful systems engineering. That's not a limitation—it's an opportunity.
04. The Surprising Economics: Why Smaller Models Win
Perhaps the most counterintuitive finding from the MAKER research is the cost analysis. Conventional wisdom says frontier models produce better results. For single queries, that's often true. But for orchestrated systems with error correction, the economics flip entirely.
| Model | Per-Step Error Rate | Cost for Million Steps | Reliability |
|---|---|---|---|
| gpt-4.1-mini | 0.22% | $3,500 - $4,900 | Zero errors achieved |
| gpt-oss-20B | ~0.3% | $1,700 | Zero errors achieved |
| o3-mini (reasoning) | 0.18% | $9,400 | Zero errors achieved |
| haiku-4.5 | Higher | $71,200 | Achievable but expensive |
The cheapest reliable option—gpt-oss-20B at $1,700—costs 97.6% less than the most expensive approach. And here's the key insight: the cheaper models weren't just "good enough." They achieved identical reliability—zero errors—because the architecture compensated for lower per-step accuracy through voting.
Why This Inverts the Scaling Narrative
The industry has operated on an implicit assumption: if you need better AI results, you need a better (read: bigger, more expensive) model. MAKER shows this is false for many use cases.
When you can correct errors through voting, the marginal value of higher per-step accuracy decreases. A small model that's wrong 0.3% of the time costs vastly less than a large model that's wrong 0.2% of the time—but with voting, both achieve zero errors. The extra accuracy of the expensive model doesn't translate to better outcomes; it just burns money.
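You can sanity-check this claim with a simple model. If each vote is an independent draw, the first-to-ahead-by-k race is a biased random walk, and the gambler's-ruin formula gives the chance the correct answer wins (a deliberate simplification for illustration, not the paper's exact derivation):

```python
def step_success_with_voting(p: float, k: int) -> float:
    """P(correct answer wins a first-to-ahead-by-k race), modeled as a
    biased random walk between two outcomes (gambler's ruin)."""
    return 1.0 / (1.0 + ((1.0 - p) / p) ** k)

STEPS = 1_048_575
for p in (0.997, 0.998):  # 0.3% vs 0.2% per-step error
    q = step_success_with_voting(p, k=3)
    print(f"p={p}: task success ~ {q ** STEPS:.3f}")
```

Under this model, both the 0.3%-error and the 0.2%-error model clear the million-step task with high probability at k=3. The residual gap between them is small; it's the voting, not the extra per-step accuracy, doing the work.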
This has profound implications for enterprise AI deployment. Organisations don't need to wait for cheaper frontier models or negotiate expensive API contracts. They can achieve production-grade reliability today by investing in architecture rather than compute. For businesses building AI capabilities, this dramatically lowers the barrier to reliable automation.
05. Limitations and Honest Critiques
No breakthrough is without caveats, and intellectual honesty requires acknowledging MAKER's limitations. Critics have raised valid points that deserve consideration:
The Towers of Hanoi is a "Toy Problem"
The benchmark is a recursively-structured puzzle with a known algorithmic solution. Critics argue this doesn't represent "real-world" complexity where dependencies are tightly coupled and steps aren't naturally decomposable. This is fair—but the research explicitly addresses scalability principles, not claims about universal applicability.
Decomposition Requires Human Design
MAKER doesn't solve the meta-problem of how to decompose unfamiliar tasks. The framework relies on human-written prompts that define what constitutes an atomic step. For novel problems without clear structure, this decomposition expertise must come from somewhere.
Context Loss Across Steps
Stateless execution means later steps lose context from earlier reasoning. For problems where decisions depend on accumulated understanding (not just current state), radical decomposition may create incompatibilities. Real workflows often need backtracking and conditional branching that pure sequential voting can't handle.
My Take on the Critiques
These limitations are real, but they're also engineering problems rather than fundamental barriers. The Towers of Hanoi may be a toy problem, but it demonstrated principles that apply broadly. NASA didn't invent fault-tolerant computing by solving realistic problems first—they proved the concepts worked, then applied them to space missions.
The decomposition challenge is where human expertise meets AI capability. Defining atomic steps for business workflows is exactly what AI consultants and automation specialists do. MAKER doesn't eliminate the need for that expertise—it makes the expertise more valuable by enabling reliable execution once decomposition is achieved.
06. What This Means for AI Automation
MAKER demonstrates the first clear case of "multi-agent advantage"—where coordinated AI systems achieve results impossible for any single model. This validates the architectural approach that companies building serious AI automation have been developing. Here's what changes:
Reliability Is Achievable Today
The waiting game is over. Organisations don't need to delay AI automation projects hoping the next model will be reliable enough. With proper architecture, current models can achieve production-grade accuracy for complex, multi-step workflows.
Cost Barriers Are Lower Than Thought
Enterprise AI doesn't require enterprise budgets for API costs. Small, efficient models in well-designed systems outperform expensive frontier models. This democratises access to reliable AI automation for SMBs and startups.
Architecture Expertise Becomes Critical
The competitive advantage in AI shifts from "which model do you use" to "how well do you orchestrate." Systems design, decomposition strategies, and error-handling architectures become the differentiators. This is good news for thoughtful practitioners.
The Path Forward
MAKER provides a blueprint, but implementation requires expertise. Decomposing business workflows into atomic steps, designing voting strategies, and building red-flag filtering for specific domains—these are engineering challenges that require understanding both the business process and the AI architecture. This is exactly the kind of work that makes AI automation consulting valuable.
07. Frequently Asked Questions
What is the MAKER framework?
MAKER (Maximal Agentic decomposition, K-threshold Error mitigation, and Red-flagging) is an AI framework developed by Cognizant that achieves zero errors across over one million LLM steps. It uses extreme task decomposition, multi-agent voting, and output filtering to overcome the reliability limitations of large language models. The framework distributes tasks across millions of stateless microagents, each handling one atomic decision.
How does MAKER solve AI hallucination?
MAKER treats hallucination as a systems design problem rather than a model capability issue. By breaking tasks into atomic steps (one decision per agent), using stateless execution, and implementing voting across multiple parallel agents, MAKER prevents errors from compounding. Red-flag filtering catches confused outputs before they propagate, achieving perfect accuracy across 1,048,575 sequential steps.
Why is this research significant for AI automation?
MAKER demonstrates the first "multi-agent advantage"—where coordinated AI systems achieve results impossible for any single model. It proves that reliable AI at scale doesn't require bigger models or more compute; it requires better architecture. This has profound implications for enterprise AI deployment, suggesting that systems engineering, not model scaling, is the path to production-ready AI automation.
What models does MAKER use?
Surprisingly, MAKER achieves best results with smaller, non-reasoning models like gpt-4.1-mini and gpt-oss-20B rather than expensive frontier models. The research found these provide the best reliability-per-dollar, with costs around $1,700-$4,900 for a million-step task compared to $71,200 for larger models. This inverts the assumption that better AI requires bigger models.
What are the limitations of the MAKER approach?
MAKER works best for tasks that can be naturally decomposed into independent atomic steps with well-defined state. The Towers of Hanoi benchmark, while demonstrating the principle, is a recursively structured problem. Real-world tasks with tight coupling, non-linear dependencies, or requiring backtracking may be harder to decompose. The approach also requires human-defined decomposition strategies for unfamiliar problems.
How does this relate to AI agents and automation?
MAKER validates the multi-agent architecture that companies building AI automation have been developing. Rather than relying on a single powerful AI to handle complex workflows, orchestrating many focused agents with error correction produces more reliable results. This pattern applies directly to enterprise automation: document processing, data pipelines, and business workflows can all benefit from decomposition and voting.
The Paradigm Shift Is Here
MAKER isn't just a research paper—it's a proof point that changes how we should think about AI reliability. The hallucination problem that has plagued LLM deployment isn't waiting for some future breakthrough. It's solvable now, with current models, through careful systems design.
For those of us who have long argued that AI unreliability is fundamentally an engineering challenge rather than an intelligence limitation, this research is vindication. The industry spent years and billions chasing model scale when the answer was architectural sophistication. Better AI systems don't require smarter models—they require smarter systems.
The implications ripple outward: enterprise AI deployment becomes feasible with current technology, cost barriers drop by orders of magnitude, and the competitive advantage shifts from API budget to architectural expertise. For organisations serious about AI automation, the path is clear. Stop waiting for the next model. Start building better systems.