Anthropic dropped Claude Opus 4.6 on February 5, 2026. Within 48 hours, my Twitter feed was full of developers claiming it managed their entire engineering org, refactored million-line codebases, and found security vulnerabilities that humans missed for years.
Some of this is hype. But the benchmarks back up at least part of the story. Opus 4.6 scores 68.8% on ARC AGI 2—nearly double what Opus 4.5 managed. It hits 84% on BrowseComp for web research tasks. And here’s the part that actually matters: it maintains 76% accuracy on needle-in-a-haystack tests at 1 million tokens, while Sonnet 4.5 falls apart at 18.5%.
That last number is why people are paying attention. A context window that actually works at scale changes what you can build.
The 1M Context Window That Actually Works
Every frontier lab has been racing to announce bigger context windows. 200K, 500K, 1M tokens. The problem is that most of them suffer from what developers call "context rot": the model starts forgetting things or hallucinating once you push past a certain point.
Opus 4.6 is the first Opus model with a 1M token context window, and according to Anthropic’s benchmarks, it actually uses that context. On MRCR v2—an 8-needle test where the model has to find multiple specific facts buried in massive amounts of text—Opus 4.6 scores 76% at 1M tokens. Sonnet 4.5 gets 18.5%. At 256K tokens, Opus 4.6 hits 93%.
One million tokens is roughly 750,000 words. That’s multiple codebases, entire research papers, comprehensive financial reports, or months of conversation history.
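To make that concrete, here's roughly what a single long-context request looks like with the Anthropic Python SDK. Treat this as a sketch: the model ID and the beta flag are my assumptions, with the flag patterned on the naming Anthropic used for earlier 1M-context betas.

```python
import anthropic
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Concatenate a whole source tree into one prompt. At roughly 750,000 words
# per million tokens, a mid-sized codebase fits in a single request.
corpus = "\n\n".join(p.read_text() for p in sorted(Path("src").rglob("*.py")))

response = client.messages.create(
    model="claude-opus-4-6",  # assumed model ID
    max_tokens=4096,
    # Assumed beta flag, patterned on Anthropic's earlier 1M-context betas.
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
    messages=[{
        "role": "user",
        "content": corpus + "\n\nTrace every call path that touches the "
                   "authentication middleware and list the files involved.",
    }],
)
print(response.content[0].text)
```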
A developer on Reddit described migrating an entire authentication system from IdentityServer4 to Keycloak using the full 1M context window. The model tracked architectural decisions, security implications, and implementation details across hundreds of files. Another user reported processing quarterly reports, regulatory filings, and market research simultaneously, with Claude maintaining coherent analysis across all documents.
The technical achievement here isn’t just expanding memory. Anthropic added context compaction—a beta feature that automatically summarizes earlier parts of conversations when you approach the window limit. This pairs with improved retrieval mechanisms to prevent the model from “forgetting” information as context grows.
What this means in practice: tasks that used to require complex RAG pipelines, careful context management, and multiple API calls can now run in a single session. Whether that’s worth the cost depends on your use case, but the capability is there.
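Anthropic handles compaction server-side in the beta, but the underlying idea is easy to sketch client-side. This is an illustration of the concept, not Anthropic's implementation; the model ID, the soft limit, and the four-characters-per-token heuristic are all assumptions.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"   # assumed model ID
SOFT_LIMIT = 900_000        # assumed headroom below the 1M window

def approx_tokens(messages):
    # Rough heuristic: about 4 characters per token for English text.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, keep_last=6):
    """Fold everything but the last few turns into a model-written summary."""
    if approx_tokens(messages) < SOFT_LIMIT:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content":
                   "Summarize this conversation, preserving every decision, "
                   "open question, and technical detail:\n\n" + transcript}],
    ).content[0].text
    # With an alternating history that ends on a user turn, an even keep_last
    # makes `recent` start with an assistant turn, so role alternation survives.
    return [{"role": "user",
             "content": "[Summary of earlier conversation]\n" + summary}] + recent
```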
Agent Teams: Parallelizing Cognitive Work
The most interesting new feature is agent teams, currently in research preview for Claude Code. Instead of one agent working through tasks sequentially, you can spin up multiple agents that work in parallel, coordinating autonomously.
The architecture is straightforward: one orchestrator agent spawns multiple sub-agents, each handling a specific subtask. These sub-agents work independently on things like codebase reviews or documentation analysis, then coordinate their findings.
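Claude Code wires this up for you, but the fan-out/fan-in shape is easy to see in a hand-rolled sketch against the plain Messages API. Everything here, from the model ID to the subtasks and the merge prompt, is illustrative rather than how agent teams works under the hood.

```python
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()
MODEL = "claude-opus-4-6"  # assumed model ID

async def sub_agent(task: str) -> str:
    """One sub-agent: a single independent task, run to completion."""
    resp = await client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text

async def orchestrate(subtasks: list[str]) -> str:
    # Fan out: sub-agents run concurrently on independent, read-heavy work.
    findings = await asyncio.gather(*(sub_agent(t) for t in subtasks))
    # Fan in: a final call merges the sub-agents' reports into one answer.
    merged = "\n\n".join(f"Sub-agent {i}: {f}" for i, f in enumerate(findings))
    resp = await client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user",
                   "content": "Combine these reports into a single plan:\n\n" + merged}],
    )
    return resp.content[0].text

# Illustrative three-way split, echoing the refactor anecdote later in this section.
plan = asyncio.run(orchestrate([
    "Review the database migration scripts for schema drift.",
    "Audit the API endpoints for breaking changes.",
    "List every auth middleware function that needs refactoring.",
]))
```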
One developer reported that Opus 4.6 autonomously managed a 50-person organization across six repositories, handling both product and organizational decisions. The model knew when to escalate to a human and when to proceed independently.
In cybersecurity testing, Opus 4.6 with agent teams produced the best results 38 out of 40 times in blind rankings against Claude 4.5 models. Each model ran end-to-end on the same agentic harness with up to 9 sub-agents and over 100 tool calls.
Another developer working on a large codebase migration finished a two-day authentication refactor in 90 minutes using agent teams. The model split the work across multiple agents—one handling database migrations, another updating API endpoints, a third refactoring authentication middleware—all working in parallel.
Scott White, Head of Product at Anthropic, compared it to having a talented team of humans working for you. The ability to parallelize cognitive work is genuinely new.
Technical Changes Under the Hood
Adaptive Thinking Mode
Previous Claude models had a binary choice: enable extended thinking or disable it. Opus 4.6 introduces adaptive thinking, where the model decides when and how much to think based on the task.
At the default effort level (high), Claude engages extended thinking when useful but skips it for simpler problems. The model doesn’t just decide whether to think—it decides how deeply to think. This meta-cognitive capability lets it allocate computational resources efficiently.
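Here's what the difference looks like at the API level. The explicit `thinking` block is the existing extended-thinking interface; the assumption I'm making is that omitting it on Opus 4.6 gets you the adaptive default.

```python
import anthropic

client = anthropic.Anthropic()

# Old style: explicitly enable extended thinking and pin a token budget.
pinned = client.messages.create(
    model="claude-opus-4-6",  # assumed model ID
    max_tokens=16000,         # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user",
               "content": "Find a closed form for the sum of the first n cubes."}],
)

# Adaptive style: no thinking config at all. The assumption is that the
# model then decides for itself whether, and how deeply, to think.
adaptive = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    messages=[{"role": "user",
               "content": "Rename this variable consistently across the file."}],
)
```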
Effort Parameter and Granular Control
Anthropic added four effort levels: low, medium, high (default), and max. The new max effort level pushes the model to its reasoning limits. This lets developers fine-tune the cost-quality tradeoff for specific use cases.
A financial analysis task might use max effort for critical calculations, while a code formatting task might use low effort to minimize latency and cost.
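In code, that per-task tuning might look something like this. The exact name and placement of the effort field are assumptions (I'm forwarding it via `extra_body`); the four levels themselves are as Anthropic describes them.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"  # assumed model ID

def ask(prompt: str, effort: str = "high") -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=8192,
        # Assumed field name; forwarded verbatim since the SDK's typed
        # parameters may not include it.
        extra_body={"effort": effort},
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

# Cheap and fast for mechanical work, maximum reasoning where it counts.
formatted = ask("Reformat this JSON config.", effort="low")
audit = ask("Re-derive this discounted cash flow model and flag any errors.",
            effort="max")
```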
Fast Mode: 2.5x Speed Boost
Fast mode delivers up to 2.5x faster output token generation at premium pricing ($30 per million input tokens, $150 per million output tokens). This is the same model running on faster inference, not a different model. No change to intelligence or capabilities, just reduced latency.
Developers report that fast mode makes Claude feel more responsive and natural.
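Anthropic hasn't spelled out the selection mechanism in a way I can verify, so the opt-in below is hypothetical; only the same-model-faster-inference framing and the $30/$150 pricing come from the announcement.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",  # assumed model ID; fast mode is the same model
    max_tokens=2048,
    # Hypothetical opt-in flag; the real selection mechanism may differ.
    extra_headers={"anthropic-beta": "fast-mode-2026-02-05"},
    messages=[{"role": "user", "content": "Draft a commit message for this diff."}],
)
```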
128K Output Tokens
Opus 4.6 supports up to 128K output tokens, doubling the previous 64K limit. Combined with the 1M input context window, this creates a model that can handle enterprise-scale tasks.
Fine-Grained Tool Streaming
Fine-grained tool streaming is now generally available. This lets developers process tool calls incrementally as they’re generated, rather than waiting for the entire response. For agentic applications that make dozens or hundreds of tool calls, this improves perceived responsiveness.
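Consuming those incremental tool calls looks like this with the Python SDK's streaming helper. The event shapes (`content_block_delta` events carrying `input_json_delta` fragments) are the standard streaming interface; the model ID is assumed, and I'm assuming no beta flag is needed now that the feature is generally available.

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "run_query",
    "description": "Run a read-only SQL query against the warehouse.",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}]

with client.messages.stream(
    model="claude-opus-4-6",  # assumed model ID
    max_tokens=2048,
    tools=tools,
    messages=[{"role": "user", "content": "How many orders shipped last week?"}],
) as stream:
    for event in stream:
        # Tool arguments arrive as incremental JSON fragments, so an agent
        # can start validating or preparing execution before the call ends.
        if (event.type == "content_block_delta"
                and event.delta.type == "input_json_delta"):
            print(event.delta.partial_json, end="", flush=True)
```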
Benchmark Performance: What Actually Changed
The benchmark results tell a specific story: Anthropic optimized for agentic workflows and practical deployment.
Agentic Capabilities
On Terminal-Bench 2.0, which tests command-line proficiency, Opus 4.6 scores 65.4%. That's up from Opus 4.5's 59.8%, ahead of Sonnet 4.5's 51.0% and Gemini 3 Pro's 56.2%, and narrowly ahead of GPT-5.2's 64.7%.
The BrowseComp benchmark shows Opus 4.6’s biggest improvement: 84.0%, up 16.2 percentage points from Opus 4.5’s 67.8%. This beats Sonnet 4.5’s 43.9%, Gemini 3 Pro’s 59.2%, and GPT-5.2 Pro’s 77.9%.
On OSWorld, which tests computer use through GUI interactions, Opus 4.6 delivers 72.7%, up from Opus 4.5’s 66.3%.
Abstract Reasoning
The most dramatic improvement comes on ARC AGI 2, which tests abstract reasoning on novel problems. Opus 4.6 scores 68.8%, nearly doubling Opus 4.5’s 37.6% and beating Gemini 3 Pro’s 45.1% and GPT-5.2 Pro’s 54.2%.
This 31.2 percentage point leap is one of the largest single-benchmark improvements I’ve seen in a frontier model update. ARC AGI is specifically designed to measure general intelligence rather than learned knowledge, which makes this result harder to dismiss as benchmark optimization.
Knowledge Work
On GDPVal-AA, which measures performance on economically valuable knowledge work tasks, Opus 4.6 scores 1606 Elo. That’s 190 points ahead of Opus 4.5’s 1416 and 144 points ahead of GPT-5.2’s 1462. This translates to Opus 4.6 beating GPT-5.2 approximately 70% of the time.
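That 70% figure isn't a separate measurement, by the way; it falls straight out of the standard Elo expected-score formula:

```python
# Standard Elo expected-score formula applied to the reported gap.
gap = 1606 - 1462                        # Opus 4.6 vs GPT-5.2 on GDPVal-AA
win_rate = 1 / (1 + 10 ** (-gap / 400))
print(f"{win_rate:.1%}")                 # -> 69.6%
```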
The Finance Agent benchmark shows Opus 4.6 at 60.7%, beating Opus 4.5’s 55.9%, Sonnet 4.5’s 54.2%, GPT-5.2’s 56.6%, and Gemini 3 Pro’s 44.1%.
Tool Use
On τ2-bench, which tests tool-calling capabilities, Opus 4.6 scores 91.9% on Retail scenarios and 99.3% on Telecom scenarios. These results beat all competitors including GPT-5.2’s 82.0% on Retail.
On MCP Atlas, which tests scaled tool use with many tools at once, Opus 4.6 scores 59.5%, down from Opus 4.5's 62.3%. This is one of the few areas where the model regresses. At max effort, the score recovers to 62.7%, slightly ahead of the previous version.
Coding
On SWE-bench Verified, which tests real-world software engineering, Opus 4.6 achieves 80.8%, essentially matching Opus 4.5’s 80.9% and GPT-5.2’s 80.0%. Anthropic maintained elite coding performance while optimizing other capabilities.
Real-World Applications
Software Development at Scale
Developers report using Opus 4.6 for multi-million-line codebase migrations. One described the model handling a codebase migration “like a senior engineer,” demonstrating judgment about when to refactor, when to preserve existing patterns, and when to ask for clarification.
In code review scenarios, Devin Review running on Opus 4.6 catches more bugs than it did before, and the model considers edge cases that other models miss. Early access partners describe its ability to navigate large codebases and identify the right changes as "state of the art."
Cybersecurity and Vulnerability Research
Anthropic used Claude Opus 4.6 to find over 500 previously unknown high-severity security flaws in open-source libraries. This demonstrates the model’s enhanced cybersecurity abilities and raises questions about dual-use risks.
To address these risks, Anthropic developed six new cybersecurity probes that track different forms of misuse. The company is also using the model to find and patch vulnerabilities in open-source software.
Legal and Financial Analysis
On BigLaw Bench, Opus 4.6 achieved 90.2%, with 40% of tasks earning perfect scores and 84% scoring above 0.8. Law firms are using the model to analyze multi-source content across legal, financial, and technical domains, with evaluations showing a 10% lift in performance over previous models.
In financial services, analysts report using the full 1M context window to process quarterly reports, regulatory filings, and market research simultaneously, with the model maintaining coherent analysis across all documents.
Office Productivity
Claude’s integration with Excel and PowerPoint (in research preview) shows how frontier models can enhance everyday productivity tools. In Excel, Opus 4.6 handles long-running tasks, can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass.
One user described the performance jump as “almost unbelievable,” noting that “real-world tasks that were challenging for Opus [4.5] suddenly became easy.”
Safety and Alignment
Opus 4.6’s intelligence gains don’t come at the cost of safety. On Anthropic’s automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse.
It’s just as well-aligned as Opus 4.5, which was Anthropic’s most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.
For Opus 4.6, Anthropic ran its most comprehensive set of safety evaluations yet, including new evaluations for user wellbeing, more complex tests of the model's ability to refuse potentially dangerous requests, and updated evaluations of its ability to surreptitiously perform harmful actions.
The company also applied new interpretability methods to understand why the model behaves the way it does and to catch problems that standard testing might miss.
What This Means for the LLM Landscape
The Agentic AI Shift
The emphasis on agentic capabilities—agent teams, improved tool use, better long-running task performance—signals where Anthropic thinks the industry is headed. The next frontier in AI isn’t just making models smarter; it’s making them more autonomous and better at coordinating complex multi-step tasks.
Opus 4.6’s improvements in computer use, web search, and terminal operations show that Anthropic is optimizing specifically for practical agent deployments. This is a bet that the future of AI lies in agents that complete tasks, not chatbots that answer questions.
Context Windows That Work
The industry’s context window arms race has been somewhat hollow. Models boast large windows that suffer from severe performance degradation. Opus 4.6’s 76% accuracy at 1M tokens sets a new standard: context windows must not only be large, they must be usable.
This will pressure competitors to focus not just on expanding context windows, but on maintaining performance at scale. The technical innovations behind Opus 4.6’s context handling—improved retrieval mechanisms, context compaction, better attention mechanisms—are the real competitive advantage.
The Multi-Agent Future
Agent teams change how AI systems can be architected. Rather than building increasingly large monolithic models, the future may lie in coordinating multiple specialized agents. This approach offers better parallelization, more efficient resource allocation, and the ability to combine different models or capabilities.
The success of agent teams in Opus 4.6 will likely inspire similar features in competing models and frameworks.
Enterprise Adoption
With its combination of long context, agent capabilities, and strong performance on knowledge work tasks, Opus 4.6 is positioned as an enterprise-ready model. The 190-point Elo improvement on GDPVal-AA, the 90.2% BigLaw Bench score, and the 60.7% Finance Agent performance translate directly to business value.
The integration with office tools like Excel and PowerPoint lowers the barrier to enterprise adoption. Knowledge workers can use Claude directly in the tools they already know.
What Comes Next
Scaling Multi-Agent Systems
Agent teams are currently in research preview and work best for specific types of tasks—independent, read-heavy work that can be parallelized. The next challenge is scaling multi-agent coordination to handle more complex dependencies, tighter coupling between tasks, and larger teams of agents.
The slight regression on MCP Atlas suggests that coordinating many tools simultaneously remains challenging.
Context Beyond 1M Tokens
If 1M tokens enables transformative applications, what becomes possible at 10M or 100M tokens? The technical challenges are significant: naive attention scales quadratically with context length, so a 10x longer window means roughly 100x the attention compute, and maintaining performance becomes increasingly difficult. But the potential applications are compelling: reasoning over entire codebases, processing years of conversation history, or analyzing comprehensive research corpora.
Adaptive Intelligence
The adaptive thinking feature in Opus 4.6 is a form of meta-cognition—the model deciding how much to think about a problem. This capability could be extended further: models that dynamically allocate resources, choose which tools to use, decide when to spawn sub-agents, or determine when to ask for human input.
This adaptive intelligence could make AI systems more efficient, more autonomous, and more aligned with human preferences.
Safety at Scale
As models become more capable and more autonomous, safety challenges intensify. Opus 4.6’s enhanced cybersecurity capabilities demonstrate the dual-use nature of frontier AI. The model can find vulnerabilities—but so could malicious actors using similar models.
Anthropic’s approach—developing new safety probes, accelerating defensive applications, and maintaining alignment while increasing capability—provides a template for responsible AI development. But as models become more powerful, the safety challenges will only grow more complex.
Conclusion
Claude Opus 4.6 is more than an incremental improvement. The 1M context window that actually maintains performance, the agent teams functionality, and the improvements in agentic capabilities combine to create a model that’s qualitatively different from what came before.
The benchmark results are impressive: 68.8% on ARC AGI 2, 84% on BrowseComp, 1606 Elo on GDPVal-AA. But the real story is in the applications: managing 50-person organizations, winning 38 of 40 blind cybersecurity rankings, completing two-day refactors in 90 minutes.
This release changes the LLM landscape because it enables entirely new categories of applications: AI agents that can work in teams, maintain context over massive conversations, and exercise judgment about when to think deeply and when to move quickly.
The implications: pressure on competitors to deliver context windows that actually work, a shift toward multi-agent architectures, acceleration of enterprise AI adoption, and new challenges in AI safety and alignment.
We’re entering an era where the frontier isn’t just about making models smarter—it’s about making them more autonomous and more capable of sustained, complex work. Claude Opus 4.6 is the first model that delivers on that promise.
The question now is what we’ll build with it.