AGI Progress Journal
Short records of significant developments and their impact on predictions.
February 20, 2026
Event: METR publishes updated Time Horizon 1.1 data — Claude Opus 4.6 sets record at 14.5 hours (870 min p50); GPT-5.3-Codex regresses; growth rate itself is ACCELERATING; best-fit model is accelerating exponential
Part 1: Updated METR Data (Feb 20, 2026)
METR updated their Time Horizon 1.1 leaderboard with evaluations of Claude Opus 4.6 and GPT-5.3-Codex. The results are significant.
New SOTA data points:
| Model | Release Date | p50 Horizon | Change from Previous SOTA |
|---|---|---|---|
| Claude Opus 4.6 | Feb 5, 2026 | 870 min (14.5 hrs) | +121% vs GPT-5.2 (6.6h) |
| GPT-5.3-Codex | Feb 5, 2026 | 390 min (6.5 hrs) | -1% vs GPT-5.2 (REGRESSION) |
Key surprise: GPT-5.3-Codex DOES NOT improve on GPT-5.2 for agent autonomy. Despite being a newer model with massive Terminal-Bench improvements (77.3%), its METR time horizon (390 min) is actually slightly lower than GPT-5.2 (394 min). Coding-focused improvements do not automatically translate to sustained agent autonomy.
Updated METR doubling times:
| Period | Previous Estimate | Updated (Feb 20) | Change |
|---|---|---|---|
| All-time (METR official) | 195.8 days | 182.7 days | -13 days |
| From 2023+ (METR official) | 165.3 days | 122.6 days (CI: 99.6-149.1) | -43 days |
| Since 2024 (our estimate, SOTA only) | 89 days | 100 days | +11 days |
| Last 6 months (our estimate) | - | 45.5 days | NEW |
Important methodological note: The previous “89-day doubling since 2024” was from the pre-Opus 4.6 TH1.1 release. With more data points and METR’s methodology update, the from-2023 figure stabilizes at 122.6 days. Our own SOTA-only exponential fit gives 100 days from 2024+. The discrepancy between 89 and 100 days is because the earlier estimate had fewer data points.
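As a rough illustration of how a doubling time falls out of an exponential fit, the sketch below regresses log2(horizon) on time and inverts the slope. It is not the code in metr_analysis.py; the function name doubling_time_days is hypothetical, and the input series is synthetic (constructed to double every 100 days), not METR data.

```python
import math

def doubling_time_days(days, horizons_min):
    """Least-squares slope of log2(horizon) vs. time, inverted to a doubling time."""
    n = len(days)
    ys = [math.log2(h) for h in horizons_min]
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    slope = sxy / sxx  # doublings per day
    return 1.0 / slope

# Synthetic series (NOT METR data): a horizon that doubles exactly every 100 days.
synthetic_days = [0, 50, 100, 150, 200]
synthetic_horizons = [60 * 2 ** (d / 100) for d in synthetic_days]
estimate = doubling_time_days(synthetic_days, synthetic_horizons)
print(round(estimate, 1))
```

Adding or removing points changes the slope, which is one reason estimates such as 89 and 100 days can both come from the same underlying series.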
Part 2: Statistical Model Fitting
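We fit eight model variants (three exponential fits on different time windows plus five other functional forms) to the 13 SOTA frontier data points. Results:
| Model | R² | RMSE | Parameters | Doubling Time |
|---|---|---|---|---|
| Exponential (2024+ data) | 0.978 | 0.232 | 2 | 100 days |
| Logistic (S-curve) | 0.975 | 0.280 | 3 | Variable (asymptote: ~17 days) |
| Accelerating Exponential | 0.975 | 0.284 | 3 | Variable (77 days instantaneous) |
| Gompertz | 0.973 | 0.291 | 3 | Variable |
| Exponential (6 months) | 0.973 | 0.079 | 2 | 45.5 days |
| Super-Exponential | 0.971 | 0.305 | 3 | Variable |
| Exponential (all data) | 0.931 | 0.467 | 2 | 124 days |
| Power Law | 0.629 | 1.086 | 2 | - |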
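Selected model: accelerating exponential, log(p50) = a + b·t + c·t² with a = 0.924, b = 0.00096, c = 0.0000036. Its R² (0.975) is essentially tied with the logistic and 2024+ exponential fits and well above the all-data simple exponential (0.931).
The positive quadratic coefficient (c > 0) means the growth rate itself is increasing over time. This is the most important finding: the doubling time is not fixed; it is shrinking.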
Instantaneous doubling times from the accelerating exponential:
| Period | Instantaneous Doubling |
|---|---|
| Mid-2024 | 182 days |
| Early 2025 | 117 days |
| Late 2025 | 86 days |
| Feb 2026 | 77 days |
This explains the apparent discrepancy: the “89-day” estimate from before was the slope at a particular point on an accelerating curve. The curve is now steeper.
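For a quadratic model in log space, the instantaneous doubling time follows from the derivative of the fitted curve. The sketch below shows that relationship using the coefficients quoted above. The log base and the origin and units of t are not stated in this entry, so the natural-log default and the sample t values are assumptions, and the output is not expected to reproduce the table exactly. The function name is hypothetical.

```python
import math

def instantaneous_doubling_time(t, b, c, log_base=math.e):
    """Doubling time at time t for log(p50) = a + b*t + c*t**2.

    The local growth rate is the derivative b + 2*c*t (log units per unit
    of t); one doubling adds log(2) in the same base.
    """
    rate = b + 2.0 * c * t
    if rate <= 0:
        raise ValueError("growth rate is not positive at this t")
    return math.log(2, log_base) / rate

# Coefficients from the journal entry; the t values are arbitrary sample points.
B, C = 0.00096, 0.0000036
early = instantaneous_doubling_time(0, B, C)
later = instantaneous_doubling_time(500, B, C)
print(early > later)  # a positive c makes the doubling time shrink as t grows
```

Because the rate grows linearly in t, any single "doubling time" quoted for this model is only valid at the date where it was evaluated.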
Part 3: Projections
Milestone projections under different models:
| Milestone | Exponential (all, 124d) | Exponential (2024+, 100d) | Accelerating Exp |
|---|---|---|---|
| 1 day (24h) | Sep 2026 | Jun 2026 | Apr 2026 |
| 1 week (168h) | Aug 2027 | Mar 2027 | Oct 2026 |
| 2 weeks (336h) | Dec 2027 | Jun 2027 | Dec 2026 |
| 1 month (720h) | Apr 2028 | Oct 2027 | Mar 2027 |
“Superhuman coder” milestone (~60+ hours = 3600 min):
| Model | Projected Date |
|---|---|
| AI 2027 (120-day assumed) | March 2027 |
| Exponential (all, 124d) | Feb 2027 |
| Exponential (2024+, 100d) | Oct 2026 |
| Accelerating exponential | Jul 2026 |
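Under a constant doubling time, a milestone date is just the number of doublings still needed multiplied by the doubling time. The sketch below projects forward from a single observed point; the helper name date_reaching is hypothetical. It anchors on the Opus 4.6 measurement (870 min, Feb 5, 2026) rather than on a fitted trend line, which is an assumption made for illustration and yields earlier dates than the table above.

```python
import math
from datetime import date, timedelta

def date_reaching(target_min, current_min, current_date, doubling_days):
    """Date at which the horizon reaches target_min under a fixed doubling time."""
    doublings = math.log2(target_min / current_min)
    return current_date + timedelta(days=round(doublings * doubling_days))

# "Superhuman coder" threshold of 3600 min, starting from the Opus 4.6 point.
projected = date_reaching(3600, 870, date(2026, 2, 5), doubling_days=100)
print(projected.isoformat())
```

The gap between a point-anchored projection and a fit-anchored one is itself a measure of how far the latest model sits above the trend.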
Part 4: Opus 4.6 Anomaly Analysis
Opus 4.6’s 870 minutes is 1.95x what the all-data exponential predicts (446 min). This is a 1.43σ residual — within normal variation but at the high end. The jump from GPT-5.2 to Opus 4.6 (394 → 870 min in 56 days) implies a local doubling time of just 49 days.
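The two figures in this paragraph can be checked directly from the numbers it quotes. The helper name local_doubling_time is hypothetical; the inputs are the values stated above.

```python
import math

def local_doubling_time(start_min, end_min, elapsed_days):
    """Doubling time implied by growth between two measurements."""
    return elapsed_days / math.log2(end_min / start_min)

implied = local_doubling_time(394, 870, 56)
ratio_to_trend = 870 / 446

print(round(implied))            # 49
print(round(ratio_to_trend, 2))  # 1.95
```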
This suggests Anthropic has made a specific architectural or training breakthrough for sustained agent autonomy that hasn’t been matched by OpenAI’s GPT-5.3-Codex (which explicitly focused on Terminal-Bench/coding rather than general autonomy).
Part 5: Impact on Predictions
This is a MAJOR update to our METR model. Key changes:
Doubling time correction: Our previous estimate of 89 days was close but the data now supports a more nuanced view: the best model is an accelerating exponential with instantaneous doubling currently at ~77 days and shrinking. The overall 2024+ doubling is ~100 days.
Opus 4.6 at 14.5 hours is a massive advance — Already 24% of the way to AI 2027’s “superhuman coder” milestone (60+ hours). At the accelerating exponential rate, this milestone arrives Jul 2026 — 8 months ahead of AI 2027’s March 2027 prediction.
GPT-5.3-Codex regression is significant — Shows that METR autonomy and coding benchmarks measure different things. Terminal-Bench improvements (+13pp) did not help METR. This means sustained multi-hour autonomous work requires something beyond coding skill.
The acceleration signal is robust — The positive quadratic coefficient in the accelerating exponential model is consistent across different time windows. The growth rate is genuinely increasing, not just noisy.
Logistic model shows possible asymptote at ~17 days (~410 hours) — If the logistic fit is correct, there may be a natural ceiling for METR-style tasks at about 2-3 weeks of human-equivalent work. However, this is uncertain with only 13 data points and the logistic may be overfitting to the current S-shape.
Scenario probability changes:
“Accelerating” scenario: 76% → 78% — Opus 4.6 14.5h is largest single SOTA jump; accelerating exponential validated; 24% of way to “superhuman coder”
“Unexpected Acceleration”: 11% → 12% — Instantaneous doubling of 77 days (and shrinking) is approaching recursive improvement territory
“Slowdown but Progress”: 10% → 8% — GPT-5.3 regression is counter-evidence for universal acceleration, but Opus 4.6 overwhelms it
“Punctuated Equilibrium”: 3% → 2% — Strongest possible counter-evidence to plateau
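A quick consistency check on the scenario update above: the four probabilities should sum to 100% both before and after the revision. The dictionary below simply restates the figures from the list.

```python
# (before, after) percentages for each scenario, as listed in this entry.
scenarios = {
    "Accelerating": (76, 78),
    "Unexpected Acceleration": (11, 12),
    "Slowdown but Progress": (10, 8),
    "Punctuated Equilibrium": (3, 2),
}

total_before = sum(before for before, _ in scenarios.values())
total_after = sum(after for _, after in scenarios.values())
print(total_before, total_after)  # 100 100
```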
Revised METR projections table:
| Date | Previous (89d doubling) | Revised (accel. exp) | Status |
|---|---|---|---|
| Feb 2026 | ~10.7h | 14.5h (actual) | ✅ AHEAD |
| Apr 2026 | ~16.3h | 22.5h | Projection |
| Aug 2026 | ~26h | 3.2 days (77h) | Projection |
| Dec 2026 | ~53h | 11.8 days (283h) | Projection |
| Mar 2027 | ~106h | 33 days | Projection |
What to watch:
Does METR evaluate Gemini 3.1 Pro / Deep Think? (Would test whether reasoning improvements translate to autonomy)
Does the accelerating trend continue or flatten? (Next SOTA model will be critical)
Does GPT-5.3-Codex’s regression hold for other coding-focused models? (Tests whether METR and coding benchmarks are diverging permanently)
Does the logistic asymptote (~17 days) appear real? (Would imply a ceiling for this task suite)
Sources: https://metr.org/time-horizons/ (Feb 20, 2026 update), METR TH1.1 YAML data, statistical analysis (metr_analysis.py)
February 19, 2026 (Update 2)
Event: xAI launches Grok 4.2 public beta (~Feb 17) — “rapid learning” architecture with weekly improvement cycles; multi-agent collaborative architecture (4 agents per response); 256K context; no published benchmarks yet
Significance level: MODERATE — novel architecture claims but no hard metrics. Worth tracking as xAI’s 4th major Grok iteration, but cannot update benchmark tables without published scores.
Key details:
Public beta on X platform — opt-in selection “Grok 4.2 (Beta)”; full release expected before April 21, 2026
“Rapid learning” architecture — weekly improvement cycles instead of months-long retraining. If real, this is a fundamentally different development paradigm from other frontier labs
Multi-agent collaborative architecture — 4 independent agents synthesized into unified response per query
256K token context window (vs Gemini 3.1 Pro’s 1M, Opus 4.6’s 1M)
Medical and engineering specialization — Musk claims “correctly answering open-ended engineering questions” and “remarkably better than 4.1”
Musk claims “order of magnitude smarter and faster than Grok 4” — unverified, no benchmark data published
Stock-trading simulation claims — reportedly outperformed GPT-5.1, Gemini 3 Pro, and Claude 3.5 Sonnet in decision-making tasks (non-standard benchmark)
No ARC-AGI-2, SWE-Bench, HLE, GPQA, METR, or OSWorld scores published
Development timeline: ~3 months from Grok 4.1 (Nov 17, 2025) to 4.2 beta (mid-Feb 2026)
Assessment:
The “rapid learning” architecture claim is the most interesting aspect — if xAI can genuinely iterate weekly on a frontier model, this changes the competitive dynamics. However, this is unverified.
Without published benchmarks, we cannot determine if Grok 4.2 is actually frontier-competitive. Grok 4.1 was considered behind the leaders.
The multi-agent architecture (4 agents per response) is a different approach from Google’s reasoning distillation or Anthropic’s agent teams.
Does not change scenario probabilities — no quantifiable evidence to shift assessment.
Watch for: Published benchmark scores upon full release (expected March-April 2026); comparison to Opus 4.6, GPT-5.3-Codex, and Gemini 3.1 Pro on standard benchmarks.
February 19, 2026
Event: Google releases Gemini 3.1 Pro — first .1 upgrade in Gemini series; Deep Think reasoning distilled into base model; ARC-AGI-2 77.1% in general-purpose model; Google Antigravity agentic platform launches
Part 1: Gemini 3.1 Pro Benchmarks (Google — Feb 19, 2026)
Google released Gemini 3.1 Pro, a major upgrade to the core Gemini 3 Pro model. This is the first .1 LLM release in the Gemini series, consolidating the advanced reasoning breakthroughs from Deep Think into a more widely accessible baseline model.
Key benchmarks:
| Benchmark | Gemini 3.1 Pro | Gemini 3 Pro | Delta | Context |
|---|---|---|---|---|
| ARC-AGI-2 | 77.1% | 31.1% | +46.0 pts | More than doubled; approaching Deep Think’s 84.6% |
| Artificial Analysis Intelligence Index | 57 | - | - | Well above median (26) |
Comparison to frontier (Feb 19, 2026):
| Model | ARC-AGI-2 | Type | Cost |
|---|---|---|---|
| Gemini 3 Deep Think | 84.6% | Specialized reasoning mode | $77.16/task |
| Gemini 3.1 Pro | 77.1% | General-purpose model | $2/$12 per M tokens |
| Opus 4.6 | 68.8% | General-purpose model | Standard API pricing |
| GPT-5.2 Pro | 54.2% | Extended reasoning mode | $21/$168 per M tokens |
| Human average | ~60% | - | - |
Important context:
This is a general-purpose model, NOT a specialized reasoning mode — unlike Deep Think, Gemini 3.1 Pro is usable for daily coding, chat, and agentic workflows at standard latency
ARC-AGI-2 77.1% in a base model — Gemini 3 Pro scored 31.1%; this is a +46 point jump. Deep Think scored 84.6% but costs $77.16/task. Gemini 3.1 Pro achieves ~91% of Deep Think’s reasoning at ~1/40th the cost
Deep Think reasoning distilled into base model — Validates the “inference-time scaling → distillation” pipeline. Expensive reasoning breakthroughs are being baked back into cheaper, faster models
Described as “comparable to Anthropic’s Sonnet 4.6 and Opus 4.6 and OpenAI’s GPT-5.2” — Google claims parity with frontier across benchmarks
Leads most benchmarks but trails Claude Opus 4.6 in some tasks — Specific areas where Opus leads not detailed
Improved software engineering behavior — In GitHub Copilot testing, “excels on effective and efficient edit-then-test loops with high tool precision, achieving strong resolution”
Improved token efficiency — More grounded, factually consistent experience vs predecessors
No specific SWE-Bench, HLE, METR, or OSWorld scores published — Key metrics still missing for full competitive comparison
Part 2: Technical Specifications
| Spec | Value |
|---|---|
| Input context | 1,048,576 tokens (1M) |
| Output limit | 65,536 tokens |
| Modalities | Text, image, video, audio, PDF (native multimodal) |
| Knowledge cutoff | January 2025 |
| Code processing | Up to 30,000 lines per upload |
| Pricing (<200K tokens) | $2/M input, $12/M output |
| Pricing (200K-1M tokens) | $4/M input, $18/M output |
| Status | Public preview |
Pricing note: Identical to Gemini 3 Pro — the +46 point ARC-AGI-2 reasoning improvement comes at zero price increase. This is a free capability upgrade for existing Gemini 3 Pro users.
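To make the two pricing tiers concrete, here is a minimal cost estimator built from the rates in the spec table. The function name is hypothetical. The source does not say which token count selects the tier or how a prompt of exactly 200K tokens is billed, so the sketch assumes the prompt (input) size decides and that the higher rate applies only above 200,000 tokens.

```python
def gemini_31_pro_cost_usd(input_tokens, output_tokens):
    """Estimated request cost in USD from the per-million-token rates above."""
    # Assumption: the tier is chosen by prompt size, strictly above 200K tokens.
    long_context = input_tokens > 200_000
    in_rate, out_rate = (4.0, 18.0) if long_context else (2.0, 12.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

short_request = gemini_31_pro_cost_usd(100_000, 10_000)
long_request = gemini_31_pro_cost_usd(500_000, 10_000)
print(f"{short_request:.2f} {long_request:.2f}")  # 0.32 2.18
```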
Part 3: Google Antigravity — New Agentic Development Platform
Google launched Antigravity, a new agentic development platform where Gemini 3.1 Pro is available in preview. This is Google’s answer to Anthropic’s Claude Code agent teams and OpenAI’s Codex platform.
Availability:
Consumers: Gemini app, NotebookLM (higher limits for AI Pro/Ultra subscribers)
Developers: Gemini API, Google AI Studio, Gemini CLI, Google Antigravity, Android Studio
Enterprise: Vertex AI, Gemini Enterprise
Demonstrated capabilities:
Live aerospace dashboard synthesizing ISS telemetry streams
Code-based SVG animation generation (vector, scalable, small file size)
Interactive 3D experiences with hand-tracking and adaptive audio
Creative reasoning for design tasks (interpreting literary tone for UI design)
Agentic workflows with precise tool usage and multi-step execution
Finance and spreadsheet domain improvements
Part 4: Impact on Predictions
This is a significant development. Here’s why:
1. Deep Think reasoning is “trickling down” to base models
| Model | ARC-AGI-2 | Type | Cost per task (ARC) |
|---|---|---|---|
| Gemini 3 Pro (~Nov 2025) | 31.1% | Base model | Standard |
| Gemini 3 Deep Think (Feb 12) | 84.6% | Specialized mode | $77.16 |
| Gemini 3.1 Pro (Feb 19) | 77.1% | Base model | ~Standard |
The pattern: (1) Develop expensive reasoning via inference-time compute → (2) Distill back into base model at standard cost. This is the reasoning distillation pipeline — and it works. Gemini 3.1 Pro captures ~91% of Deep Think’s ARC-AGI-2 performance at a fraction of the cost and latency.
Implication: Every Deep Think breakthrough will eventually become a base model capability. The frontier of cheap, fast reasoning is now ~1-2 months behind the expensive specialized frontier.
2. ARC-AGI-2 77.1% in a general-purpose model redefines the baseline
| Model | ARC-AGI-2 | Date | Type |
|---|---|---|---|
| Gemini 3 Deep Think | 84.6% | Feb 12 | Specialized |
| Gemini 3.1 Pro | 77.1% | Feb 19 | General-purpose |
| Opus 4.6 | 68.8% | Feb 5 | General-purpose |
| Human average | ~60% | - | - |
| GPT-5.2 Pro | 54.2% | Dec 2025 | Extended reasoning |
| Gemini 3 Pro | 31.1% | Nov 2025 | General-purpose |
Among general-purpose models (not specialized reasoning modes), Gemini 3.1 Pro now leads ARC-AGI-2 by 8.3 points over Opus 4.6 (77.1% vs 68.8%). This is the first general-purpose model to exceed 75% on ARC-AGI-2, well above the human average (~60%).
3. Agentic platform competition intensifies
| Lab | Agentic Platform | Key Feature |
|---|---|---|
| Anthropic | Claude Code agent teams | 16 parallel agents, git coordination |
| OpenAI | GPT-5.3-Codex | Self-referential development, API withheld |
| Google | Antigravity | New agentic dev platform, Gemini CLI |
Google Antigravity joins the agentic platform race. Three major labs now have dedicated agentic development platforms, confirming that the “agent OS” paradigm is the primary battleground for 2026.
4. Pricing unchanged = reasoning is getting cheaper exponentially
Gemini 3.1 Pro at $2/$12 per M tokens delivers 77.1% on ARC-AGI-2, about 91% of the score that cost $77.16/task via Deep Think one week earlier. We put this at a ~40x cost reduction for comparable reasoning capability in just 7 days, with the caveat that per-token and per-task prices are not directly comparable. Combined with Rubin’s 10x inference cost reduction, the economics of AI reasoning are improving at an extraordinary pace.
5. Software engineering improvements noted but unquantified
Google claims “improved software engineering behavior” and GitHub Copilot testing shows strong edit-then-test loops with high tool precision. However, no SWE-Bench score was published. This is notable — Google may still trail on practical coding benchmarks (SWE-Bench plateau at ~80-82% for leaders). Without specific numbers, the coding claim remains unverified.
6. Caveats — what we don’t know
No SWE-Bench score — Coding capability vs Sonnet 5 (82.1%) and Opus 4.6 (80.8%) unknown
No HLE score — Expert knowledge performance unknown
No METR data — Agent autonomy duration unknown
No OSWorld score — Computer use capability unknown
No GPQA score — Science/math performance vs GPT-5.2 Pro (93.2%) unknown
“Comparable to” claim unverified — Google says comparable to Opus 4.6 and GPT-5.2, but specific benchmark-by-benchmark data is sparse
Public preview only — Not yet GA; early adopters report response delays and high-demand errors
Knowledge cutoff January 2025 — 13 months stale
Part 5: Updated Competitive Landscape (Feb 19, 2026)
| Domain | Leader | Score | Runner-up | Score |
|---|---|---|---|---|
| Abstract Reasoning (ARC-AGI-2, specialized) | Gemini 3 Deep Think | 84.6% | Gemini 3.1 Pro | 77.1% |
| Abstract Reasoning (ARC-AGI-2, general) | Gemini 3.1 Pro | 77.1% | Opus 4.6 | 68.8% |
| Expert Knowledge (HLE, no tools) | Gemini 3 Deep Think | 48.4% | GPT-5.2 Pro | 36.6% |
| Expert Knowledge (HLE, w/ tools) | Opus 4.6 | 53.1% | GPT-5.2 Pro | 50.0% |
| Professional Work (GDPval-AA) | Opus 4.6 | 1606 Elo | GPT-5.2 | 1462 |
| Coding (SWE-Bench Verified) | Sonnet 5 | 82.1% | Opus 4.6 | 80.8% |
| Agentic Coding (Terminal-Bench) | GPT-5.3-Codex | 77.3% | Opus 4.6 | 65.4% |
| Computer Use (OSWorld) | Opus 4.6 | 72.7% | GPT-5.3-Codex | 64.7% |
| Competitive Programming (Codeforces) | Gemini 3 Deep Think | 3455 Elo | - | - |
| Math Research (autonomous papers) | Gemini Deep Think (Aletheia) | Publishable | GPT-5.2 Pro | Low-hanging fruit |
| Cybersecurity | GPT-5.3-Codex | 77.6% | GPT-5.2-Codex | 67.4% |
| Agent Autonomy (METR) | GPT-5.2 | 6.6h | Opus 4.5 | ~5.3h |
Key shift: Google now holds the top TWO spots on ARC-AGI-2 (Deep Think 84.6% specialized, 3.1 Pro 77.1% general-purpose). The “reasoning distillation” pipeline gives them a structural advantage in abstract reasoning.
Part 6: Scenario Probability Updates
| Category | Assessment | Evidence |
|---|---|---|
| Abstract reasoning (ARC-AGI-2) | ACCELERATING (strengthened) | 77.1% in base model (+46 from Gemini 3 Pro); Deep Think distillation validated |
| Reasoning cost economics | ACCELERATING (new signal) | ~91% of Deep Think reasoning at ~1/40th cost in 7 days |
| Agentic platforms | ACCELERATING | Google Antigravity joins Anthropic + OpenAI; three-way platform race |
| Coding (SWE-Bench) | PLATEAUING (unchanged) | No new SWE-Bench data from Google |
| Agent autonomy (METR) | ACCELERATING (unchanged) | No new data; 89-day doubling holds |
| Computer use (OSWorld) | AT HUMAN LEVEL (unchanged) | No new data |
| Competitive intensity | INCREASING | 5th major release in 16 days (Sonnet 5 → Opus 4.6 → GPT-5.3-Codex → Deep Think → 3.1 Pro) |
Scenario probability changes:
“Accelerating” scenario: 74% → 76% — Reasoning distillation pipeline validated (Deep Think → base model in 7 days); general-purpose ARC-AGI-2 77.1% well above human average; Antigravity adds third agentic platform; 5 releases in 16 days
“Unexpected Acceleration”: 11% (unchanged) — Distillation speed is notable but within expected range
“Slowdown but Progress”: 11% → 10% — Reasoning cost dropping ~40x in 7 days is strong counter-evidence to slowdown
“Punctuated Equilibrium”: 4% → 3% — Zero evidence of plateau; fifth release in 16 days
What to watch next:
Does Gemini 3.1 Pro get specific SWE-Bench, HLE, GPQA, OSWorld scores published? (Would clarify the “comparable to Opus 4.6” claim)
Does METR evaluate Gemini 3.1 Pro? (Would test if reasoning improvements translate to agent autonomy)
How quickly do Anthropic and OpenAI respond? (Opus 4.7? GPT-5.4?)
Does the reasoning distillation pipeline get adopted by other labs? (Would compress all reasoning timelines)
Does Google Antigravity gain traction vs Claude Code and Codex? (Platform competition)
DeepSeek V4 still expected — will open-source match 77.1% ARC-AGI-2?
Sources: Google Keyword blog (Feb 19, 2026), Perplexity search, Artificial Analysis Intelligence Index
February 12, 2026
Event: Google releases major Gemini 3 Deep Think upgrade — ARC-AGI-2 record 84.6%, Codeforces 3455 Elo, Aletheia math agent writes autonomous publishable paper; DeepMind publishes two research papers on AI-accelerated science
Part 1: Gemini 3 Deep Think Benchmarks (Google — Feb 12, 2026)
Google released a major upgrade to Gemini 3 Deep Think, their specialized inference-time reasoning mode. Available to Google AI Ultra subscribers and via Gemini API (early access program).
Key benchmarks:
| Benchmark | Gemini 3 Deep Think | Previous Best | Previous Leader | Delta |
|---|---|---|---|---|
| ARC-AGI-2 | 84.6% | 68.8% (Opus 4.6); 45.1% (prev Deep Think) | Opus 4.6 | +15.8 pts (vs Opus); +39.5 pts (vs prev Deep Think) |
| HLE (no tools) | 48.4% | 36.6% (no tools) | GPT-5.2 Pro | +11.8 pts |
| Codeforces Elo | 3455 | - | - | Legendary Grandmaster |
| IMO 2025 | Gold medal | Gold medal (prev Deep Think) | - | Maintained |
| IPhO 2025 (written) | Gold medal | - | - | NEW |
| IChO 2025 (written) | Gold medal | - | - | NEW |
| CMT-Benchmark | 50.5% | - | - | Advanced theoretical physics |
| IMO-ProofBench Advanced | ~90% | ~65.7% (Jul 2025 Deep Think) | - | +24.3 pts |
Important context:
Deep Think is a specialized reasoning mode, not a general model — high latency, high compute per query
HLE 48.4% is WITHOUT tools. Opus 4.6 leads at 53.1% WITH tools, so the two figures are an apples-to-oranges comparison. The previous HLE no-tools record was 36.6% (GPT-5.2 Pro), so Deep Think leads that category by 11.8 points
ARC-AGI-2 84.6% is the headline, verified by ARC Prize Foundation. This is 15.8 points above Opus 4.6 (68.8%) and well above the human average (~60%)
Codeforces 3455 — places model at “Legendary Grandmaster” level (top competitive programmers worldwide)
Previous Deep Think scored 45.1% on ARC-AGI-2 (~Nov 2025), so the upgrade went 45.1% → 84.6% in ~3 months: a gain of 39.5 points, or 88% in relative terms
Cost per task: $77.16 (current Deep Think at 84.6%) — expensive specialized reasoning mode. Open-source Poetiq system achieves same 84.6% at $30.57/task (40% of Deep Think cost). Previous Deep Think (~Nov 2025, 45.1%) cost unknown.
Deep Think uses ~138,000 reasoning tokens per ARC task vs ~96 tokens for standard Gemini 3 Pro — the extended reasoning is what produces the +53.5 point gain but at significant compute cost. This is the COST of inference-time scaling, not a token efficiency improvement.
ARC-AGI-3 being built — ARC Prize Foundation is building a fundamentally new interactive/agentic benchmark (agents explore environments, test hypotheses) that complements ARC-AGI-2 rather than replacing it. ARC-AGI-2 remains the primary static reasoning benchmark.
No SWE-Bench or coding benchmarks reported — Google may not lead on practical coding tasks
No METR time-horizon data — agent autonomy unknown for this mode
Available to Google AI Ultra (~$42/month) and Gemini API early access
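Several bullets above mix percentage-point gains with relative improvements, so it helps to compute both from the same pair of scores. The sketch below does that for the Deep Think upgrade and also checks the Poetiq cost ratio. The helper name gains is hypothetical; all inputs are figures quoted in this entry.

```python
def gains(old_pct, new_pct):
    """Return (percentage-point gain, relative improvement in percent)."""
    points = new_pct - old_pct
    relative = (new_pct / old_pct - 1) * 100
    return points, relative

points, relative = gains(45.1, 84.6)
poetiq_share = 30.57 / 77.16 * 100

print(round(points, 1), round(relative))  # 39.5 88
print(round(poetiq_share))                # 40
```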
Part 2: Aletheia — Autonomous Math Research Agent (DeepMind Paper, Feb 11, 2026)
Paper: “Towards Autonomous Mathematics Research” (Feng, Trinh, Bingham, et al.)
What it is: A math research agent powered by Gemini Deep Think, featuring a three-component architecture:
Generator — produces candidate solutions
Verifier — natural language verification to identify logical flaws
Reviser — corrects solutions based on verifier feedback
Can admit failure — prevents wasteful computation on unsolvable problems
Integrates Google Search + web browsing to navigate mathematical literature and prevent hallucinations
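The generator, verifier, and reviser roles described above can be sketched as a simple control loop. This is a minimal illustration, not Aletheia's actual implementation: the function solve_with_verification, the three callables, and the max_rounds budget are all hypothetical, and the source does not say how the agent decides to admit failure. Here failure is assumed to mean that the revision budget ran out.

```python
from typing import Callable, Optional

def solve_with_verification(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a candidate the verifier accepts, or None to admit failure."""
    candidate = generate(problem)
    for round_index in range(max_rounds + 1):
        flaw = verify(problem, candidate)  # None means no flaw was found
        if flaw is None:
            return candidate
        if round_index < max_rounds:
            candidate = revise(problem, candidate, flaw)
    return None  # budget exhausted: give up rather than return an unverified answer

# Toy stand-ins for the three model calls (purely illustrative).
def toy_generate(problem):
    return "5"

def toy_verify(problem, candidate):
    return None if candidate == "4" else "arithmetic error"

def toy_revise(problem, candidate, flaw):
    return "4"

result = solve_with_verification("2 + 2", toy_generate, toy_verify, toy_revise)
print(result)  # 4
```

Returning None instead of the last unverified candidate is the design choice that mirrors the "can admit failure" behavior.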
Benchmark performance:
| Benchmark | Aletheia | Deep Think (Jan 2026) | Deep Think (Jul 2025) |
|---|---|---|---|
| IMO-ProofBench Advanced | ~95.1% | ~90% | ~65.7% |
| FutureMath Basic (PhD-level) | ~95.1% | ~38% | - |
Autonomous research achievements:
Fully autonomous paper (Feng26) — Generated without human intervention. Calculates eigenweights (structural constants in arithmetic geometry). Submitted to journal for peer review.
Human-AI collaboration papers — Contributed intermediate propositions to two papers (FYZ26 on arithmetic volumes, ACGKMP26 on complexity bounds)
Open problem solving — Evaluated 700 problems from Erdős Conjecture database:
Solved 4 open Erdős problems autonomously
One (Erdős-1051) was generalized into an independent research paper (BKKKZ26)
Human-guided collaboration — LeeSeo26 on independence polynomials
Mathematical Research Autonomy Levels framework introduced:
Level 0: Negligible novelty
Level 1: Minor novelty
Level 2: Publishable quality ← Aletheia’s highest achieved level
Level 3: Major advance (not claimed)
Level 4: Landmark breakthrough (not claimed)
Key insight: Aletheia demonstrates that inference-time scaling laws transfer to PhD-level mathematics, not just Olympiad problems. The FutureMath Basic jump from ~38% (base Deep Think) to ~95.1% (Aletheia) shows agentic workflows dramatically amplify reasoning capability.
Part 3: Accelerating Scientific Research — Cross-Disciplinary Results (DeepMind Paper, Feb 11, 2026)
Paper: “Accelerating Scientific Research with Gemini: Case Studies and Common Techniques” (Woodruff, Cohen-Addad, et al.)
What it is: Collaboration with experts on 18 research problems across algorithms, ML, combinatorial optimization, information theory, economics, and physics.
Key results:
Max-Cut and Steiner Tree — Classic CS problems where progress had stalled. Deep Think broke both deadlocks by applying tools from continuous mathematics (Kirszbraun Theorem, measure theory, Stone-Weierstrass theorem) to discrete algorithmic puzzles. Cross-disciplinary bridge.
Decade-old conjecture disproved — A 2015 conjecture about online submodular optimization was assumed true for 10 years. Deep Think engineered a specific 3-item combinatorial counterexample, rigorously proving human intuition false.
ML optimization proof — Researchers had a technique for automatic noise filtering but couldn’t explain why it worked. Deep Think analyzed the equations and proved the method generates its own “adaptive penalty” internally.
Economic theory extension — A “Revelation Principle” for auctioning AI generation tokens only worked for rational numbers. Deep Think used advanced topology and order theory to extend it to continuous real numbers.
Cosmic string physics — Found novel analytical solution using Gegenbauer polynomials for gravitational radiation integrals containing singularities.
Research methodology insights:
“Advisor” model — Humans guide AI through iterative “Vibe-Proving” cycles
“Balanced prompting” — Requesting simultaneous proof OR refutation to prevent confirmation bias
Code-assisted verification — Using code to check mathematical results
Successfully used for reviewing CS theory papers for STOC’26 conference
Publication trajectory: about half target strong conferences (including an ICLR ’26 acceptance); the remainder are headed for journal submission.
Part 4: Real-World Applications
Lisa Carbone (Rutgers) — Deep Think reviewed a technical mathematics paper on high-energy physics structures and identified a subtle logical flaw that passed human peer review
3D printing from sketches — Deep Think analyzes drawings, models complex shapes, generates 3D-printable files
Built in partnership with scientists across math, physics, and CS
Part 5: Impact on Predictions
This is a significant development. Here’s why:
1. ARC-AGI-2 at 84.6% redefines the reasoning frontier
| Model | ARC-AGI-2 | Date |
|---|---|---|
| Gemini 3 Deep Think | 84.6% | Feb 12, 2026 |
| Opus 4.6 | 68.8% | Feb 5, 2026 |
| GPT-5.2 Pro | 54.2% | Dec 2025 |
| Human average | ~60% | - |
| Opus 4.5 | 37.6% | 2025 |
| GPT-5.1 | 17.6% | 2025 |
The progression: 17.6% → 37.6% → 54.2% → 68.8% → 84.6% in ~6 months. This is faster than any of our projections. Our Q1 2026 “Accelerating” threshold for ARC-AGI-2 was >68% — Deep Think blows past that to 84.6%. Our Q3 2026 “Accelerating” target was >80%. Google reached Q3 2026 “Accelerating” territory in Q1.
2. AI mathematical research has qualitatively shifted
| Date | Milestone | Significance |
|---|---|---|
| Jan 2026 | GPT-5.2 Pro solves Erdős #397 | “Lowest hanging fruit” (Tao) |
| Feb 2026 | Aletheia writes autonomous publishable paper | Publishable-quality autonomous research |
| Feb 2026 | Aletheia solves 4 open Erdős problems from 700 evaluated | Systematic survey capability |
| Feb 2026 | Deep Think bridges math/CS fields to break decade-old bottlenecks | Cross-disciplinary research |
We went from “AI can solve easy open problems with standard techniques” (Jan) to “AI can autonomously produce publishable papers and systematically survey hundreds of open problems” (Feb). This is a qualitative leap in just 5 weeks.
3. Inference-time compute scaling is validated as a major capability vector
Deep Think’s architecture (parallel hypothesis exploration, configurable compute budget) shows that you don’t need a bigger model to get dramatically better results — you need smarter inference:
ARC-AGI-2 went from 31.1% (standard Gemini 3 Pro) to 84.6% (Deep Think) — +53.5 points from the same base model
Previous Deep Think scored 45.1% (~Nov 2025); new Deep Think scores 84.6% — accuracy nearly doubled in ~3 months
Deep Think costs $77.16/task at 84.6% — expensive but effective. Open-source Poetiq achieves the same accuracy at $30.57/task
Deep Think uses ~138,000 reasoning tokens per ARC task vs ~96 for standard mode — this extended reasoning is the mechanism behind the +53.5 point gain, at significant compute cost
IMO-ProofBench went from ~65.7% to ~90% with more inference compute
This suggests other models could achieve similar jumps with inference-time scaling
ARC Prize Foundation building ARC-AGI-3 — a fundamentally new interactive/agentic benchmark (agents explore environments, test hypotheses) that complements ARC-AGI-2
4. Google is back in the frontier conversation
Prior to this, Google’s Gemini 3 Pro was competitive but not leading on most benchmarks. Deep Think changes that:
ARC-AGI-2: Google leads (84.6% vs 68.8% Opus 4.6)
Codeforces: Google leads (3455 Elo — Legendary Grandmaster)
Math/Physics Olympiads: Google leads (triple gold)
CMT-Benchmark: Google leads (50.5% — advanced theoretical physics)
HLE (no tools): Google leads (48.4% vs 36.6% GPT-5.2 Pro no-tools)
But Opus 4.6 still leads on: HLE with tools (53.1%), GDPval-AA (1606 Elo), OSWorld (72.7%), BrowseComp (84.0%)
GPT-5.3-Codex still leads on: Terminal-Bench (77.3%), Cybersecurity (77.6%)
The race is now three-way at the frontier, not two-way.
5. Caveats — what Deep Think does NOT change
Not a general model — Deep Think is a specialized mode with high latency/cost. You wouldn’t use it for daily coding or chat.
No SWE-Bench data — Practical coding capability unknown. SWE-Bench plateau at ~80-82% is not addressed.
No METR data — Agent autonomy unknown. METR 89-day doubling unaffected.
No OSWorld data — Computer use capability unknown.
API access limited — Early access program only; not generally available to developers.
High cost — $77.16/task for Deep Think at 84.6%. Open-source Poetiq achieves same accuracy at $30.57/task. Still expensive for routine use.
Part 6: Updated Competitive Landscape (Feb 12, 2026)
| Domain | Leader | Score | Runner-up | Score |
|---|---|---|---|---|
| Abstract Reasoning (ARC-AGI-2) | Gemini 3 Deep Think | 84.6% | Opus 4.6 | 68.8% |
| Expert Knowledge (HLE, no tools) | Gemini 3 Deep Think | 48.4% | GPT-5.2 Pro | 36.6% |
| Expert Knowledge (HLE, w/ tools) | Opus 4.6 | 53.1% | GPT-5.2 Pro | 50.0% |
| Professional Work (GDPval-AA) | Opus 4.6 | 1606 Elo | GPT-5.2 | 1462 |
| Coding (SWE-Bench Verified) | Sonnet 5 | 82.1% | Opus 4.6 | 80.8% |
| Agentic Coding (Terminal-Bench) | GPT-5.3-Codex | 77.3% | Opus 4.6 | 65.4% |
| Computer Use (OSWorld) | Opus 4.6 | 72.7% | GPT-5.3-Codex | 64.7% |
| Competitive Programming (Codeforces) | Gemini 3 Deep Think | 3455 Elo | - | - |
| Math Research (autonomous papers) | Gemini Deep Think (Aletheia) | Publishable | GPT-5.2 Pro | Low-hanging fruit |
| Cybersecurity | GPT-5.3-Codex | 77.6% | GPT-5.2-Codex | 67.4% |
| Agent Autonomy (METR) | GPT-5.2 | 6.6h | Opus 4.5 | ~5.3h |
Key insight: No single model/lab dominates. Google leads on pure reasoning and math research. Anthropic leads on professional work, computer use, and broad capability. OpenAI leads on agentic coding and cybersecurity. The frontier is now genuinely three-way.
Part 7: Scenario Probability Updates
| Category | Assessment | Evidence |
|---|---|---|
| Abstract reasoning (ARC-AGI-2) | ACCELERATING (faster than expected) | 84.6% blows past Q3 2026 “Accelerating” target (>80%) in Q1 |
| AI math research | ACCELERATING | Autonomous publishable paper; 4 open problems solved; cross-disciplinary bridges |
| Inference-time compute scaling | VALIDATED | +53.5 points from same base model (31.1% → 84.6%) |
| Competitive intensity | INCREASING | Google back in frontier race; now 3-way competition |
| Coding (SWE-Bench) | PLATEAUING (unchanged) | No new data from Google |
| Agent autonomy (METR) | ACCELERATING (unchanged) | No new data; 89-day doubling holds |
| Computer use (OSWorld) | AT HUMAN LEVEL (unchanged) | No new data |
Scenario probability changes:
“Accelerating” scenario: 72% → 74% — ARC-AGI-2 84.6% is ahead of all projections; autonomous math research is a qualitative capability leap; three-way frontier competition intensifies pressure
“Unexpected Acceleration”: 10% → 11% — Inference-time scaling producing +53.5 point gains from same model is a strong signal that recursive reasoning improvements could compound
“Slowdown but Progress”: 13% → 11% — Each week brings more acceleration signals; only SWE-Bench plateau remains as counter-evidence
“Punctuated Equilibrium”: 5% → 4% — Near-zero evidence of plateau
Revised milestone estimates:
| Milestone | Previous Estimate | Revised | Reason |
|---|---|---|---|
| ARC-AGI-2 reaches 80%+ | Q3 2026 (“Accelerating”) | ACHIEVED (Feb 12, 2026) | Deep Think 84.6%, about 6 months early |
| AI autonomous publishable math paper | Not explicitly predicted | ACHIEVED (Feb 12, 2026) | Aletheia (Feng26) |
| AI systematically surveying open problems | 2027+ | ACHIEVED (Feb 12, 2026) | 700 Erdős problems evaluated |
| HLE (no tools) reaches 50% | Q2-Q3 2026 | Near-achieved (48.4%) | On track |
| AI contributing to research conferences | 2027 | NOW (Feb 2026) | STOC’26 paper reviews + ICLR ’26 acceptance |
What to watch next:
Does METR evaluate Gemini 3 Deep Think? (Would be transformative if 89-day doubling applies to Deep Think’s reasoning capacity)
DeepSeek V4 (still expected mid-Feb) — open-source pressure
Does Deep Think mode become generally available via API? (Currently early access only)
How do Anthropic and OpenAI respond? (Competitive pressure on reasoning benchmarks)
Does inference-time compute scaling get adopted by other labs? (Would compress all timelines)
Sources: Google Keyword blog (Feb 12, 2026), Google DeepMind blog (Feb 11, 2026), “Towards Autonomous Mathematics Research” paper (Feng et al.), “Accelerating Scientific Research with Gemini” paper (Woodruff et al.), ARC Prize Foundation verification, Perplexity search
February 5, 2026
Event: Three major releases in three days: Claude Sonnet 5 (Feb 3), then Claude Opus 4.6 and GPT-5.3-Codex on the same day (Feb 5); Anthropic demonstrates agent teams building a C compiler autonomously; METR TH1.1 confirms 89-day doubling
Addendum: Kimi K2.5 (Moonshot AI — Jan 29, 2026)
Added Feb 5; released Jan 29, 2026. Open-source (Modified MIT License).
What it is: 1T parameter MoE model (32B activated, 384 experts, 8 selected per token) from Moonshot AI (Chinese lab). Native multimodal (vision+language pre-training on ~15T mixed tokens atop Kimi-K2-Base). Features “Agent Swarm” — self-directed multi-agent task decomposition. 256K context. Open weights on Hugging Face.
Key benchmarks (Thinking mode, from their paper):
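| Benchmark | K2.5 | GPT-5.2 (xhigh) | Opus 4.5 (ET) | Gemini 3 Pro | DeepSeek V3.2 |
|---|---|---|---|---|---|
| HLE w/ tools | 50.2 | 45.5 | 43.2 | 45.8 | 40.8 |
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 | 93.1 |
| GPQA Diamond | 87.6 | 92.4 | 87.0 | 91.9 | 82.4 |
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 | 73.1 |
| SWE-Bench Pro | 50.7 | 55.6 | 55.4 | n/a | n/a |
| Terminal-Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 | 46.4 |
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 | 51.4 |
| BrowseComp (Agent Swarm) | 78.4 | n/a | n/a | n/a | n/a |
| OCRBench | 92.3 | 80.7 | 86.5 | 90.3 | n/a |
| MathVista (mini) | 90.1 | 82.8 | 80.2 | 89.8 | n/a |
| MathVision | 84.2 | 83.0 | 77.1 | 86.1 | n/a |
| VideoMMMU | 86.6 | 85.9 | 84.4 | 87.6 | n/a |
| SimpleVQA | 71.2 | 55.8 | 69.7 | 69.7 | n/a |
| PaperBench | 63.5 | 63.7 | 72.9 | n/a | 47.1 |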
Note: Their comparison is against Opus 4.5, not Opus 4.6. Against Feb 5 Opus 4.6 results (HLE 53.1, ARC-AGI-2 68.8, BrowseComp 84.0), K2.5 doesn’t lead any of our tracked metrics.
Assessment:
Strongest open-source model — surpasses DeepSeek V3.2 across the board (HLE +9.4, GPQA +5.2, SWE-Bench +3.7, Terminal-Bench +4.4)
Multimodal strength: leads on OCRBench (92.3), MathVista (90.1), and SimpleVQA (71.2), and sits within one point of Gemini 3 Pro on VideoMMMU (86.6 vs 87.6). These are areas where open models now match or beat closed frontier models.
HLE w/ tools 50.2 — beats GPT-5.2 (45.5) and Opus 4.5 (43.2) on hardest knowledge tasks. Behind only Opus 4.6 (53.1). Remarkable for open-source.
Agent Swarm — BrowseComp jumps from 60.6 to 78.4 with multi-agent self-decomposition. Validates the agent teams paradigm from a completely different lab.
Coding still lagging: SWE-Bench Verified 76.8 (vs 82.1 reported for Claude Sonnet 5), Terminal-Bench 2.0 50.8 (vs 77.3 for GPT-5.3-Codex). The gap is larger on coding than on reasoning.
China dynamics — Moonshot AI (Chinese lab) producing frontier-competitive open models. Supports AI 2027 thesis that China is ~6 months behind on frontier capabilities but closing gap, especially on reasoning/multimodal. Coding gap is wider.
Impact on predictions: Minor. No leaderboard changes. Strengthens competitive intensity thesis and open-source proliferation narrative. Does NOT change scenario probabilities — K2.5 is impressive but doesn’t exceed the frontier set by Opus 4.6 and GPT-5.3-Codex earlier this week.
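The margins over DeepSeek V3.2 quoted in the assessment can be recomputed from the benchmark table. A minimal sketch with the four rows the assessment cites; the variable names are illustrative and the scores are copied from the table.

```python
# Scores copied from the K2.5 benchmark table (Thinking mode).
k2_5 = {
    "HLE w/ tools": 50.2,
    "GPQA Diamond": 87.6,
    "SWE-Bench Verified": 76.8,
    "Terminal-Bench 2.0": 50.8,
}
deepseek_v3_2 = {
    "HLE w/ tools": 40.8,
    "GPQA Diamond": 82.4,
    "SWE-Bench Verified": 73.1,
    "Terminal-Bench 2.0": 46.4,
}

# Margin in points; rounding to one decimal hides float noise.
margins = {
    name: round(k2_5[name] - deepseek_v3_2[name], 1) for name in k2_5
}

for name, margin in margins.items():
    print(f"{name}: {margin:+.1f}")
```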
Part 1: Claude Opus 4.6 (Anthropic — Feb 5, 2026)
Key benchmarks (from Anthropic announcement):
| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.2 (best) | Change vs Dec baseline |
|---|---|---|---|---|
| SWE-Bench Verified | 80.8% | 80.9% | 80.0% | +0.8% (flat) |
| ARC-AGI-2 | 68.8% | 37.6% | 54.2% (Pro) | +14.6% |
| HLE (w/ tools) | 53.1% | 43.4% | 50.0% (Pro) | +3.1% |
| GPQA Diamond | 91.3% | 87.0% | 93.2% (Pro) | No new best |
| Terminal-Bench 2.0 | 65.4% | 59.8% | 64.7% (Codex) | New benchmark |
| OSWorld | 72.7% | 66.3% | n/a | +6.4% (human ~72%) |
| GDPval-AA (Elo) | 1606 | 1416 | 1462 | +144 Elo |
| BrowseComp | 84.0% | 67.8% | 77.9% (Pro) | +16.2% |
| T2-bench Telecom | 99.3% | 98.2% | 98.7% | Near-saturated |
| T2-bench Retail | 91.9% | 88.9% | 82.0% | +9.9% |
| MRCR v2 8-needle 256K | 93.0% | n/a | n/a | New (long-context) |
| MRCR v2 8-needle 1M | 76.0% | n/a | n/a | New (1M context) |
New capabilities:
1M token context window (beta) — first Opus-class model with this
128K output tokens — enables massive output in single request
Agent teams in Claude Code (research preview) — multiple agents work in parallel on shared codebase
Adaptive thinking — replaces binary extended thinking toggle
Context compaction (beta) — summarizes older context for longer-running tasks
Partner signals:
NBIM: 38/40 wins in cybersecurity investigations (blind ranking)
Harvey: BigLaw Bench 90.2%, 40% perfect scores
Rakuten: Autonomously closed 13 issues and assigned 12 in a single day across 6 repos
Box: 10% performance lift (68% vs 58% baseline)
Part 2: GPT-5.3-Codex (OpenAI — Feb 5, 2026)
Released minutes after Opus 4.6. Not yet available via API (unusual for a major launch — safety concerns cited).
Key benchmarks (all at xhigh reasoning effort):
| Benchmark | GPT-5.3-Codex | GPT-5.2-Codex | GPT-5.2 | Delta vs 5.2-Codex |
|---|---|---|---|---|
| SWE-Bench Pro | 56.8% | 56.4% | 55.6% | +0.4pp (marginal) |
| Terminal-Bench 2.0 | 77.3% | 64.0% | 62.2% | +13.3pp |
| OSWorld-Verified | 64.7% | 38.2% | 37.9% | +26.5pp |
| GDPval (wins/ties) | 70.9% | n/a | 70.9% | Matched |
| Cybersecurity CTF | 77.6% | 67.4% | 67.7% | +10.2pp |
| SWE-Lancer IC Diamond | 81.4% | 76.0% | 74.6% | +5.4pp |
Key firsts:
First OpenAI model classified “High capability” for cybersecurity under Preparedness Framework
First model directly trained to identify software vulnerabilities
“First model that was instrumental in creating itself” — Codex team used early versions to debug training, manage deployment, diagnose evaluations
Co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems
API access withheld — safety precaution (unprecedented for a major launch)
Self-improvement details:
Training debugging: monitored and debugged its own training run
Pattern tracking: deep analysis on interaction quality, proposed fixes
Visualization tools: built applications for researchers to understand behavior differences
Alpha testing analysis: independently created regex classifiers, ran at scale over all session logs, produced report
OpenAI staff: “Many researchers and engineers describing their job today as being fundamentally different from what it was just two months ago”
Part 3: Claude Sonnet 5 (Anthropic — Feb 3, 2026)
Released two days before Opus 4.6.
SWE-Bench: 82.1% (reported). If this is SWE-Bench Verified, it is the highest score logged in this journal, though not the first above 80% (Opus 4.5 reached 80.9%)
Internal codename: “Fennec”
Not prominently compared in Opus 4.6 announcement (Opus 4.6 benchmarks compare against Sonnet 4.5)
Part 4: Anthropic C Compiler Agent Teams Demo (Feb 5, 2026)
Published alongside Opus 4.6. Written by Nicholas Carlini (Safeguards team).
Scale:
16 parallel Claude Opus 4.6 agents
~2,000 Claude Code sessions over 2 weeks
~2 billion input tokens, ~140 million output tokens
Total cost: just under $20,000
Output:
100,000 lines of Rust — clean-room implementation (no internet access)
Depends only on Rust standard library
Builds bootable Linux 6.9 on x86, ARM, and RISC-V
Compiles QEMU, FFmpeg, SQLite, PostgreSQL, Redis
99% pass rate on most compiler test suites including GCC torture tests
Can compile and run Doom ✓
Architecture:
Bare git repo as coordination mechanism
Each agent runs in Docker container, pushes/pulls via upstream repo
Task locking via current_tasks/ directory files
No orchestration agent: each Claude decides “next most obvious” problem
Merge conflicts frequent but handled by agents
GCC used as “oracle” for parallel debugging of Linux kernel compilation
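The lock-file idea in the architecture list can be illustrated on a single machine. The source only says that locks are files under current_tasks/ and that agents coordinate through a shared git repo; it does not describe the mechanism in detail. The sketch below swaps in a different technique, atomic exclusive file creation on a local filesystem, to show the claim-or-skip pattern. The function names, the `.lock` suffix, and the agent IDs are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def try_claim(task_dir: Path, task_name: str, agent_id: str) -> bool:
    """Try to claim a task by creating its lock file; True means we own it.

    O_CREAT | O_EXCL makes the create fail if the file already exists,
    so only one caller on the same filesystem can win a given task.
    """
    task_dir.mkdir(parents=True, exist_ok=True)
    lock_path = task_dir / f"{task_name}.lock"  # hypothetical naming scheme
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else holds the task; pick another one
    with os.fdopen(fd, "w") as handle:
        handle.write(agent_id)
    return True


def release(task_dir: Path, task_name: str) -> None:
    """Drop the lock so the task can be claimed again."""
    (task_dir / f"{task_name}.lock").unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    tasks = Path(tmp) / "current_tasks"
    print(try_claim(tasks, "fix-parser", "agent-01"))  # first claim wins
    print(try_claim(tasks, "fix-parser", "agent-02"))  # second claim is refused
    release(tasks, "fix-parser")
    print(try_claim(tasks, "fix-parser", "agent-02"))  # free again after release
```

With agents in separate containers, as described above, the filesystem is not shared, so the equivalent arbitration would have to happen when lock files are pushed to the upstream repo; that part is not modeled here.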
Limitations acknowledged:
No working in-house assembler/linker (still buggy)
Relies on GCC for 16-bit x86 phase (ARM/RISC-V fully self-compiled)
Generated code less efficient than GCC -O0
Rust code quality “reasonable” but not expert-level
Model reached capability ceiling — new features frequently broke existing functionality
Why this matters:
Qualitative leap in autonomous software development — not fixing bugs, building entire systems
Cost comparison — $20K for 100K lines of compiler code vs. team-years of human effort
Agent teams are a new paradigm — parallel agents with simple coordination can produce complex artifacts
Capability benchmark — Carlini explicitly designed this as a stress test; Opus 4.5 could produce functional compiler but couldn’t compile real projects
Ceiling visibility — the limitations section is valuable for calibrating where agents struggle (novel architectures, tight constraints, optimization)
Part 5: METR Time Horizon 1.1 Update
METR released TH1.1 framework with significant improvements. Additionally, METR tweeted a new record:
“We estimate that GPT-5.2 with high (not xhigh) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date.”
Key data:
GPT-5.2 (high effort): 6.6 hours (~396 min) — NEW RECORD (METR tweet, Feb 2026)
95% CI: 3h 20m to 17h 30m
Claude Opus 4.5: 320 minutes (~5h 20m) — revised from ~4h 49m under TH1.0
GPT-5: 214 minutes (~3h 34m)
Task suite expanded: 170 → 228 tasks (34% growth), doubled 8+ hour tasks
Confidence intervals tightened: upper bound for Opus 4.5 narrowed from 4.4x to 2.3x
Critical finding — doubling time:
Overall (2019-present): 196 days
Post-2023: 131 days
Since 2024: 89 days — ~20% faster than previously estimated
Extrapolation at 89-day doubling (from GPT-5.2’s 396 min baseline):
| Date | Projected Time Horizon | Human Equivalent |
|---|---|---|
| Feb 2026 | ~6.6 hours | Most of a work day |
| May 2026 | ~13.2 hours | Full+ work day |
| Aug 2026 | ~26.4 hours | Three+ work days |
| Nov 2026 | ~52.8 hours | Full work week |
| Feb 2027 | ~105.6 hours | Two+ work weeks |
Comparison to AI 2027 predictions: AI 2027 assumed 120-day doubling. METR’s 89-day figure is a ~26% shorter doubling time (equivalently, growth about 35% faster in doublings per year). At this rate, their March 2027 “superhuman coder” milestone (~60+ hours) would be reached in Q4 2026 (around November), roughly four months earlier than they predicted.
Note: GPT-5.2’s 6.6 hours already lands inside our Q1 2026 “On Track” target range of 6-12 hours, and we’re only 5 weeks into Q1. This metric is heading into the ACCELERATING zone.
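The extrapolation above is plain exponential growth: horizon(t) = baseline × 2^(t / doubling time). The sketch below reproduces the projected horizons and estimates when a 60-hour horizon would be crossed under 89-day and 120-day doubling. The start date of Feb 5, 2026 is an assumption (the date of this entry, not a date METR attaches to the measurement), and the function names are illustrative.

```python
import math
from datetime import date, timedelta

BASELINE_MIN = 396                 # GPT-5.2 p50 horizon in minutes (from above)
BASELINE_DATE = date(2026, 2, 5)   # assumption: anchor the curve at this entry


def horizon_minutes(days_elapsed, doubling_days, baseline=BASELINE_MIN):
    """Projected p50 horizon after days_elapsed at a fixed doubling time."""
    return baseline * 2 ** (days_elapsed / doubling_days)


def days_to_reach(target_min, doubling_days, baseline=BASELINE_MIN):
    """Days until the projected horizon reaches target_min."""
    return doubling_days * math.log2(target_min / baseline)


# One row per doubling, matching the table's 6.6h -> 105.6h sequence.
for k in range(5):
    hours = horizon_minutes(89 * k, 89) / 60
    print(f"{k} doublings ({89 * k} days): {hours:.1f} hours")

# When does each doubling-time assumption cross 60 hours?
target = 60 * 60  # 60 hours in minutes
for doubling in (89, 120):
    days = days_to_reach(target, doubling)
    when = BASELINE_DATE + timedelta(days=round(days))
    print(f"{doubling}-day doubling: {days:.0f} days, around {when:%b %Y}")
```

The table spaces its rows three months apart, so exact 89-day steps land slightly earlier than the months it lists.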
Part 6: Competitive Dynamics
Three major releases in three days:
Feb 3: Claude Sonnet 5
Feb 5: Claude Opus 4.6 + GPT-5.3-Codex (literally minutes apart)
This is unprecedented competitive intensity. Key observations:
Same-day releases — Labs are clearly tracking each other’s launch schedules and racing
API withheld for GPT-5.3-Codex — Safety concerns forcing new release patterns
DeepSeek V4 expected mid-Feb — Open-source pressure incoming
NVIDIA GB200 co-design — Hardware-software co-optimization becoming standard
Self-improvement at both labs — Both Anthropic (agent teams) and OpenAI (Codex self-debugging) publicly demonstrating AI-accelerated R&D
Updated competitive landscape (Feb 5, 2026):
| Model | Key Strength | SWE-Bench | ARC-AGI-2 | HLE (tools) | Status |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Broadest capability leader | 80.8% (V) | 68.8% | 53.1% | Released Feb 5 |
| GPT-5.3-Codex | Agentic/terminal/OSWorld | 56.8% (Pro) | n/a | n/a | Released Feb 5 (no API) |
| Claude Sonnet 5 | Cost-effective coding | 82.1% (V?) | n/a | n/a | Released Feb 3 |
| GPT-5.2 Pro | GPQA, math | 80.0% (V) | 54.2% | 50.0% | Available |
| DeepSeek V4 | Open-source, 1M+ context | n/a | n/a | n/a | Expected mid-Feb |
Part 7: Impact on Predictions
Pattern observed: Capabilities BROADENING, not just deepening on coding
| Category | Assessment | Evidence |
|---|---|---|
| Coding (SWE-Bench) | PLATEAUING | Verified: 80.0% → 80.8% (+0.8%); Pro: 55.6% → 56.8% (+1.2%) |
| Novel reasoning (ARC-AGI-2) | ACCELERATING | 54.2% → 68.8% (+14.6% in ~2 months) |
| Computer use (OSWorld) | ACCELERATING | 66.3% → 72.7% (matching human ~72%) |
| Agentic tasks (Terminal-Bench) | ACCELERATING | GPT-5.3-Codex 77.3% (new SOTA by wide margin) |
| Knowledge work (GDPval) | ACCELERATING | +144 Elo (1462 → 1606) |
| Information retrieval (BrowseComp) | ACCELERATING | 67.8% → 84.0% |
| Cybersecurity | ACCELERATING | First “High” classification; CTF 77.6% |
| Agent autonomy (METR) | ACCELERATING | 89-day doubling (faster than expected) |
| Self-improvement | NEWLY VISIBLE | Both labs publicly demonstrating AI-accelerated R&D |
| Competitive intensity | INCREASING | 3 releases in 3 days; same-day launches |
Key insight: SWE-Bench is not the right metric to track anymore. Capabilities are shifting from “fix this Python bug” to “operate a computer like a human.” OSWorld at 72.7% (matching humans) and Terminal-Bench at 77.3% are the new leading indicators.
Scenario probability changes:
“Accelerating” scenario: 70% → 72% — breadth of capability gains, METR 89-day doubling, self-improvement visible
“Unexpected Acceleration”: 10% (unchanged) — SWE-Bench plateau counters; not yet seeing recursive takeoff
“Slowdown but Progress”: 14% → 13% — no slowdown signals except SWE-Bench plateau on one specific benchmark
“Punctuated Equilibrium”: 6% → 5% — no evidence of broader plateau
Revised milestone estimates:
| Milestone | Previous Estimate | Revised | Reason |
|---|---|---|---|
| SWE-Bench Verified “solved” (>97%) | Q4 2026 | H1 2027 | Plateauing at ~80-82%; may need architectural breakthrough |
| OSWorld human-level | Not predicted | ACHIEVED | Opus 4.6 at 72.7% vs human ~72% |
| METR ~10 hours | Q1 2026 | Q1 2026 ✅ | 89-day doubling on track |
| METR ~1 week | Q4 2026-Q1 2027 | Q4 2026 | 89-day doubling faster than expected |
| AI R&D flywheel visible | 2027 | NOW (Feb 2026) | Both labs publicly confirming |
| Agent teams building real software | 2027 | NOW (Feb 2026) | C compiler demo |
What to watch next:
DeepSeek V4 (mid-Feb) — open-source competition
METR evaluation on Opus 4.6 and GPT-5.3-Codex — will the 89-day doubling hold with new models?
API release for GPT-5.3-Codex — when/how safety concerns are resolved
Whether SWE-Bench Pro breaks 60% by end of Q1
Sources: Anthropic announcement (anthropic.com/news/claude-opus-4-6), Anthropic engineering blog (anthropic.com/engineering/building-c-compiler), OpenAI blog (openai.com/index/gpt-5-3-codex), GPT-5.3-Codex System Card, METR TH1.1, Perplexity search
January 27, 2026
Event: Clawdbot (open-source personal AI assistant) goes viral — “capability overhang” productization moment
Key Data (as of Jan 27, 2026):
GitHub repo: clawdbot/clawdbot
53,235 GitHub stars, 6,351 forks, 693 open issues
8,172 commits since the initial commit on Nov 24, 2025 (~64 days, roughly 128 commits/day), so “thousands of commits in ~2 months” is real
Multi-channel “always-on” assistant: WhatsApp/Telegram/Slack/Discord/Signal/iMessage/Teams/etc (per README)
Runs locally (Node ≥ 22) with OAuth to Anthropic/OpenAI; designed for persistent memory + long-running background workflows
Why This Matters:
Unprecedented solo-dev leverage — A single (high-skill) developer + frontier models can now ship and iterate at a scale that historically required full teams. Commit volume isn’t quality, but this is still a real regime change in iteration speed.
“Capability overhang” becomes visible — Even if frontier models plateaued, there is still massive untapped value in packaging: memory, permissions, tool routing, background jobs, multi-channel UX.
Agent deployment is becoming mainstream-adjacent — This is a credible “agent OS” prototype that non-researchers can install today; the adoption curve suggests latent demand for personal agents.
Open-source velocity — A viral, fork-heavy agent platform creates a fast-moving ecosystem of “skills” and integrations, accelerating deployment independent of frontier lab roadmaps.
How This Fits Our Timelines:
Matches AI 2027’s “deployment before magic” flavor: agentic impact can jump from glue code + workflows, not just new model weights.
Supports our 2026 thesis that software delivery speed is a leading indicator: when one person can produce/maintain thousands of changes in weeks, “SWE as a bottleneck” collapses earlier than the macro job-market effects show up.
Does not directly move METR-style autonomy (it’s not a benchmark), but it does demonstrate how current models can sustain long-running, tool-using behavior when wrapped in a persistent system.
Impact on Predictions:
Deployment/productization of agents: ACCELERATING
Personal-workflow shift: ACCELERATING (expect more “personal agent stacks” to go viral in 2026)
Frontier capability timeline: unchanged (this is mostly integration + UX + permissions + persistence)
“Accelerating” scenario probability: unchanged — treat as a strong deployment signal, not a new capability datapoint
Implications for Tracking:
Add a Q1 2026 “agent OS” signal: open-source personal assistants reaching mass adoption (stars/users) without new model releases
Track whether these assistants start reliably completing multi-hour tasks unattended (our “complex features without intervention” test)
Sources: https://github.com/clawdbot/clawdbot (repo stats + commit history checked Jan 27, 2026); https://docs.clawd.bot; https://clawdbot.com; Perplexity summary notes attributing authorship to Peter Steinberger (@steipete)
January 17, 2026
Event: xAI Colossus 2 operational — World’s first gigawatt-scale AI training cluster
Key Data:
1 GW operational now — First gigawatt AI training cluster in history
1.5 GW by April 2026 — 50% expansion in 3 months
2 GW total capacity planned; third building (“MACROHARDRR”) acquired
555,000 NVIDIA GPUs — $18B in GPU purchases
Current GPU mix: 150K H100 + 50K H200 + 30K GB200 (+ 110K GB200 coming)
$20B Series E closed January 6, 2026 to fund expansion
Power: Natural gas turbines + 168 Tesla Megapacks (bypassing grid delays)
Scale Comparison (from Epoch AI chart):
| Facility | Power | Status |
|---|---|---|
| xAI Colossus 2 | 1 GW (→1.5 GW April) | Operational |
| Anthropic-Amazon New Carlisle | ~800 MW → 1.2 GW | Under construction |
| OpenAI Stargate Abilene | ~300 MW → 2.2 GW | Future plans |
| Los Angeles (annual avg) | ~2.4 GW | Reference |
Why This Matters:
First mover at gigawatt scale — xAI beat everyone to 1 GW. This is 4-6x larger than any other operational AI cluster.
Speed of execution — Original Colossus went from conception to 100K GPUs in 122 days. Now scaling to 555K GPUs. This pace is unprecedented.
Bypassing infrastructure bottlenecks — Using portable gas turbines + Tesla Megapacks to avoid multi-year grid connection delays. Other labs are stuck waiting for power.
Grok training implications — This compute advantage should translate to Grok 5/6 being trained on significantly more compute than competitors.
Forcing function for industry — Per Epoch AI: “xAI has forced competitors to abandon cautious incrementalism in favor of ‘superfactory’ deployments”
Insider Details (Sulaiman Ghori Interview — xAI Engineer)
Operational Culture:
“No due dates. It’s always yesterday.”
“No blockers for anything — at least nothing artificial”
“The answer is either ‘no that’s dumb’ or ‘why isn’t it done already?’”
$2.5M value per commit to main repo — engineer did 5 commits in one day
iOS team: 3 people for millions of users
One person rebuilding core production APIs with 20 agents
Model Iteration Speed:
“New model iterations daily, sometimes multiple times a day”
“Within a day of standing up a rack you can usually be training, sometimes within hours”
Working on multiple novel architectures simultaneously
“Not building on any existing body of work — need new pre-training body”
Macro Hard (Human Emulator Project):
Goal: Emulate anything a human does digitally (keyboard/mouse + screen)
“No adoption from any software required — deploy in any situation a human is in”
Scaling plan: “Difference from 1,000 to 1 million human emulators is not very big”
Tesla Computer Network (Major Revelation):
Planning to use Tesla car computers as distributed compute
“4 million Tesla cars in North America, ~half have Hardware 4”
“70-80% of the time they’re sitting idle, probably charging”
“Pay owners to lease time — they get lease paid, we get human emulator running”
“Much more capital efficient than AWS/Oracle VMs or buying Nvidia hardware”
Purely software implementation — no buildout required
Infrastructure War Stories:
Land lease technically “temporary” — used carnival/fair permitting for speed
Tyler won a Cybertruck by getting a training run going on new GPUs within 24 hours
80+ mobile generators on trucks for power flexibility
Seamless switching between grid/generators/batteries during volatile training runs
Load scales “by megawatts in milliseconds”
Elon’s Direct Involvement:
Makes phone calls to Nvidia → software patches delivered next day
“How can I help? How can I make this faster?” at end of every meeting
Blockers “usually very quickly resolved with one phone call”
Impact on Predictions:
xAI execution speed: EXCEEDING EXPECTATIONS — Daily model iterations, hours to training on new hardware
Macro Hard timeline: ACCELERATING — Tesla network enables massive scale without buildout
“Accelerating” scenario probability: 65% → 68% — Insider details confirm faster-than-expected iteration
Agent deployment: NEW VECTOR — Tesla car network as distributed compute is a game-changer
Key Insight: The Tesla computer network revelation is potentially huge. If xAI can deploy millions of “human emulators” on idle Tesla hardware without building data centers, they have a deployment advantage no one else can match. This is infrastructure arbitrage at scale.
Source: Elon Musk announcement (Jan 17, 2026), Epoch AI, Wikipedia, Sulaiman Ghori interview
Colossus 3 / MACROHARDRR Facility (Aerial Imagery)
Physical Infrastructure (from @SawyerMerritt aerial photo):
| Facility | Size | Location | Status |
|---|---|---|---|
| Colossus 2 | 1M sqft | Tennessee (north of state line) | Operational (1 GW) |
| Colossus 3 / MACROHARDRR | 800K sqft | Mississippi (south of state line) | Under construction, Feb 2026 ops |
| Dedicated Power Plant | Adjacent to Colossus 3 | Mississippi | Under construction |
Key Observations:
State line split — Facilities deliberately span Tennessee/Mississippi border. Regulatory arbitrage: different permitting, power grids, tax structures, environmental rules.
Dedicated power plant — Not waiting on grid connections. Permanent version of the mobile generator strategy. Signals expectation of sustained high-utilization workloads.
“MACROHARDRR” naming — Building literally named after the product. Suggests Colossus 3 is specifically for Macro Hard inference workloads (human emulators), not just additional training capacity.
Scale: ~1.8M sqft total = ~30 football fields of data center space
What This Means for Macro Hard:
The question was: does planning for 1M human emulators mean the product is ready?
Evidence suggests committed bet, not speculation:
Billions in physical infrastructure specifically for Macro Hard
Dedicated power plant = expecting sustained inference workloads
February 2026 operations = they expect to have something to run
Daily model iterations (per Sully) = capability catching up to infrastructure
This is not “build it and they will come” — this is “build it because we know what’s coming.”
Impact on Predictions:
Macro Hard timeline: More confident — Infrastructure commitment suggests internal confidence in capability
xAI execution: Confirmed — They’re not just talking, they’re pouring concrete
“Accelerating” scenario: 68% (unchanged) — Already factored in, but this is confirming evidence
Source: @SawyerMerritt aerial imagery, Google Maps
Power Supply Chain Analysis & End-2026 Capacity Projection
xAI Power Supply Chain:
| Supplier | Product | Order Size | Status |
|---|---|---|---|
| Tesla | Megapacks (battery storage) | $375-400M (~168 units, ~655 MWh) | Deployed |
| Doosan Enerbility | 380 MW gas turbines | 5 units (1.9 GW total) | Ordered Jan 2026, first 2 by end 2026 |
| TVA | Grid power | 150 MW | Connected |
| MLGW | Grid distribution | Local delivery | Connected |
| VoltaGrid | Mobile gas turbines (2.5 MW) | ~35 units | Operational (some removed) |
| Solar Turbines | SMT-130 (16 MW) | Multiple | Operational |
| Duke Energy | Power plant site | Acquired | Under development |
End-2026 Capacity Projection:
| Source | Capacity | Timeline |
|---|---|---|
| Colossus 2 (current) | 1 GW | Jan 2026 ✓ |
| Colossus 2 expansion | +0.5 GW | April 2026 |
| Colossus 3 / MACROHARDRR | ~0.5-1 GW | Feb 2026 |
| Doosan turbines (first 2) | +760 MW | End 2026 |
| Duke power plant | TBD | 2026 |
| Total (base case) | ~3 GW | End 2026 |
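The ~3 GW base case can be reproduced by summing the rows with known values. A minimal sketch: each row is a (low, high) range in GW, the Duke plant is left out because its capacity is TBD, and the names are illustrative.

```python
# (low, high) capacity in GW for each row with a stated value.
sources_gw = {
    "Colossus 2 (current)": (1.0, 1.0),
    "Colossus 2 expansion": (0.5, 0.5),
    "Colossus 3 / MACROHARDRR": (0.5, 1.0),
    "Doosan turbines (first 2)": (0.76, 0.76),  # 2 x 380 MW
}

low_total = sum(low for low, _ in sources_gw.values())
high_total = sum(high for _, high in sources_gw.values())

print(f"End-2026 range: {low_total:.2f} to {high_total:.2f} GW")
```

The range brackets the ~3 GW figure, with the Colossus 3 estimate accounting for the whole spread.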
What 3 GW Powers:
At ~500W/GPU (GB200 class): ~6M GPU-equivalents theoretical
With overhead/cooling: ~1-1.5 million GPUs realistic
This is ~1.5-2x the largest competitor sites projected for end 2026 (~1.5-2 GW)
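The gap between the theoretical and realistic GPU counts implies an all-in power budget per GPU. The sketch below works backward from the figures above; the 500 W chip figure and the 1-1.5M realistic range come from the list, and the constant names are illustrative.

```python
TOTAL_W = 3e9    # ~3 GW base case
CHIP_W = 500     # per-GPU draw assumed in the list above

theoretical_gpus = TOTAL_W / CHIP_W
print(f"Theoretical: {theoretical_gpus / 1e6:.1f}M GPUs")

# Back out the all-in watts per GPU implied by the realistic range.
implied = {}
for realistic_gpus in (1.0e6, 1.5e6):
    all_in_w = TOTAL_W / realistic_gpus
    implied[realistic_gpus] = all_in_w
    factor = all_in_w / CHIP_W
    print(f"{realistic_gpus / 1e6:.1f}M GPUs -> {all_in_w:.0f} W each "
          f"({factor:.0f}x the chip figure)")
```

So the realistic range corresponds to roughly 2-3 kW of facility power per GPU once cooling, networking, host CPUs, and other overhead are included.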
Competitor Comparison (End 2026):
| Company | Capacity | Status |
|---|---|---|
| xAI | ~3 GW | Operational + construction |
| Anthropic/AWS (Rainier) | ~1.5 GW | Under construction |
| Microsoft/OpenAI (Fairwater) | ~2 GW | Under construction |
| OpenAI (Stargate) | ~0.3-0.5 GW | Early construction |
Workload Analysis: Why Human Emulators Are the Primary Target
Evidence pointing to Macro Hard as primary workload:
Naming — Colossus 3 literally named “MACROHARDRR”
Dedicated power plant — Built for sustained loads, not bursty training
Scale of buildout — 3 GW only makes economic sense for high-utilization inference
Sully’s comments — “Difference from 1K to 1M emulators is not very big”
Tesla network planning — Backup deployment vector for agents
Workload Economics:
| Workload | Power Profile | Utilization | Revenue Model |
|---|---|---|---|
| Training | Bursty | 30-50% | R&D cost (no direct revenue) |
| Chatbot inference | Scales with users | Variable | Per-token |
| Human emulators | Sustained 24/7 | 80-95% | Per-agent/hour |
Revenue Potential (Human Emulators):
Each agent: ~100W inference compute (estimate)
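3 GW ÷ 100 W = capacity for ~30M concurrent agents, or ~720M agent-hours/day
At $5-10/hour (vs $25-50/hour human): $150-300M per hour of fleet operation, or ~$3.6-7.2B/day potential
Annual: ~$1.3-2.6T revenue at full utilization if all 3 GW served agents (a theoretical ceiling; the projected mix below gives emulators 30-50% of capacity)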
Projected Workload Mix (End 2026):
| Workload | % of Capacity | Confidence |
|---|---|---|
| Training (Grok 5/6, FSD) | 30-40% | High |
| Human Emulators (Macro Hard) | 30-50% | Medium-High |
| Grok inference (API/chat) | 10-20% | High |
| X/Twitter, other | 5-10% | Medium |
Key Insight: The infrastructure decisions only make economic sense if xAI is planning sustained, high-utilization inference workloads — which is exactly what millions of always-on agents require. They’re not building 3 GW for chatbots.
Impact on Predictions:
Macro Hard timeline: HIGH CONFIDENCE — Infrastructure commitment confirms intent
Labor substitution: ACCELERATING — Scale suggests 2026 deployment, not 2027+
“Accelerating” scenario: 68% → 70% — Power supply chain confirms aggressive scaling
Source: Doosan Enerbility order, Tesla Megapack deployments, @SawyerMerritt, supplier announcements
January 11, 2026
Event: GPT-5.2 Pro solves Erdős Problem #397 — proof accepted by Terence Tao
Key Data:
Neel Somani prompted GPT-5.2 Pro to solve Erdős Problem #397 (open since 1980)
Problem: Are there only finitely many solutions to ∏(2mᵢ choose mᵢ) = ∏(2nⱼ choose nⱼ) with distinct mᵢ, nⱼ?
GPT-5.2 Pro generated a negative answer with complete proof
Proof verified by Fields Medalist Terence Tao within hours
Proof formally verified in Lean using Harmonic’s Aristotle AI
Community rapidly extended the result with two-parameter infinite families (SharkyKesa)
The Proof (GPT-5.2 Pro): For any a ≥ 2, if c = 8a² + 8a + 1:
mᵢ = (a, 2a+2, c)
nⱼ = (a+1, 2a, c+1)
This generates infinitely many distinct-index solutions, disproving the conjecture.
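The family above can be checked numerically with exact integer arithmetic. One way to see why it works: C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1), and with c = 8a² + 8a + 1 we get c + 1 = 2(2a+1)² and 2c + 1 = (4a+1)(4a+3), so the ratios cancel. The sketch below is a sanity check rather than a proof, and the function names are illustrative.

```python
from math import comb, prod


def central(n):
    """Central binomial coefficient C(2n, n)."""
    return comb(2 * n, n)


def family(a):
    """Index triples (m_i) and (n_j) for parameter a, as given above."""
    c = 8 * a * a + 8 * a + 1
    return (a, 2 * a + 2, c), (a + 1, 2 * a, c + 1)


def is_solution(a):
    """True if all six indices are distinct and the two products match."""
    ms, ns = family(a)
    distinct = len(set(ms + ns)) == 6
    return distinct and prod(map(central, ms)) == prod(map(central, ns))


for a in range(2, 8):
    ms, ns = family(a)
    print(a, ms, ns, is_solution(a))
```

The a ≥ 2 condition matters: at a = 1 the indices a + 1 and 2a coincide, so the solution is no longer distinct-index.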
Why This Matters:
First AI-generated proof of open Erdős problem accepted by top mathematician — This is a milestone for AI in mathematics
Speed of verification — Submitted ~3:37 AM, Tao verified by ~4:23 AM, Lean formalization by ~4:30 AM
Community amplification — Within hours, mathematicians extended the result with new infinite families
Tao’s assessment: “AI tools are now capable enough to pick off the lowest hanging fruit amongst the problems listed as open in the Erdős problem database”
Formal verification pipeline — GPT-5.2 → Human review → Lean formalization is now a working workflow
Context from Tao:
These are “problems amenable to simple proofs using fairly standard techniques”
GPT-5.2 scores 77% on competition math but only 25% on open-ended research tasks
The 52-point gap shows AI excels at structured problems but struggles with novel mathematical reasoning
This is pattern matching on accessible problems, not genuine mathematical discovery (yet)
Related AI Math Achievements (Late 2025 - Early 2026):
Erdős #728 — GPT-5.2 (January 2026)
Erdős #124 — Harmonic’s Aristotle AI (November 2025)
Erdős #339 — GPT-5 Pro located existing 2003 proof
Impact on Predictions:
AI mathematical capability: ON TRACK — Confirms AI can solve “low-hanging fruit” open problems
Formal verification integration: ACCELERATING — GPT → Lean pipeline working in production
“Accelerating” scenario probability: 62% (unchanged) — This is expected capability at current frontier
Research contribution timeline: ON TRACK — AI contributing to math research as predicted for 2026
What This Doesn’t Mean:
AI is not yet solving genuinely hard open problems (FrontierMath Tier 4 still at 18.8%)
Human mathematicians still needed for problem selection, verification, and extension
The “lowest hanging fruit” framing is important — harder problems remain out of reach
Implications for Tracking:
Add “AI-solved open problems” as a new metric to track
Watch for AI tackling progressively harder Erdős problems
Monitor formal verification tool integration (Lean, Coq, Isabelle)
Source: erdosproblems.com/397, Terence Tao comments, Neel Somani ChatGPT share link
January 7, 2026
Event: NVIDIA announces Vera Rubin NVL72 — already in production
Key Data:
72 Rubin GPUs + 36 Vera CPUs in rack-scale platform
3,600 PFLOPS NVFP4 inference (2x Blackwell)
20.7 TB HBM4 memory @ 1,580 TB/s bandwidth
NVLink 6: 3.6 TB/s per GPU (scale-up), Quantum-X800 InfiniBand / Spectrum-X Ethernet (scale-out)
Training efficiency: 1/4 the GPUs vs Blackwell for same workload
Inference cost: 1/10 per million tokens vs Blackwell
Built on MGX NVL72 rack design — “seamless transition from prior generations”
80+ MGX ecosystem partners
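Dividing the rack-level figures by 72 gives the per-GPU numbers, which are easier to compare across generations. A minimal sketch using only the specs listed above; it assumes decimal units (1 TB = 1000 GB), and the dictionary keys are illustrative.

```python
# Rack-level specs copied from the list above.
rack = {
    "gpus": 72,
    "nvfp4_pflops": 3600,
    "hbm4_tb": 20.7,
    "hbm_bandwidth_tb_s": 1580,
}

per_gpu_pflops = rack["nvfp4_pflops"] / rack["gpus"]
per_gpu_hbm_gb = rack["hbm4_tb"] * 1000 / rack["gpus"]  # decimal TB -> GB
per_gpu_bandwidth_tb_s = rack["hbm_bandwidth_tb_s"] / rack["gpus"]

print(f"NVFP4 inference per GPU: {per_gpu_pflops:.0f} PFLOPS")
print(f"HBM4 per GPU: {per_gpu_hbm_gb:.1f} GB")
print(f"HBM bandwidth per GPU: {per_gpu_bandwidth_tb_s:.1f} TB/s")
```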
Why This Matters:
10x inference cost reduction — Dramatically accelerates deployment economics. If inference costs drop 10x, agentic AI becomes economically viable for far more use cases.
4x training efficiency — Labs can train larger models or iterate faster with same compute budget. This compresses timelines.
Already in production — Not a 2027 roadmap item. This is shipping now.
HBM4 at scale — 20.7 TB per rack with 1,580 TB/s bandwidth removes memory bottleneck for large context windows and MoE models.
Flywheel confirmation — NVIDIA’s pace (Hopper → Blackwell → Rubin in ~2 years) validates the AI-accelerated infrastructure thesis.
Impact on Predictions:
Infrastructure buildout: ACCELERATING — Rubin arriving faster than expected
Inference cost trajectory: ACCELERATING — 10x cost reduction enables new deployment patterns
“Accelerating” scenario probability: 58% → 62% — Hardware improvements outpacing software in some dimensions
Q1 2026 targets: Remain appropriate — Hardware enables but doesn’t guarantee benchmark progress
Implications for Tracking:
Update infrastructure section with Rubin specs
Watch for Rubin deployment announcements from major labs (AWS, Microsoft, Google, xAI)
Inference cost per token should drop significantly through 2026
Source: NVIDIA GTC announcement, January 7, 2026 (nvidia.com)
December 20, 2025
Event: METR publishes Claude Opus 4.5 time horizon evaluation
Key Data:
50%-time horizon: ~4h 49m (95% CI: 1h 49m–20h 25m) — highest published to date
80%-time horizon: 27m (vs GPT-5.1-Codex-Max’s 32m)
Opus shows “flatter logistic curve” — differentially succeeds on longer tasks
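The “flatter logistic curve” remark can be made concrete. Assuming success probability follows a logistic curve in log2 of task length (an assumption about the model form; the parameterization and function names below are illustrative, not METR's code), the 50% and 80% horizons together pin down the slope. A wider gap between the two horizons means a flatter curve.

```python
import math


def slope_from_horizons(p50_min, p80_min):
    """Slope b of P(t) = sigmoid(b * (log2(p50) - log2(t))).

    At the 80% horizon the logit is ln(0.8 / 0.2) = ln 4, so
    b = ln 4 / log2(p50 / p80).
    """
    return math.log(4) / math.log2(p50_min / p80_min)


def success_probability(task_min, p50_min, slope):
    """Modeled chance of completing a task of length task_min (minutes)."""
    logit = slope * (math.log2(p50_min) - math.log2(task_min))
    return 1 / (1 + math.exp(-logit))


opus_p50 = 4 * 60 + 49   # ~4h 49m, from the data above
opus_p80 = 27            # minutes, from the data above
opus_slope = slope_from_horizons(opus_p50, opus_p80)

print(f"Implied slope: {opus_slope:.3f} per doubling of task length")
for minutes in (27, 60, 289, 600):
    p = success_probability(minutes, opus_p50, opus_slope)
    print(f"{minutes:>4} min task: {p:.0%} modeled success")
```

Comparing slopes across models would need each model's 50% horizon; the source gives only the 80% figure for GPT-5.1-Codex-Max, so that comparison is not attempted here.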
METR Commentary:
Upper CI likely inflated due to limited long tasks in test suite
They’d be surprised if true 50% horizon is 20h+
Working on updating task suite for better long-task coverage
Impact on Predictions:
Status: ON TRACK → SLIGHTLY INCREASED
Validates “accelerating” scenario (now ~58-60% probability)
Q1 2026 target of 6-12hr time horizon remains achievable
No revision needed to existing timeline projections
Source: METR @METR_Evals on X, Dec 19, 2025