Conversation Summary: AI Model Comparison & Workflow Strategy

Date: May 25, 2026 Context: User evaluating AI coding assistants for product development, PRD creation, spec writing, and implementation.

1. Model Role Classification (The Hierarchy)

Role	Model	Analogy	Strengths
Staff/Principal	Claude Opus 4.7	CPO / Staff Engineer	Creative product thinking, nuanced trade-offs, challenges assumptions, handles ambiguity
Principal contender	Gemini 3.1 Pro	Strong Staff Engineer	Best Claude competitor at this tier, excellent long-context reasoning
Principal contender	GPT-5.5 (Spud)	Staff Engineer (literal)	Strongest execution, slightly less push-back, 72% fewer output tokens
Senior Engineer	DeepSeek V4	Reliable Senior	Precise, thorough, follows clear specs, great for execution
Value / 80% tasks	GLM-5.1 (Z.ai)	Mid-level with potential	~94% of Claude coding at 1/10th the cost, $3-49/mo flat rate
Budget throughput	MiniMax M2.7	Junior high-volume	Fast & cheap (10B active params), not for strategic decisions

2. The Prompt vs Model Breakdown

Why Claude feels smarter than it might actually be:

~60% of perceived quality = system prompt engineering (hidden, co-optimized with training)
~40% = model weights/architecture (base capability)
Claude's system prompt is proprietary, changes frequently, uses hidden special tokens

The co-optimization problem:

Claude's prompt is finely tuned to exploit how Claude was RLHF-trained
The same prompt on DeepSeek won't produce Claude-like behavior — different training data, reward model, architecture
Community attempts to replicate it only close ~10-15% of the gap, not 60%
On minimal-prompt tools (pi.dev, opencode default), the gap between models narrows significantly (~80-90% of gap closes)

What the "brain" of an LLM actually is:

Model weights = the actual brain (billions/trillions of learned parameters in a file)
Vector DB = NOT the brain — external memory for RAG/retrieval
Training data = NOT stored in the model — only compressed patterns from it
Architecture matters: attention mechanisms, layer counts, activation functions, MoE routing all differ between models

3. Benchmarks (May 2026)

Benchmark	GPT-5.5	Claude Opus 4.7	DeepSeek V4	GLM-5.1	MiniMax M2.7
SWE-bench Verified	88.7%	87.6%	~81%	77.8%	~78%
SWE-bench Pro	58.6%	64.3%	—	—	56.2%
Terminal-Bench 2.0	82.7%	69.4%	—	—	57%
Token efficiency	72% fewer output	Baseline	35-100x cheaper	~7x cheaper	~20x cheaper
Price (input/M)	$5	$5	$0.14-0.30	~$0.40	~$0.30

Key nuance: SWE-bench Pro (harder multi-file reasoning) still favors Claude. GPT-5.5 wins on terminal coding & token efficiency.

4. Detailed Model Profiles

Claude Opus 4.7

Released April 16, 2026
Strengths: Ambiguous problems, architectural decisions, creative reframing, product thinking
Weaknesses: Expensive ($5/$25 per M tokens), verbose output (72% more tokens than GPT-5.5)
Best for: Principal-level work, novel bug investigation, PRD creation
Subscription: $20-200/mo (usage-capped)

GPT-5.5 (Spud)

Released April 23, 2026 — first full retrain since GPT-4.5
Strengths: Token efficiency, terminal coding, well-defined tasks, Codex CLI with 8 parallel subagents
Weaknesses: More literal, less push-back on ambiguous requirements
Best for: Heavy implementation, CI/CD work, sandboxed execution
Codex CLI: Open source (Apache 2.0), Rust-based, sandboxed execution

DeepSeek V4

Key trait: "Technician not C-level" — excellent at executing clear specs, weaker at shaping vague ideas
Bug fixing: Great for reproducible bugs with clear specs; weaker on mysterious intermittent edge cases
CI/CD: Handles 80% of real-world CI failures (wrong versions, config mismatches, known patterns)
Weak for: The 20% requiring detective work across scattered context
Pricing: Free web chat (unlimited), 5M free API tokens (one-time), then $0.14-0.50/M tokens
Architecture: MoE, trained on Huawei Ascend chips, 1M context window

GLM-5.1 (Z.ai / Zhipu AI)

744B total params, 40B active (MoE), 256 experts
Trained entirely on 100,000 Huawei Ascend 910B chips (zero NVIDIA dependency)
MIT license (open weights)
~94% of Claude Opus 4.6 coding performance per self-reported benchmarks
Coding plan: $3-49/mo flat subscription (not pure token-based)
Anthropic-compatible endpoint — drop into Claude Code CLI with zero rewrite
Best for: Cost-sensitive teams needing Claude-like quality

MiniMax M2.7

230B total params, only 10B active per token (extremely efficient)
~78% SWE-bench Verified
~100+ TPS throughput (3x Claude speed)
$0.30/$1.20 per M tokens (~20x cheaper than Claude)
Open weights available
Best for: High-volume routine tasks, not principal-level decisions

5. Bug Fixing: Where Each Model Excels

Bug Type	Best Model	Why
Reproducible bug ("function X returns NaN when Y is negative")	DeepSeek	Pattern-based, in training data, clear specs
Ambiguous bug ("payment flow sometimes fails in prod, can't reproduce")	Claude	Hypothesis generation, connecting scattered clues
CI/CD: Wrong pnpm/node version	Either	Well-documented pattern
CI/CD: Prisma migration in Docker	Either	Known patterns
CI/CD: Race condition between cache layers	Claude	Requires detective work across logs, configs, docs

6. Hybrid Workflow: Claude + OpenCode (DeepSeek) in Same Project

The Strategy

Claude (Staff/Principal) --explores--> Architecture & PRD
                                          |
                                          v
                                  AGENTS.md / CLAUDE.md
                                  (captures decisions)
                                          |
                                          v
OpenCode/DeepSeek (Senior) --executes--> Implementation
                                          |
                                          v
                                  Claude (standby) --escalation--> Unresolved tasks

How It Works in Practice

Claude handles: Research, architecture design, PRD writing, establishing patterns, ambiguous problems
Document decisions: Both CLAUDE.md (Claude's conventions) and AGENTS.md (opencode instructions) coexist in the project root. Or alias one to the other via opencode config.
OpenCode/DeepSeek handles: Implementation against those patterns, feature building, bug fixes (80% of routine ones), CI/CD troubleshooting
Escalation path: If DeepSeek hits something uncertain, tag Claude

Requirements for Success

AGENTS.md must be thorough — Claude's decisions, patterns, conventions, architectural rationale documented
Review output — since DeepSeek won't push back on ambiguous specs
Claude quota savings: ~60-70% reduction for same throughput
Risk: DeepSeek assumes you meant what you said rather than probing for unspoken intent

Concrete Example

Claude explores auth flow architecture, writes spec with trade-off analysis
Claude's decisions recorded in AGENTS.md (chosen approach, rejected alternatives, conventions)
DeepSeek implements: "Add JWT auth following the pattern in AGENTS.md section 3"
If Prisma migration fails in CI, DeepSeek reads the error, checks AGENTS.md for DB conventions, fixes it

7. Pricing Models Explained

Token-Based (Pay-as-you-go)

How it works: Pay per million tokens input/output
Pros: Flexible, pay for what you use
Cons: Unpredictable cost, risk of bill shock
Examples: OpenAI API, Anthropic API, DeepSeek API, Google API

Subscription (Flat Rate)

How it works: Fixed monthly price with usage caps
Pros: Predictable, no overspending anxiety
Cons: Caps limit heavy usage, unused quota wasted
Examples:
- Claude Pro/Max: $20-200/mo
- GLM Coding Plan: $3-49/mo
- MiniMax: $10-50/mo
- ChatGPT Plus: $20/mo

Reality Check

Both models are effectively the same — subscriptions are just prepaid token buckets with hidden caps ("X messages per 5 hours"). The cap protects the provider from losing money on heavy users.

Why Token-Based Dominates

High variance in user usage patterns
GPU compute is expensive — providers can't absorb heavy users at flat rates
Most providers offer BOTH: subscription for casual users, API for heavy users

Recommendation for Budget-Conscious

GLM Coding Plan ($3-49/mo) = best Claude alternative with predictable monthly cost
DeepSeek API ($0.14-0.50/M tokens) = cheapest for high-volume API usage
Claude Max ($20-200/mo) = best quality but expensive

8. OpenCode-Specific Notes

OpenCode supports 75+ providers including free models
Free tier varies by provider and region (typically daily-reset caps)
Supports AGENTS.md for project-level instructions
Runs in terminal, desktop app, or IDE extension
LSP-enabled for better code understanding

9. Key Takeaways

No single best model — they excel at different roles in the workflow
60% of Claude's magic is prompt, not model — but you can't replicate it on other models
Claude explores, DeepSeek exploits — hybrid workflow saves 60-70% Claude quota
Document decisions in AGENTS.md/CLAUDE.md for smooth handoff between models
GPT-5.5 is the closest Claude competitor at principal level, Gemini 3.1 Pro is second
GLM-5.1 is the best value subscription for Claude-like quality
DeepSeek = technician (senior engineer), Claude = principal (staff/CPO)
Fixed pricing (subscriptions) and token-based pricing are effectively the same — just different packaging