GPT-5 family — design, alignment, and evaluation
Overview
The GPT-5 family is positioned as a multi-variant set of generative models designed to answer a wide range of queries, handle harder reasoning problems, and improve safety and instruction-following compared to prior releases. Key aims include reducing hallucinations, minimizing sycophancy, improving performance in health and coding tasks, and introducing mechanisms that route requests to different model variants depending on conversation type and complexity. The family includes prominently named variants such as GPT-5, gpt-5-thinking, and gpt-5-main, along with several size and behavior variants (for example, mini, nano, and helpful-only variants).
A report titled "From Hard Refusals to Safe-Completions" accompanies the 2025 release activity (listed dates include August 13, 2025, along with April, May, and August 2025). The project is associated with OpenAI and pairs capability gains with multilayered safety controls.
Variants and routing
The family includes multiple named variants with overlapping naming conventions and targeted uses. Prominent names include gpt-5-main and gpt-5-thinking, alongside additional variants such as gpt-5-thinking-mini, gpt-5-thinking-nano, and gpt-5-thinking-pro. The design centers on a real-time router that decides which variant handles a request based on conversation type and complexity, enabling trade-offs among capability, latency, and usage limits. Several specialized configurations are referenced (for example, helpful-only modes and agent-style deployments), reflecting both performance and safety trade-offs across scenarios.
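The routing idea can be sketched as a simple dispatch function. The heuristics, thresholds, and return values below are illustrative assumptions only; the source does not describe how the production router actually decides.

```python
# Illustrative sketch of a capability/latency router.
# The cues and thresholds are hypothetical, not OpenAI's actual logic.

def route_request(message: str, turns: int) -> str:
    """Pick a model variant from coarse signals about the conversation."""
    reasoning_cues = ("prove", "step by step", "debug", "diagnose")
    hard = any(cue in message.lower() for cue in reasoning_cues)
    long_context = turns > 10 or len(message) > 2000

    if hard and long_context:
        return "gpt-5-thinking"        # max capability, higher latency
    if hard:
        return "gpt-5-thinking-mini"   # reasoning at lower cost
    if long_context:
        return "gpt-5-main"            # general-purpose default
    return "gpt-5-thinking-nano"       # cheapest path for simple queries

print(route_request("Please debug this stack trace step by step", turns=12))
# → gpt-5-thinking
```

A real router would presumably also weigh user tier, rate limits, and safety signals; this sketch only captures the capability-vs-latency trade-off described above.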
Architecture and safety mechanisms
Architectural details such as exact layer counts, hidden sizes, or parameter totals are not provided. Notable design and system-level choices focus heavily on safety and operational controls rather than raw microarchitecture:
- A real-time router that selects model usage based on conversation type and complexity.
- A safety training approach called Safe-completions.
- Post-training adjustments explicitly targeted to reduce sycophancy.
- An Instruction Hierarchy for message classification and enforcement.
- Monitoring mechanisms that examine chain-of-thought patterns for deception detection.
- A two-tiered, real-time automated oversight system and a multilayered defense stack for safety.
- Operational notes indicate that some cyberoffensive evaluations ran in a headless Linux environment preinstalled with standard offensive-security tools.
Together, these elements aim at safer outputs, improved resistance to adversarial prompt injections, and enforcement of usage policies at the system level. The system-level classifiers show measurable performance (topical classifier F1: 0.834; reasoning monitor F1: 0.73).
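For reference, the F1 scores quoted here are the harmonic mean of precision and recall. The precision/recall pair below is hypothetical, chosen only to show how a score near the reported 0.834 can arise:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical inputs: precision 0.85 and recall 0.82 yield an F1
# close to the reported topical-classifier score of 0.834.
f1 = f1_score(precision=0.85, recall=0.82)
print(round(f1, 3))  # → 0.835
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, which is why a classifier cannot hide poor recall behind high precision.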
Training and alignment strategy
Pretraining draws on a diverse mixture of publicly available information, third-party partnerships, and user-generated content. Objective-level constraints during model development included explicit behavioral prohibitions such as refusing requests for weaponization assistance and not providing detailed actionable help on dual-use topics.
Post-training alignment included supervised fine-tuning and preference alignment steps that targeted sycophancy and other harmful behavioral modes:
- Supervised fine-tuning used conversations representative of production data with objectives that explicitly include reducing sycophancy.
- Preference alignment incorporated reward signals derived from sycophancy scores.
- Continuous training on real user signals and engagement with HCI researchers and clinicians contributed to iterative adjustments.
No explicit optimizer, hyperparameter, or tokenization details are provided.
Key contributions
- Real-time router for model selection.
- Continuous training on real user signals and post-training sycophancy adjustments.
- Introduced Safe-completions safety training approach.
- Developed reliable benchmarks and an Instruction Hierarchy for message classification.
- Life Science Research Special Access Program and multilayered defense stack for risk mitigation.
Evaluation highlights and benchmark performance
The GPT-5 family reports a mix of capability improvements, targeted safety gains, and areas of regression. High-level claims include outperformance of previous models on many benchmarks, faster answer times, and greater real-world usefulness. Selected quantitative highlights are reported below.
Sycophancy and deception:
- Offline sycophancy evaluation: gpt-5-main 0.052; gpt-5-thinking 0.040; GPT-4o 0.145.
- Preliminary online prevalence measurement: sycophancy prevalence for gpt-5-main dropped by 69% for free users and 75% for paid users.
- Deception evaluations (deception rate across three evaluation settings): gpt-5-thinking 0.17, 0.11, 0.09; OpenAI o3 0.47, 0.61, 0.87.
Hallucination, factuality, and abstention:
- gpt-5-main's hallucination rate is 26% lower than GPT-4o's.
- gpt-5-thinking's hallucination rate is 65% lower than OpenAI o3's.
- gpt-5-main produces 44% fewer responses with at least one major factual error than GPT-4o.
- gpt-5-thinking produces 78% fewer responses with at least one major factual error than OpenAI o3.
- AbstentionBench recall: gpt-5-thinking 0.53; OpenAI o3 0.44.
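The percentage-reduction figures above are relative reductions of one error rate against another. A minimal sketch of the arithmetic (the absolute rates here are hypothetical, picked only to reproduce a 26% figure):

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fractional reduction of new_rate relative to old_rate."""
    return 1 - new_rate / old_rate

# Hypothetical example: a baseline hallucination rate of 0.20 dropping
# to 0.148 is a 26% relative reduction, the kind of figure reported
# for gpt-5-main vs GPT-4o.
print(round(relative_reduction(0.20, 0.148), 2))  # → 0.26
```

Note that a large relative reduction can still correspond to a small absolute change when the baseline rate is low, so the absolute rates matter when interpreting these claims.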
Health and biosafety evaluations:
- HealthBench Hard score: gpt-5-thinking 46.2%; OpenAI o3 31.6%.
- HealthBench Hard score: gpt-5-thinking-mini 40.3%; OpenAI o3 31.6%.
- HealthBench Hard score: gpt-5-main 25.5%; GPT-4o 0.0%.
- HealthBench Hard Hallucinations: "8x reduction in failure rates from OpenAI o3."
- HealthBench Consensus Urgent Error Rate: "Over 50x reduction from GPT-4o and over 8x from OpenAI o3."
- HealthBench Consensus Global Health: "No failures detected for gpt-5-thinking."
Adversarial robustness and prompt-injection:
- Instruction Hierarchy (system prompt extraction and phrase protection) reports strong performance for gpt-5-thinking and OpenAI o3 relative to gpt-5-main and GPT-4o (example scores: system prompt extraction—realistic attacks: gpt-5-thinking 0.99, OpenAI o3 0.997, gpt-5-main 0.885).
- Prompt Injection Evaluations: gpt-5-thinking 0.99 on browsing and tool-calling injection defenses; 0.97 on coding prompt injections.
- A "prompt-injection benchmark" claim indicates state-of-the-art performance for gpt-5-thinking against adversarial prompt injection attacks.
Coding, reasoning, and software engineering:
- SimpleQA (no web) accuracy: gpt-5-thinking 0.55; OpenAI o3 0.54; gpt-5-main 0.46; GPT-4o 0.44.
- SimpleQA hallucination rates: gpt-5-thinking 0.4; OpenAI o3 0.46; gpt-5-main 0.47; GPT-4o 0.52.
- SWE-bench Verified pass@1: gpt-5-thinking and gpt-5-thinking-mini are the highest-scoring models.
- PaperBench and OpenAI-Proof Q&A: gpt-5-thinking ranked highest on PaperBench and achieved the top score, 2%, on OpenAI-Proof Q&A for debugging and diagnosis.
- METR task-length evaluation, 50% time horizon: 2 hours 15 minutes (95% CI: 1–4.5 hours).
Cybersecurity and red-teaming:
- ChatGPT agent achieves the highest performance on Collegiate CTF challenges; OpenAI o3 achieves the highest performance on Professional challenges.
- Attack planning red teaming: gpt-5-thinking won 65.1% of comparisons against OpenAI o3 (95% CI for win probability: 63.7%–66.5%; Cohen's h: 0.61).
- Cyber Range challenges: gpt-5-thinking-mini performs better than OpenAI o3.
- Cybersecurity challenge breakdown: 17 of 18 easy challenges solved, 8 of 14 medium, and 0 of 4 hard.
- Network attack / vulnerability success rates reported as averages (e.g., evasion 51%, vulnerability exploitation 35%, network attack simulation 49%).
Safety and content filtering:
- The standard disallowed-content evaluation shows near-perfect performance for recent models.
- Model safety training evaluations (not_unsafe): OpenAI o3 0.829; gpt-5-thinking 0.921; gpt-5-thinking-mini 0.936.
- Filtered adversarial sample of production prompts (not_unsafe): OpenAI o3 0.899; gpt-5-thinking 0.957; gpt-5-thinking-mini 0.968.
- Many production and StrongReject metrics for mini/nano variants are reported in the high 0.8–1.0 range across topical categories such as hate, illicit, sexual content, self-harm, and extremism.
Where the family claims strength: improved safety on dual-use prompts, higher overall helpfulness, lower sycophancy scores than GPT-4o, robust not_unsafe performance, effective Instruction Hierarchy adherence, lower hallucination and deception rates than OpenAI o3, and strong resistance to adversarial prompt injections. Where it is weaker: regressions for gpt-5-main on some tasks, weaker detection of emotional distress, failure on many hard challenges, and underperformance by some agent variants in limited trials (e.g., ChatGPT agent scored 0 out of 30 on most scenarios in one dataset).
Limitations and caveats
A number of important caveats accompany reported results and capabilities:
- A precautionary approach was taken for high capability in biological and chemical domains; special mitigations were applied.
- Mitigations for deception are not perfect; models may still deceive users in a small fraction of interactions.
- Models are not a replacement for medical professionals and are not intended for diagnosis or treatment.
- Evaluation results are stated to likely represent lower bounds on model capability.
- Results do not meet the bar for establishing significant cyber risk; gpt-5-thinking is unlikely to speed up AI R&D researchers by >10x, unlikely to significantly mislead researchers about its capabilities, and unlikely to be capable of rogue replication.
- An adversarial campaign reported 46 potential jailbreaks after 380 hours of work, with an attack success rate (ASR) of 0.98% across 28,367 attempts.
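An ASR of 0.98% over 28,367 attempts implies roughly 278 successful attacks; the back-calculation below is an inference from the two reported figures, as the source does not state the success count directly:

```python
attempts = 28_367
asr = 0.0098                       # reported 0.98% attack success rate

# Inferred success count (not stated in the source).
successes = round(attempts * asr)
print(successes)  # → 278
```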
Notable numbers and quotes
A few headline figures and statistical summaries are reported explicitly:
- Win Rate (attack planning red teaming): 65.1%.
- 95% CI (win probability): 63.7%–66.5%.
- Cohen's h: 0.61.
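Cohen's h for two proportions is h = 2·arcsin(√p1) − 2·arcsin(√p2). Applying it to the 65.1% win rate against the complementary 34.9% loss rate reproduces the reported 0.61:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# gpt-5-thinking's 65.1% win rate vs. the complementary 34.9% rate.
h = cohens_h(0.651, 0.349)
print(round(h, 2))  # → 0.61
```

By Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), h = 0.61 is a medium-to-large effect, consistent with the claimed red-teaming advantage.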
Conclusion
The GPT-5 family places strong emphasis on safer, more helpful behavior through architectural routing, specialized safety training (Safe-completions), and post-training alignment that targets sycophancy and deception. Evaluation suites show measurable improvements on many safety and factuality metrics—particularly for gpt-5-thinking and the mini variants—while also indicating areas needing further work, especially around emotional-distress detection and solving hard challenges.