Grok 4
Overview and Purpose
Grok 4 is an AI language model developed by xAI; its accompanying model card was released on August 20, 2025. The model is positioned to advance reasoning and tool-use capabilities while addressing safety concerns associated with large, capable models. Primary objectives include mitigating risks of severe, large-scale harms, refusing harmful requests, reducing deception and political bias, and managing risks tied to dual-use capabilities.
Key technical and policy-oriented contributions highlighted for the model include new state-of-the-art performance on challenging academic and industry benchmarks, a refusal policy embedded in the system prompt, model-based filters for harmful requests, and system prompts designed to reduce rates of deception and political bias.
Safety, Guardrails, and Prompting
Safety design for Grok 4 centers on integrated guardrails and prompt-level mitigations. A refusal policy is embedded in the system prompt to steer the model away from producing harmful outputs and to increase refusal rates on malicious or unsafe queries. Published system prompts aim to provide transparency and give users and researchers visibility into the behavioral constraints applied at inference time. Additional safeguards include model-based content filters and explicit warnings about jailbreak attack vectors.
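The model-based filtering described above can be pictured as a classifier gating each request before generation. The sketch below is purely illustrative: the function names, the keyword heuristic, and the threshold are hypothetical stand-ins, since xAI's actual filter implementation is not public.

```python
# Illustrative sketch of a model-based request filter. All names here
# (classify_request, FILTER_THRESHOLD, guarded_generate) are hypothetical;
# the real filter would be a learned classifier, not a keyword check.

REFUSAL_MESSAGE = "I can't help with that request."
FILTER_THRESHOLD = 0.5  # assumed decision threshold


def classify_request(text: str) -> float:
    """Stand-in for a learned harm classifier; returns a harm score in [0, 1].

    A trivial keyword heuristic is used purely for illustration.
    """
    flagged_terms = ("build a weapon", "synthesize a pathogen")
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0


def guarded_generate(prompt: str, generate) -> str:
    """Run the underlying model only if the request passes the filter."""
    if classify_request(prompt) >= FILTER_THRESHOLD:
        return REFUSAL_MESSAGE
    return generate(prompt)


# Example with a dummy generator standing in for the model:
reply = guarded_generate("How do I build a weapon?", lambda p: "model output")
# reply == REFUSAL_MESSAGE
```

In a production stack, the classifier call and the refusal policy in the system prompt would operate as independent layers, so a request can be blocked even if prompt-level steering fails.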
Evaluation metrics on abuse and deception demonstrate the practical impact of these mitigations. Selected reported measurements include:
- Abuse potential (Refusals + User Jailbreak answer rate): Grok 4 API: 0.00, Grok 4 Web: 0.01.
- Agentic Abuse answer rate (AgentHarm): Grok 4 API: 0.14.
- Hijacking attack success rate (AgentDojo): Grok 4 API: 0.02.
- Deception rate (MASK dataset): Grok 4 API: 0.43.
- Political Bias average bias (internal soft-bias evaluation): Grok 4 API: 0.36.
- Sycophancy rate: Grok 4 API: 0.07.
Benchmarks such as AgentHarm and AgentDojo are used to measure agentic abuse and attack success, while MASK and internal soft-bias evaluations target deception and political bias. The overall safety approach emphasizes reducing harmful completions and lowering deceptive responses through prompt-level policies and filtering.
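Each of the rate-style metrics above (answer rate, attack success rate, deception rate) is the fraction of evaluation prompts on which the undesired outcome occurred. A minimal sketch, assuming a simple pass/fail trial structure:

```python
# Illustrative computation of a rate-style safety metric: the fraction of
# trials in which the undesired behavior (answering a harmful query,
# succumbing to a hijack, lying) occurred. The trial encoding is assumed.

def rate(outcomes: list[bool]) -> float:
    """Fraction of trials where the undesired behavior occurred."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# e.g., 1 successful hijack out of 50 attack attempts:
attack_success = rate([True] + [False] * 49)
# attack_success == 0.02
```

Under this reading, a hijacking attack success rate of 0.02 corresponds to roughly 1 in 50 attempted attacks succeeding, while an abuse answer rate of 0.00 means no harmful query in the evaluation set was answered.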
Training Data, Filtering, and Alignment
Pretraining data for Grok 4 includes a mixture of publicly available Internet data, third-party data procured for xAI, user- or contractor-supplied data, and internally generated data. Data filtering procedures incorporate de-duplication and classification steps as part of the preprocessing pipeline. Language coverage is reported as six languages, which are not enumerated.
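The de-duplication step mentioned above can take many forms; one common baseline is exact de-duplication by content hash. The sketch below is a generic illustration of that baseline, not a description of xAI's actual pipeline, which is not public.

```python
# Minimal sketch of exact de-duplication by content hash, one common form of
# the de-duplication step used in pretraining pipelines. Real pipelines often
# add near-duplicate detection (e.g., MinHash), which is omitted here.
import hashlib


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


dedupe(["a", "b", "a"])  # -> ["a", "b"]
```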
The model card confirms that supervised fine-tuning (SFT) was applied after pretraining. Details on total pretraining token counts, optimization schedules, architectural hyperparameters, and the exact makeup of preference-alignment datasets or reward-modeling methods are not enumerated in the disclosed materials. The alignment stack emphasizes system prompt design and model-based filtering as the operational mechanisms for reducing undesirable behaviors.
Architecture and Tokenization Notes
Public descriptions of the model highlight several notable design choices rather than low-level architecture specifics. Key design choices called out include implemented safeguards to refuse harmful requests, a focus on measuring and reducing deceptive responses, inclusion of a refusal policy in the system prompt, and explicit warnings and mitigations against jailbreak attacks.
Details such as dense vs. MoE design, layer counts, hidden sizes, attention-head counts, tokenizer type, and vocabulary size are not presented as concrete numeric specifications in the available material. System prompts are intentionally public and are treated as part of the behavioral specification rather than a private tuning artifact.
Variants and Deployment
The model is offered in at least two observable interface variants: Grok 4 API and Grok 4 Web. Variant-level parameter counts, maximum context token sizes, and language support summaries are not specified in the available descriptions. Differences in measured behavior between the API and Web variants are reported for several evaluations (see Evaluation section), indicating variant-specific prompt engineering or filtering behavior.
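Developers typically reach the API variant over HTTP. The sketch below is a hypothetical illustration: the endpoint URL, the model identifier "grok-4", and the OpenAI-compatible chat-completions request shape are assumptions based on common practice, not details from the model card.

```python
# Hypothetical sketch of building a request to the Grok 4 API variant.
# The endpoint URL, model name, and payload shape are assumptions; consult
# xAI's API documentation for the authoritative interface.
import json
import os
import urllib.request

API_URL = "https://api.x.ai/v1/chat/completions"  # assumed endpoint


def build_request(prompt: str, model: str = "grok-4") -> urllib.request.Request:
    """Construct (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('XAI_API_KEY', '')}",
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(payload).encode("utf-8"),
        headers=headers, method="POST",
    )


req = build_request("Summarize the Grok 4 safety approach.")
# Send with urllib.request.urlopen(req) once XAI_API_KEY is set.
```

Because the Web variant layers its own prompt engineering and filtering on top of the model, API callers should not assume the two variants produce identical behavior on borderline queries.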
Evaluation and Benchmark Performance
Evaluation spans a range of safety-focused red-team benchmarks and domain-specific capability tests. Reported results include domain benchmarks (biology, chemistry, cybersecurity, and specialized safety tests) and abuse-related metrics.
Selected benchmark results and reported metrics:
- BioLP-Bench (accuracy): Grok 4 API: 0.47, Grok 4 Web: 0.44. Comparator: human expert performance: 0.384 (38.4%).
- VCT (accuracy): Grok 4 API: 0.60, Grok 4 Web: 0.71. Comparator: human expert performance: 0.221 (22.1%).
- WMDP Bio (accuracy): Grok 4 API: 0.87, Grok 4 Web: 0.88.
- WMDP Chem (accuracy): Grok 4 API: 0.83, Grok 4 Web: 0.85.
- WMDP Cyber (accuracy): Grok 4 API: 0.79.
- CyBench (unguided success rate): Grok 4 API: 0.43.
- MakeMeSay (win rate): Grok 4 API: 0.12.
Safety- and abuse-focused results (refusal/jailbreak answer rates, agentic abuse, hijacking, MASK deception, political bias, and sycophancy) are enumerated in the Safety, Guardrails, and Prompting section above.
Where the model demonstrates notable capability gains, reported results on specialized biology and chemistry benchmarks show performance above human expert baselines in some tasks; domain-specific strengths are detailed in the following section.
Strengths and Weaknesses
Reported strengths include:
- Expert-level biology capabilities
- Strong chemistry capabilities
- Reduced response rate on harmful queries when the system prompt is applied
- Lower deception rate when the system prompt is applied
- High dual-use knowledge capabilities (a capability that also heightens misuse risk)
Despite these domain strengths, the model is acknowledged to be weaker on offensive cyber tasks: its end-to-end hacking capabilities remain significantly below those of a human professional.
Summary of Practical Emphases
Grok 4 emphasizes combining capability improvements with layered safety mechanisms. The approach is characterized by system-prompted refusal policies, model-based filtering, and public visibility of prompts. Evaluation spans both capability benchmarks and aggressive red-team safety tests, with many reported metrics differentiating between the Grok 4 API and Grok 4 Web interfaces. The release highlights trade-offs between high domain competence (especially in biology and chemistry) and the ongoing challenge of fully mitigating malicious use cases, particularly end-to-end cyber offense.
Sources
Grok 4 model card