GPT-4 (OpenAI): Capabilities, Design, Training, and Evaluation
Overview
GPT-4 is a large-scale language model developed by OpenAI and described in the "GPT-4 Technical Report". It improves understanding and generation of natural language text and extends those capabilities to accept image inputs as part of its multimodal interface. The model is positioned for use in dialogue systems, text summarization, machine translation, and other tasks requiring advanced language understanding. A stated aim is improved safety and alignment, including reductions in harmful content generation and lower incorrect behavior rates on sensitive prompts.
Key contributions
- Multimodal capabilities: the model accepts image and text inputs and produces text outputs.
- Human-level performance on various professional and academic benchmarks, including passing a simulated Uniform Bar Examination with a score in the top 10% of test takers.
- Open-sourcing of the OpenAI Evals framework.
- Engagement of domain experts for adversarial testing and a model-assisted safety pipeline.
Architecture and design
The architecture is a Transformer-based design with multimodal input support. A notable design emphasis is predictable scaling: infrastructure and optimization methods were engineered to behave consistently across a wide range of model scales, enabling accurate predictions of final loss and some capabilities from much smaller training runs. The model supports few-shot prompting and chain-of-thought prompting, accepts prompts consisting of interleaved images and text, and generates text outputs conditioned on them.
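The predictable-scaling claim amounts to fitting a scaling law on small training runs and extrapolating to a much larger one. A minimal sketch of that idea, using a hypothetical irreducible-loss power law L(C) = a·C^b + k and synthetic data (the functional form is a common convention; none of these numbers come from the report):

```python
import numpy as np

def fit_power_law(compute, losses):
    """Fit L(C) = a * C**b + k by scanning the exponent b and solving the
    remaining linear least-squares problem for (a, k) at each candidate b."""
    best = None
    for b in np.linspace(-1.0, -0.01, 100):
        X = np.column_stack([compute**b, np.ones_like(compute)])
        (a, k), *_ = np.linalg.lstsq(X, losses, rcond=None)
        err = float(((X @ np.array([a, k]) - losses) ** 2).sum())
        if best is None or err < best[0]:
            best = (err, a, b, k)
    _, a, b, k = best
    return a, b, k

def predict_loss(target_compute, a, b, k):
    """Extrapolate the fitted law to a larger compute budget."""
    return a * target_compute**b + k

# Synthetic small-run results following a known law (illustrative only)
compute = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
losses = 2.0 * compute**-0.12 + 1.5
a, b, k = fit_power_law(compute, losses)
predicted = predict_loss(1e10, a, b, k)  # loss forecast for a 1000x larger run
```

The report describes predicting GPT-4's final loss from models using far less compute; this sketch only illustrates the fit-and-extrapolate mechanic on clean synthetic data.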
Training and alignment
Pretraining: Training data includes publicly available data and data licensed from third-party providers. The primary pretraining objective is to predict the next token in a document; the reported "final loss" refers to this next-token prediction loss at the end of training. Specific token counts, optimizer details, and hyperparameters are not disclosed.
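The next-token objective is the standard cross-entropy loss over the vocabulary at each position. A generic sketch of that loss (not OpenAI's implementation), with shapes chosen for illustration:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting each next token.

    logits: (seq_len, vocab) unnormalized scores from the model.
    targets: (seq_len,) integer ids of the actual next tokens.
    """
    # Subtract the row max for numerical stability before the softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability assigned to each true next token, averaged.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))   # toy model outputs
targets = rng.integers(0, 10, size=5)
loss = next_token_loss(logits, targets)
```

Minimizing this quantity over a large corpus is the "predict the next token" objective the report names; everything else (tokenization, batching, optimizer) is unspecified there.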
Post-training and alignment: Supervised fine-tuning and alignment incorporated Reinforcement Learning from Human Feedback (RLHF). Post-training used safety-relevant training prompts with an explicit objective of rewarding the model for refusing harmful requests. The development process included engagement of domain experts for adversarial testing and a model-assisted safety pipeline intended to reduce harmful outputs.
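RLHF pipelines typically train a reward model on human preference comparisons before using it to fine-tune the policy. The report does not spell out OpenAI's exact loss, but the standard pairwise (Bradley-Terry style) objective can be sketched as:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss commonly used for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected), averaged over comparison pairs.
    Small when the reward model ranks the human-preferred response higher."""
    margin = np.asarray(reward_chosen, dtype=float) - np.asarray(reward_rejected, dtype=float)
    # log1p(exp(-x)) == -log(sigmoid(x)), written in a numerically safer form.
    return float(np.mean(np.log1p(np.exp(-margin))))

# Reward model already ranks the preferred answers higher -> small loss
good = preference_loss([2.0, 1.5], [0.0, -0.5])
# Ranks a preferred answer lower -> large loss, pushing the rewards apart
bad = preference_loss([0.0], [2.0])
```

Gradients of this loss adjust the reward model so preferred responses score higher; the tuned reward model then supplies the reinforcement signal, including the stated reward for refusing harmful requests.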
Evaluation: headline outcomes and benchmark performance
The evaluation reported numerous headline results asserting strong performance across academic, professional, and standard NLP benchmarks. Headline claims include top-10% performance on a simulated bar exam, superior performance versus previous models on multiple-choice and reasoning tasks, significant reductions in toxic generation rates, and improvements in other safety metrics.
Academic and professional exams (selected results)
- Uniform Bar Exam (MBE+MEE+MPT): GPT-4 298 / 400 (~90th); GPT-4 (no vision) 298 / 400 (~90th); GPT-3.5 213 / 400 (~10th).
- LSAT: GPT-4 163 (~88th); GPT-4 (no vision) 161 (~83rd); GPT-3.5 149 (~40th).
- SAT Evidence-Based Reading & Writing: GPT-4 710 / 800 (~93rd); GPT-4 (no vision) 710 / 800 (~93rd); GPT-3.5 670 / 800 (~87th).
- SAT Math: GPT-4 700 / 800 (~89th); GPT-3.5 590 / 800 (~70th).
- GRE Quantitative: GPT-4 163 / 170 (~80th); GPT-4 (no vision) 157 / 170 (~62nd); GPT-3.5 147 / 170 (~25th).
- GRE Verbal: GPT-4 169 / 170 (~99th); GPT-4 (no vision) 165 / 170 (~96th); GPT-3.5 154 / 170 (~63rd).
- USABO Semifinal Exam 2020: GPT-4 87 / 150 (99th-100th); GPT-3.5 43 / 150 (31st-33rd).
- AP exams: many GPT-4 scores at the top ranges, e.g. AP Biology 5 (86th-100th); AP Calculus BC GPT-4 5 (85th-100th) versus GPT-3.5 4 (62nd-85th).
- Leetcode: GPT-4 21 / 80 on medium problems (GPT-3.5: 8 / 80) and 3 / 45 on hard problems (GPT-3.5: 0 / 45).
Standard NLP and reasoning benchmarks
On multitask language and reasoning benchmarks, reported results include: MMLU 86.4% for GPT-4, versus 70.0% for GPT-3.5 and 70.7% for the best external LM state of the art; HellaSwag 95.3% versus 85.5% for GPT-3.5; AI2 Reasoning Challenge (ARC) 96.3% versus 85.2%. On GSM-8K, GPT-4 achieves 92.0% versus 57.1% for GPT-3.5. On DROP (F1 score), GPT-4 reaches 80.9 versus 64.1 for GPT-3.5. On translated MMLU, GPT-4 shows improved performance across various languages, including low-resource languages, and surpasses the English-language state of the art in 24 of 26 languages considered. On HumanEval, capability predictions extrapolated from smaller models using the mean log pass rate measure were reported to be accurate on subsets of problems.
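The "mean log pass rate" measure used for the HumanEval predictions is simply the average of log pass rates across problems, which smooths the metric and makes it extrapolate better than raw pass rate. A minimal sketch with made-up per-problem rates (not values from the report):

```python
import numpy as np

def mean_log_pass_rate(pass_rates):
    """Average log pass rate across a set of problems.

    pass_rates: per-problem success probabilities in (0, 1]; a value of 1.0
    (always solved) contributes 0, and rarer successes contribute large
    negative values, so the mean rises smoothly as capability improves.
    """
    rates = np.asarray(pass_rates, dtype=float)
    return float(np.log(rates).mean())

# Hypothetical pass rates for three problems of increasing difficulty
metric = mean_log_pass_rate([1.0, 0.5, 0.25])
```

A capability forecast in this style fits the metric against compute on small models and extrapolates, analogous to the loss prediction described under architecture and design.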
Safety and harmful-content metrics
RealToxicityPrompts toxic generation rates are reported as 0.73% for GPT-4 and 6.48% for GPT-3.5. On sensitive requests, GPT-4 is reported to respond in accordance with policy 29% more often than previous models. Overall safety-related contributions include lower incorrect-behavior rates on sensitive and disallowed prompts and significant improvements in safety metrics.
Calibration and post-training effects
One reported observation is that post-training (including RLHF) can hurt calibration significantly ("Post-training hurts calibration significantly"). TruthfulQA evaluations are reported to show improvements for GPT-4 relative to GPT-3.5, including higher accuracy under zero-shot, few-shot, and after RLHF fine-tuning.
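Calibration claims of this kind are typically quantified with expected calibration error (ECE), which measures how far stated confidences drift from observed accuracy. A minimal sketch of that metric (the report's exact procedure is not reproduced here), with synthetic data:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the absolute gap
    between accuracy and mean confidence per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Well calibrated: 95%-confident answers that are right 95% of the time
calibrated = expected_calibration_error(np.full(100, 0.95), np.array([1] * 95 + [0] * 5))
# Overconfident: the same stated confidence with only 50% accuracy
overconfident = expected_calibration_error(np.full(100, 0.95), np.array([1, 0] * 50))
```

A post-training-induced calibration drop corresponds to ECE rising between the pretrained and RLHF-tuned models at the same accuracy level.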
Where the model is reported to win
Key areas of relative strength include: strong performance on professional and academic benchmarks; multimodal capabilities; outperforming GPT-3.5 on multiple-choice questions; responses preferred over GPT-3.5's on 70.2% of prompts; reduced hallucinations relative to GPT-3.5; higher accuracy on adversarially designed factuality evaluations; and improved safety metrics.
Strengths and comparative claims
The report emphasizes that GPT-4 outperforms existing language models on many benchmarks and often substantially outperforms GPT-3.5. Examples include statements that GPT-4 achieves the highest possible score on AP Biology (5/5), surpasses other models on MMLU (86.4% vs. 70.0% for GPT-3.5), and achieves top-tier results on several standardized exams. Safety-oriented outcomes such as significant reductions in toxic generation (RealToxicityPrompts: 0.73% for GPT-4 vs. 6.48% for GPT-3.5) are highlighted as measurable improvements.
Limitations, weaknesses, and open caveats
Reported weaknesses and limitations include continued unreliability and the potential for hallucinations, a limited context window, and susceptibility to reasoning errors and adversarial "jailbreaks" that can elicit harmful responses. The model lacks knowledge of events after September 2021. Post-training alignment is reported to reduce calibration. Contamination checks revealed that portions of BIG-bench were mixed into the training set. The evaluation notes that GPT-4 can confidently provide incorrect predictions and that various biases remain in its outputs, requiring further characterization and management. Summary statements include: "Similar limitations to earlier GPT models", "Care should be taken when using outputs in contexts where reliability is important", and "Possibility of generating harmful content still exists."
Notable numbers and claims
A highlighted comparative claim is that "GPT-4 scores 19 percentage points higher than GPT-3.5 on factuality evaluations." Other prominent numeric results include the Uniform Bar Exam: GPT-4: 298 / 400 (~90th) versus GPT-3.5: 213 / 400 (~10th); MMLU: 86.4% for GPT-4 versus 70.0% for GPT-3.5; and RealToxicityPrompts: 0.73% for GPT-4 versus 6.48% for GPT-3.5.
Final characterization
GPT-4 is presented as a multimodal Transformer with substantial reported gains over prior models on a broad set of benchmarks and measured safety metrics. The development emphasized predictable scaling, adversarial testing with domain experts, and RLHF-based post-training for alignment. Reported limitations emphasize that reliability, calibration, and bias remain areas for continued attention.