Phi-3 family — model design, training, and evaluation

Model family overview

Phi-3 is a family of transformer decoder models developed by Microsoft, with multimodal (vision-language) variants, designed to deliver high-performance language and vision-language capabilities at small to medium model scales, with a stated focus on on-device deployability and safety alignment. Key positioning claims include delivering high-performance language model capabilities on mobile devices, transforming a base language model into an AI assistant for efficient and safe user interaction, and addressing safety alignment and multimodal reasoning (image understanding, video summarization).

Key engineering and alignment contributions called out include the use of high-quality curated training data, a blocksparse attention module that reduces KV-cache memory and improves training/inference speed, adoption of a Mixture-of-Experts architecture for efficiency, supervised finetuning (SFT) and direct preference optimization (DPO) for preference alignment, and an iterative red-teaming process plus automated testing and safety post-training.

Variants

The Phi-3 family includes several documented model variants, from ~3.8B-parameter dense models up to Mixture-of-Experts configurations. Prominent variants and their parameter counts:

  • phi-3-mini — 3.8B
  • phi-3-small — 7B
  • phi-3-medium — 14B
  • phi-3.5-MoE — 42B total parameters, 6.6B activated parameters (MoE)
  • phi-3.5-Vision — 4.2B (also described as a 3.8B decoder plus a 0.3B image encoder)

Each variant may expose a different maximum context window: reported values include 4K, 8,192, and up to 128,000 tokens depending on variant and configuration. Language support statements also vary by variant; for example, phi-3-mini is described as supporting English, while other 128K-context variants list Arabic, Chinese, Russian, Ukrainian, and Vietnamese among supported languages. Tokenizer compatibility is noted across variants (see the Tokenizer section).

Architecture and notable design choices

Phi-3 models are based on a transformer decoder architecture with explicit multimodal integration for vision-forward variants. Design elements reported include:

  • Use of mixed dense and MoE layers: the family includes both dense decoders and MoE configurations (e.g., phi-3.5-MoE).
  • Layer and width specs (variant-specific): reported layer counts include 32 (phi-3-small) and 40 (phi-3-medium); hidden sizes include 3072 (base), 4096 (phi-3-small), and 5120 (phi-3-medium). Attention heads reported include 32 (base and phi-3-small) and 40 (phi-3-medium).
  • Attention and context engineering:
      • Blocksparse attention in alternating layers to reduce KV-cache size and improve training/inference speed.
      • LongRope for extended context length, combined with a mixed context-window approach.
      • Grouped-query attention, with 4 query heads sharing 1 key/value head.
      • Alternation of dense attention layers and blocksparse attention layers to balance efficiency and capability.
  • Multimodal and vision components:
      • Image encoder reported as CLIP ViT-L/14 for vision-enabled variants.
      • Transformer decoder for vision tied to phi-3.5-mini or a similar decoder backbone.
      • Dynamic cropping strategy for handling high-resolution images.
  • Training and alignment process that included iterative red-teaming and curated datasets informed by adversarial testing.
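The grouped-query scheme above (4 query heads sharing one key/value head) can be sketched in plain NumPy. Head counts and dimensions below are illustrative, not the model's, and the blocksparse masking is omitted; only a causal mask is applied:

```python
import math
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, group=4):
    """Toy grouped-query attention: every `group` query heads share one KV head.

    x: (seq, d_model); wq: (d_model, n_q_heads * d_head);
    wk, wv: (d_model, (n_q_heads // group) * d_head).
    Phi-3-small is reported to use group=4 (4 queries per shared key/value).
    """
    seq, _ = x.shape
    n_kv_heads = n_q_heads // group
    d_head = wq.shape[1] // n_q_heads

    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)

    # Broadcast each KV head to its group of query heads; only the smaller
    # k/v tensors would need to live in the KV cache.
    k = np.repeat(k, group, axis=1)          # (seq, n_q_heads, d_head)
    v = np.repeat(v, group, axis=1)

    scores = np.einsum("qhd,khd->hqk", q, k) / math.sqrt(d_head)
    # Causal mask: a position attends only to itself and earlier tokens.
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", weights, v)
    return out.reshape(seq, n_q_heads * d_head)
```

With group=4, the cached k/v tensors are a quarter the size of the query projection, which is the KV-cache saving the design targets.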

Tokenizer, prompt format, and input handling

Tokenizer and prompt choices reported:

  • Tokenizer types: the Llama-2 tokenizer is reported for some variants and tiktoken for others, which is consistent with the differing vocabulary sizes below.
  • Vocab sizes: reported values include 32064 (base), 100352 (phi-3-small), and 32064 (phi-3.5-MoE).
  • Prompt/chat format: prompts use chat-style delimiters such as "<|user|>\nQuestion<|end|>\n<|assistant|>". Prompts frequently include explicit instructions to select a single letter, or to answer with a single word or phrase, for multiple-choice tasks.
  • System prompt notes: no special tokens for multiple-choice questions are specified, and images are placed as the first item in multimodal prompts.
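A minimal prompt builder following these delimiters might look as follows. The `<|system|>` and `<|image_1|>` tokens are assumptions for illustration; exact special tokens vary by variant:

```python
def build_chat_prompt(user_message, system_message=None, image_tag=None):
    """Assemble a Phi-3-style chat prompt from the delimiters described above.

    image_tag (e.g. an assumed "<|image_1|>" placeholder) is placed first in
    the user turn, matching the note that images lead multimodal prompts.
    """
    parts = []
    if system_message is not None:
        # "<|system|>" is an assumed delimiter, mirroring the user/assistant ones.
        parts.append(f"<|system|>\n{system_message}<|end|>")
    body = user_message if image_tag is None else f"{image_tag}\n{user_message}"
    parts.append(f"<|user|>\n{body}<|end|>")
    parts.append("<|assistant|>")  # generation continues from here
    return "\n".join(parts)

print(build_chat_prompt("What is shown in this chart?", image_tag="<|image_1|>"))
```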

Training: pretraining and post-training alignment

Pretraining

  • Total pretraining tokens are reported in multiple figures: 3.3T, 4.8T (phi-3-medium), 0.5T, and 33B. These appear as separate totals associated with different variants or training phases.
  • Data mixture: heavily filtered publicly available web data, synthetic LLM-generated data, interleaved image-text documents, image-text pairs (e.g., FLD-5B), synthetic data from OCR of PDF files, and datasets for chart/table comprehension.
  • Objective: next-token prediction on text tokens.
  • Hyperparameters and compute notes: training uses bfloat16, GEGLU activations, and Maximal Update Parametrization (muP).
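The next-token prediction objective is the standard causal-LM cross-entropy; this NumPy sketch is generic and not tied to Phi-3's actual training stack:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for next-token prediction.

    logits: (seq, vocab) decoder outputs; token_ids: (seq,) input token ids.
    Position t's logits are scored against the token at position t+1.
    """
    # Log-softmax with max subtraction for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    targets = token_ids[1:]                      # predict token t+1 from prefix
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()
```

With uniform logits over a vocabulary of size V, the loss is log(V), a useful sanity check for an untrained model.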

Post-training alignment and finetuning

  • Supervised finetuning (SFT) was applied using highly curated, high-quality datasets across diverse domains, including public and large-scale multimodal instruct-tuning datasets.
  • Preference alignment was performed with direct preference optimization (DPO) and safety alignment, using datasets tailored to red-team insights.
  • Safety post-training, red-teaming, and automated testing are emphasized as part of the alignment pipeline.
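For reference, the standard DPO objective compares policy and reference-model log-probabilities of a chosen/rejected response pair; the beta value below is a common default in the DPO literature, not a value reported for Phi-3:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from summed response log-probabilities.

    pi_* are the policy's log-probs of the chosen/rejected responses; ref_*
    are the frozen reference model's. Minimizing this pushes the policy to
    prefer the chosen response more strongly than the reference does.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

At margin 0 (policy equals reference) the loss is log 2; it falls as the policy's relative preference for the chosen response grows.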

Evaluation and benchmark performance

Evaluation spans a broad set of single- and multi-modal benchmarks. Selected highlights and representative scores (values reproduced as reported):

  • MMLU (0/5-shot variants): Phi-3-mini 3.8B: 68.8; Phi-3-small 7B: 75.7; Phi-3-medium 14B: 78.0.
  • MT-bench (score): phi-3-mini: 8.38; phi-3-small: 8.7; phi-3-medium: 8.9.
  • GSM-8K (8-shot; CoT): Phi-3-mini 3.8B: 82.5; Phi-3-small 7B: 89.6; Phi-3-medium 14B: 91.0.
  • HumanEval (0-shot): Phi-3-mini 3.8B: 58.5; Phi-3-small 7B: 61; Phi-3-medium 14B: 62.2.
  • Average/aggregate scores reported:
      • Average (broad set): Phi-3-mini 3.8B: 69.7; Phi-3-small 7B: 73.6; Phi-3-medium 14B: 76.7.
      • Another reported average set: Phi-3.5-mini 3.8B: 61.1; Phi-3.5-MoE 16x3.8B: 69.2.
      • RepoQA (average score): Phi-3.5-MoE: 85; Phi-3.5-Mini: 77.
  • Internal RAI (multi-turn conversation) metrics (examples):
      • Ungroundedness: 0.603 for phi-3-mini; 0.299 for phi-3-small; 0.213 for phi-3-medium; 0.228 for phi-3.5-MoE; 1.481 for phi-2.
      • Third Party Harm (DR-1): 0.24 for phi-3-mini; 0.253 for phi-3-small; 0.251 for phi-3-medium; 0.105 for phi-3.5-MoE.
      • Harmful Content Continuation (DR-3): 0.007 for phi-3-mini; 0.003 for phi-3-small; 0.01 for phi-3-medium; 0.005 for phi-3.5-MoE.
      • Jailbreak (DR-1): 0.123 for phi-3-mini; 0.107 for phi-3-small; 0.111 for phi-3-medium; 0.106 for phi-3.5-MoE.
  • Multimodal benchmarks and vision metrics:
      • Single-image and multi-image benchmark suites report competitive or leading performance for vision-enabled variants, with example scores such as MMMU val: 43, ScienceQA test: 91.3, MMBench dev-en: 81.9, POPE test: 86.1, ChartQA test: 81.8, and TextVQA test: 72.
      • Vision model internal comparisons: Phi-3.5-Vision 3.8B+0.3B: 8.16 (internal private score) vs. Llava-1.6 Vicuna 7B+0.3B: 5.44 and GPT4-V: 8.55.
  • Reported strengths: performance on mobile devices, significant multilingual improvements over phi-3-mini in some configurations, and measurable improvement on Responsible AI (RAI) benchmarks after safety post-training.
  • Reported weaknesses: a performance drop on RULER with the 128K context window, limited capacity for storing factual knowledge in some settings, low performance on TriviaQA in some runs, language coverage restricted mostly to English in specific variants, challenges in high-level reasoning, and occasional ungrounded outputs.

Comparators across many benchmarks include Mixtral series, GPT-3.5, Gemini-1.5-Flash, GPT-4o-mini, Mistral, Gemma, Llama-3 variants, and others; reported competitor numbers are included in the detailed benchmark tables cited above.

Limitations and open questions

Reported limitations and caveats include data and behavior issues observed during development and evaluation:

  • Lack of high-quality long-context data in mid-training and related challenges handling very long contexts.
  • Factual inaccuracies and hallucinations, including reproduction or amplification of biases.
  • Potential for inappropriate content generation and occasional failures to refrain from answering harmful or sensitive inquiries.
  • Difficulties with deciphering particular types of captcha and describing scam images containing disinformation or hallucination.
  • Performance drops on specific RULER tasks when using a 128K context window.

These issues were noted in conjunction with the safety-oriented post-training and red‑teaming processes; safety post-training is reported to reduce harmful response rates substantially across RAI assessments.

Notable numbers and deployment notes

A few operational and deployment-oriented figures called out:

  • "phi-3-mini can be quantized to 4-bits occupying ≈ 1.8GB of memory."
  • "phi-3-mini achieves more than 12 tokens per second on iPhone 14."
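The ~1.8GB figure is consistent with simple weight arithmetic (parameters × bits per weight), which this sketch performs while ignoring activation memory and quantization metadata such as scales and zero-points:

```python
def quantized_weight_gib(n_params, bits_per_weight=4):
    """Back-of-envelope weight memory: params x bits / 8 bytes, in GiB.

    Ignores activations and quantization metadata (scales/zero-points),
    so the true on-device footprint is somewhat larger.
    """
    return n_params * bits_per_weight / 8 / 2**30

# ~1.77 GiB for 3.8B parameters at 4 bits, consistent with the cited ~1.8GB.
print(round(quantized_weight_gib(3.8e9), 2))
```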

Other practical details: context windows vary widely by variant (reported examples include 4K, 8,192, and 128,000 tokens), tokenizers follow Llama-2 or tiktoken conventions depending on variant, and vocabulary sizes differ accordingly (examples include 32064 and 100352).

Methods and terminology emphasized

Several methods and components are repeatedly emphasized across the family:

  • Blocksparse attention alternating with dense attention layers to reduce KV-cache memory and improve speed.
  • LongRope for extended context handling.
  • Direct preference optimization (DPO) for preference alignment after SFT.
  • Use of Mixture-of-Experts (MoE) for some larger configurations to increase parameter capacity with reduced activated parameter counts.
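A toy top-2 router illustrates how an MoE layer activates only a fraction of its parameters per token, which is how a model like phi-3.5-MoE keeps activated parameters (6.6B) far below total parameters (42B). Expert count, shapes, and the top-2 choice below are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

def moe_top2_forward(x, gate_w, experts):
    """Toy top-2 MoE layer: route each token to its 2 highest-scoring experts.

    x: (tokens, d); gate_w: (d, n_experts); experts: list of (w_in, w_out)
    weight pairs. Only 2 of n_experts run per token.
    """
    logits = x @ gate_w                              # (tokens, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -2:]       # indices of best 2 experts
    # Softmax over just the selected experts' gate scores.
    sel = np.take_along_axis(logits, top2, axis=-1)
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(2):
            w_in, w_out = experts[top2[t, slot]]
            h = np.maximum(x[t] @ w_in, 0.0)         # simple ReLU FFN expert
            out[t] += weights[t, slot] * (h @ w_out)
    return out
```

Total parameter count grows with the number of experts, but per-token compute stays fixed at two expert FFNs plus the router.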

These engineering and alignment choices are presented as central to achieving competitive performance at smaller on-device scales while attempting to improve safety and multimodal reasoning capabilities.

Sources

https://arxiv.org/abs/2404.14219v4