Code Llama family — Model overview and fine‑tuning approach
Model family and positioning
The Code Llama family is a set of models for code generation, infilling, and related programming tasks. It emphasizes long-context handling for real downstream applications such as real-time completion in source-code editors, docstring generation, type inference, and generation of in-code documentation. The family includes base pretrained models and specialized variants for instruction following and Python-centric code generation, which inherit Llama 2's instruction-following and safety properties while being fine-tuned and specialized for coding tasks.
Key motivations behind the design include improving infilling capabilities beyond standard autoregressive next-token prediction, supporting very large contexts, reducing dependence on costly human-supervised data for coding tasks, and better representing multilingual programming scenarios that prior models did not fully address.
Variants and model sizes
The family spans base, instruction-tuned, and Python-specialized variants. The model sizes reported across releases are:
- 7B
- 13B
- 34B
- 70B
Variants explicitly named across releases include Code Llama, Code Llama - Python, and Code Llama - Instruct. Context-length targets reported include 16,384 tokens (for some releases) and up to 100,000 tokens after long-context fine-tuning.
Architecture and notable design choices
Code Llama models are initialized from Llama 2 model weights and retain causal transformer characteristics while incorporating several modifications to support long contexts and infilling:
- Fine-tuned to handle long contexts through a distinct Long-Context Fine-Tuning stage (LCFT).
- Use of RoPE (rotary positional embeddings) for positional information, with a modification: the RoPE base period is increased to θ = 10^6 (from the Llama 2 default of 10^4), alongside other RoPE hyper-parameter changes, to enable extrapolation to longer contexts.
- A multitask training objective that combines autoregressive next-token prediction with an infilling training objective to improve internal completion and context-aware generation.
- Causal masking adapted for infilling tasks.
- Training mixes a predominantly code-focused dataset with a small but deliberate fraction of natural-language data to preserve general capabilities and safety; one reported configuration samples 6% from the code dataset and 2% natural language to avoid regression.
- Safety-focused fine-tuning and red-teaming exercises with domain experts; models are trained to prioritize safety while maintaining coding performance.
No detailed layer counts, hidden sizes, or specific MLP/attention head counts are provided in the reported material.
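The effect of the RoPE base change can be illustrated with a minimal sketch (a hypothetical illustration, not the released implementation): raising the base θ lowers the rotation frequencies of the higher-indexed dimension pairs, so distant positions stay distinguishable at long context lengths.

```python
def rope_inv_freq(dim: int, theta: float) -> list[float]:
    # Inverse frequencies for rotary position embeddings:
    # one frequency per pair of hidden dimensions.
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

# Llama 2 default base vs. the long-context value reported for Code Llama.
default_freqs = rope_inv_freq(dim=128, theta=10_000.0)
long_freqs = rope_inv_freq(dim=128, theta=1_000_000.0)

# A larger base yields slower-rotating (lower-frequency) components.
assert all(l <= d for d, l in zip(default_freqs, long_freqs))
```

In practice the same attention code is reused; only the base constant changes before long-context fine-tuning.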
Tokenizer and prompt format
The family uses a byte-pair encoding (BPE) tokenizer. Prompt formats and chat templates applied during instruction tuning or use include bracketed code sections such as "[PYTHON]" and "[/PYTHON]", and an InstructGPT-style format for instruction-following variants.
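A minimal sketch of the bracketed-tag convention mentioned above (the helper name is hypothetical, and the exact template varies by release and variant):

```python
def wrap_python(code: str) -> str:
    # Hypothetical helper: wrap a code section in the bracketed tags
    # reported for Python-specific prompting ("[PYTHON]" ... "[/PYTHON]").
    return f"[PYTHON]\n{code}\n[/PYTHON]"

prompt = (
    "Write a function that returns the n-th Fibonacci number.\n"
    + wrap_python("def fib(n):")
)
print(prompt)
```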
Pretraining and data mixture
Pretraining and data strategy emphasize code:
- The training corpus is described as code-heavy and predominantly composed of publicly available, open-source code.
- Token-volume figures reported across experiments and stages include 500B, 1T, 2T, 100B, and 5B tokens, as well as "2T tokens of text" and "500B extra tokens mostly of code"; a specific reported value for code tokens is "80B tokens of code".
- Natural language related to code comprises roughly 8% of the dataset in some descriptions; instruction-tuning and helpfulness/safety datasets (derived from Llama 2 instruction datasets and proprietary sources) also contribute.
- For one long-context fine-tuning pass, a small proportion of non-code data (6% code dataset and 2% natural language) was used to prevent regression.
For the infilling objective, training documents were split at the character level into prefix, middle, and suffix segments; the infilling transformation was applied with probability 0.9 in certain configurations.
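The character-level splitting and infilling transformation can be sketched as follows (a minimal illustration; the sentinel strings below are placeholders, not the actual special tokens, and tokenizer-level details are omitted):

```python
import random

def maybe_infill(doc, rate=0.9, rng=None):
    # With probability `rate`, split the document at the character level into
    # (prefix, middle, suffix) and emit it in prefix-suffix-middle order, so
    # the model learns to generate the middle from surrounding context.
    rng = rng or random.Random()
    if rng.random() >= rate or len(doc) < 2:
        return doc  # plain autoregressive sample
    i, j = sorted(rng.sample(range(len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # "<PRE>", "<SUF>", "<MID>" are placeholder sentinels for illustration.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"
```

Training on the reordered sample with the usual next-token loss teaches the model to complete a missing span given both sides of it.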
Optimization, hyperparameters, and training regimen
Reported optimizer and schedule details include:
- Optimizer: AdamW with β1 = 0.9 and β2 = 0.95.
- Cosine learning-rate schedule with 1000 warm-up steps; final learning rate equals 1/30th of the peak learning rate.
- Example peak learning rates referenced: 3e-4, 1.5e-4, 1e-4 (across different models/experiments).
- Batch and sequence regimes: a sequence length of 4,096 tokens for many pretraining runs; reported batch sizes include 4M tokens, 524,288 tokens for Code Llama - Instruct, 2M tokens for the 7B and 13B models, and 1M tokens for the 34B model.
- Long-context fine-tuning (LCFT) used a learning rate of 2e-5 and training durations reported as 10,000 gradient steps in examples.
- A reported probability of applying the infilling transformation is 0.9 for certain model variants.
Objectives: several sizes (7B, 13B, 70B in particular) were trained with the infilling objective; the family employs a multitask objective that explicitly includes infilling alongside standard autoregressive prediction.
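The reported schedule (1,000 warm-up steps, cosine decay to 1/30th of the peak learning rate) can be written as a small pure-Python function; the peak value and step counts below are examples from the ranges quoted above, not a definitive training configuration.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup=1000, final_ratio=1 / 30):
    # Linear warm-up to peak_lr over `warmup` steps, then cosine decay
    # down to peak_lr * final_ratio (1/30th of the peak, as reported).
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    floor = peak_lr * final_ratio
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))

assert abs(lr_at_step(999, 10_000) - 3e-4) < 1e-12       # end of warm-up
assert abs(lr_at_step(10_000, 10_000) - 3e-4 / 30) < 1e-12  # final LR
```

In a real run this value would be fed to AdamW (β1 = 0.9, β2 = 0.95) at each optimizer step.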
Posttraining and instruction tuning
Instruction-tuned variants (notably Code Llama - Instruct) use supervised fine-tuning and synthetic instruction data:
- Supervised fine-tuning (SFT) and instruction data sources include proprietary instruction datasets and machine-generated self-instruct datasets.
- The instruct variants are tuned to achieve zero-shot instruction-following ability.
- Self-instruct and machine-generated data were used to improve zero-shot MBPP scores in reported experiments.
Preference alignment via RLHF or ranking-based preference tuning is not explicitly described in the reported material.
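A common pattern in self-instruct pipelines of this kind, kept hypothetical here since the source does not detail the filtering step, is to retain a machine-generated solution only when it passes its machine-generated unit tests:

```python
def passes_tests(solution_code: str, test_code: str) -> bool:
    # Execute a candidate solution and its generated unit tests in a shared
    # namespace; keep the triplet only if every assertion passes.
    # (A real pipeline would sandbox this execution.)
    namespace = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)
    except Exception:
        return False
    return True

good = passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
bad = passes_tests("def add(a, b):\n    return a - b", "assert add(2, 3) == 5")
```

Execution-based filtering of this sort is one way to build supervised data without human labeling, consistent with the stated goal of reducing dependence on costly human-supervised data.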
Evaluation highlights and benchmarks
Performance on standard code benchmarks and safety/behavior metrics is reported extensively. Headline results claim state-of-the-art among open-source models on HumanEval, MBPP, and MultiPL-E; the 70B model is reported as state-of-the-art on standard Python completion benchmarks.
Representative benchmark highlights:
- HumanEval pass@1 / pass@10 / pass@100 for notable models:
  - Code Llama (7B): 33.5% / 59.6% / 85.9%
  - Code Llama (13B): 36.0% / 69.4% / 89.8%
  - Code Llama (34B): 48.8% / 76.8% / 93.0%
  - Code Llama (70B): 53.0% / 84.6% / 96.2%
  - Code Llama - Instruct (70B): 67.8% / 90.3% / 97.3%
  - Code Llama - Python (70B): 57.3% / 89.3% / 98.4%
- MBPP pass@1 / pass@10 / pass@100 highlights:
  - Code Llama (7B): 41.4% / 66.7% / 82.5%
  - Code Llama (70B): 62.4% / 81.1% / 91.9%
  - Code Llama - Python (70B): 65.6% / 81.5% / 91.9%
- Long Code Completion (LCC) improvements with LCFT (Exact Match / BLEU, non-LCFT → LCFT):
  - Code Llama 7B: 36.86 / 60.16 → 39.23 / 61.84
  - Code Llama 13B: 37.96 / 61.33 → 41.06 / 62.76
  - Code Llama 34B: 42.52 / 63.74 → 44.89 / 65.99
- Multilingual and comparative performance:
  - Multilingual HumanEval accuracy shows progressive scale gains: Code Llama (7B) 26.3%, (13B) 30.6%, (34B) 36.4%, (70B) 45.3%.
  - Code Llama variants outperform many comparable public models on MultiPL-E and other multilingual coding benchmarks.
- Safety and truthfulness metrics:
  - TruthfulQA: improvement reported from 34.64 to 47.37 for the 34B instruct-tuned model.
  - ToxiGen: toxic-generation percentages reduced to virtually 0% for instruct-tuned sizes (Code Llama - Instruct 34B: 0.00%; 13B: 0.01%; 7B: 0.04%).
  - BOLD: higher average sentiment scores for instruct variants (for example, Code Llama - Instruct 7B: 0.503 vs. Code Llama 7B: 0.230), indicating more positive sentiment across demographic groups.
Comparative baselines such as Llama 2 family, StarCoder variants, Codex (code-cushman-001), GPT-3.5, and GPT-4 appear throughout the benchmark tables.
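The pass@k figures in these tables are conventionally computed with the unbiased estimator introduced alongside HumanEval: given n generated samples of which c pass the tests, it estimates the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    # probability that k samples drawn without replacement all fail.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples with 67 passing gives an estimated pass@1 of 0.335.
print(round(pass_at_k(200, 67, 1), 3))
```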
Strengths and failure modes
Strengths reported include specialization gains for coding tasks, strong multilingual coding performance, long-context capabilities up to 100,000 tokens after LCFT, state-of-the-art open-source performance on several code benchmarks, and substantial reductions in toxic outputs for instruct-tuned variants.
Weaker areas and observed trade-offs include modest degradations on some standard short-sequence benchmarks when long-context or safety/instruction constraints are applied, occasional instabilities in downstream performance for certain configurations, and a reported decrease in pass@1 on some benchmarks after specific fine-tuning stages (e.g., average decreases of 0.52 percentage points on HumanEval and 1.9 points on MBPP in some ablations). In some configurations the infilling objective slightly reduces short-context benchmark scores, and smaller models (e.g., 7B) may struggle with key retrieval when function definitions appear at the beginning of long prompts.
Limitations, caveats, and notable figures
Reported limitations and caveats include instabilities in downstream performance for certain configurations. Notable numbers and dataset generation figures include:
- Approximately 14,000 question-tests-solution triplets generated.
- 62,000 interview-style programming questions generated, de-duplicated to roughly 52,000 questions.
- An infilling rate of up to 90%.
- Reported average performance drops associated with some changes: "7B model loses 0.6 percentage points" and "13B model loses 1.1 percentage points."
These figures reflect dataset creation, augmentation, and measured impacts noted in evaluation summaries.
Summary
The Code Llama family is a code-focused extension of Llama 2 weights that combines a multitask infilling and autoregressive objective, RoPE-based positional encoding modifications, and a dedicated long-context fine-tuning stage to support very large contexts and improved infilling. Instruction tuning and curated data mixtures drive substantial gains in both code generation quality and safety metrics, with clear scale-based performance improvements across 7B–70B sizes and strong benchmark leadership among open-source models.