WaveCoder — Widespread and Versatile Enhancement for Code Large Language Models
Overview and Goals
WaveCoder is an instruction-tuning approach and family of code-focused model variants developed in association with Tsinghua University and Microsoft. The stated objective is to improve the generalization ability of Code LLMs across multiple code-related tasks: code generation from natural-language descriptions, code summarization, error identification and repair, and translation between programming languages. The project emphasizes improving the quality and controllability of generated instruction data to better support complex multi-task scenarios where existing instruction-tuning methods are claimed to fall short.
Key Components and Contributions
WaveCoder’s workflow centers on creating high-quality, diverse instruction data and using it to fine-tune several base code models. The principal contributions and artifacts reported are:
- CodeSeaXDataset containing 19,915 instruction instances across 4 code-related tasks.
- A widespread and versatile enhanced instruction generation method implemented as a Generator-Discriminator framework.
- Use of GPT-3.5 for instruction generation and GPT-4 as an LLM-based discriminator, guided by rules inspired by Zero-shot-CoT and by few-shot examples that include both good and bad cases.
- Combination of CodeSeaXDataset with the Magicoder-evol-codealpaca dataset to create a 130K training set.
- Fine-tuning of base models for 3 epochs with parallel training strategies (Tensor Parallel and FSDP) and explicit hyperparameters for different base models.
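The Generator-Discriminator loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: `generator` and `discriminator` are hypothetical stand-ins for the GPT-3.5 and GPT-4 API calls, and the prompt wording is assumed.

```python
def generate_candidates(seed_code, generator, n=4):
    """Ask the generator LLM (GPT-3.5 in WaveCoder) for n candidate
    instructions derived from a piece of seed code."""
    prompt = f"Write a code-related instruction for this snippet:\n{seed_code}"
    return [generator(prompt) for _ in range(n)]

def passes_discriminator(candidate, discriminator, rules):
    """LLM-based discriminator (GPT-4 in WaveCoder): judge a candidate
    against the quality rules and keep it only on a 'yes' verdict."""
    verdict = discriminator(f"Rules:\n{rules}\n\nCandidate:\n{candidate}\n\nGood? yes/no:")
    return verdict.strip().lower().startswith("yes")

def build_instruction_set(seeds, generator, discriminator, rules):
    """Full pipeline: generate candidates per seed snippet, then keep
    only the ones the discriminator accepts."""
    kept = []
    for code in seeds:
        for cand in generate_candidates(code, generator):
            if passes_discriminator(cand, discriminator, rules):
                kept.append(cand)
    return kept
```

In the real pipeline the discriminator's rules also incorporate the Zero-shot-CoT-inspired reasoning steps and the positive/negative few-shot examples mentioned above.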
Dataset and Instruction Generation Method
The training data strategy combines an initial curated dataset and an LLM-enhanced expansion pipeline:
- The initial curated dataset, CodeSeaXDataset, contains 19,915 instructions (described as a 20K dataset) covering four common code-related task types: code generation, summarization, repair, and translation.
- The instruction-generation pipeline is a Generator-Discriminator framework. GPT-3.5 is used to generate candidate instructions and GPT-4 is used as an LLM-based discriminator to assess quality according to established rules. The rules are inspired by Zero-shot-CoT reasoning patterns and training uses both positive and negative few-shot examples to guide the discriminator.
- A KCenterGreedy sampling algorithm is used to promote data diversity in dataset construction.
- The enhanced instruction data produced by the generator-discriminator pipeline was combined with the Magicoder-evol-codealpaca dataset to form the larger 130K dataset used for final tuning.
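KCenterGreedy is a max-min greedy selection: starting from one point, it repeatedly adds the point farthest from everything chosen so far, which spreads the selection across embedding space and promotes diversity. A minimal pure-Python sketch (the paper's exact implementation and embedding model are not specified; points here are plain coordinate tuples):

```python
from math import dist  # Euclidean distance between two coordinate sequences (Python 3.8+)

def k_center_greedy(points, k, first=0):
    """Greedily select k diverse points: at each step, pick the point
    whose distance to its nearest already-selected center is largest."""
    selected = [first]
    # nearest-center distance for every point, initialized from the first center
    nearest = [dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        idx = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(idx)
        # update each point's distance to its nearest selected center
        nearest = [min(d, dist(p, points[idx])) for d, p in zip(nearest, points)]
    return selected
```

In the WaveCoder pipeline the "points" would be embeddings of instruction instances, so the selected subset covers the instruction space rather than clustering in one region.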
Models, Variants, and Bases
WaveCoder is presented as a family of variants built on different bases and evaluated against a set of base and instruction-tuned baselines. The numbers attached to several entries (130K, 20K, 78K, 75K, 13K) appear to be instruction-data sizes, matching the reported training-set sizes, rather than context lengths:
- WaveCoder-Ultra-6.7B (6.7B parameters; fine-tuned on the combined 130K instruction set).
- WaveCoder-Pro-6.7B (6.7B parameters; fine-tuned on a 20K instruction set).
- WaveCoder-DS-6.7B, built on DeepseekCoder-Base-6.7B.
- WaveCoder-SC-15B, built on the 15B StarCoder base.
- WaveCoder-CL-7B and WaveCoder-CL-13B, built on CodeLLaMa 7B and 13B.
- Baselines include StarCoder (15B), CodeLLaMa (7B / 13B) and their instruct variants, OctoCoder (15B, 13K instructions), WizardCoder (15B, 78K instructions), and Magicoder-DS (6.7B, 75K instructions).
Reported parameter counts span 6.7B, 7B, 13B, and 15B; the WaveCoder variants derive from StarCoder, CodeLLaMa, or DeepseekCoder bases.
Architecture and Training Practices
Detailed architecture-level specifications (layers, hidden sizes, MLP sizes, attention/KV heads, dense vs. MoE choices) are not specified in the available material. Notable reported design and training choices:
- KCenterGreedy algorithm used for data selection to increase diversity.
- Fine-tuning for 3 epochs across the base models.
- Parallelization and distribution: Tensor Parallel for StarCoder-15B, CodeLLaMa-7B, and CodeLLaMa-13B; Fully Sharded Data Parallel (FSDP) for DeepseekCoder-6.7B.
- Hardware: training used NVIDIA A100-80GB GPUs.
Training hyperparameters reported for fine-tuning:
- Initial learning rate: 2e-5 for StarCoder-15B, CodeLLaMa-7B, and CodeLLaMa-13B; 5e-5 for DeepseekCoder-6.7B.
- Global batch size: 256 for StarCoder and CodeLLaMa variants; 512 for DeepseekCoder-6.7B.
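The reported hyperparameters can be collected into a small per-base-model lookup (illustrative only; the field names and structure are assumptions, not the paper's training code):

```python
# Reported WaveCoder fine-tuning settings, keyed by base model (illustrative sketch).
FINETUNE_CONFIGS = {
    "StarCoder-15B":      {"lr": 2e-5, "global_batch_size": 256, "epochs": 3, "parallelism": "tensor_parallel"},
    "CodeLLaMa-7B":       {"lr": 2e-5, "global_batch_size": 256, "epochs": 3, "parallelism": "tensor_parallel"},
    "CodeLLaMa-13B":      {"lr": 2e-5, "global_batch_size": 256, "epochs": 3, "parallelism": "tensor_parallel"},
    "DeepseekCoder-6.7B": {"lr": 5e-5, "global_batch_size": 512, "epochs": 3, "parallelism": "fsdp"},
}

def config_for(base_model):
    """Look up the reported fine-tuning settings for a base model."""
    return FINETUNE_CONFIGS[base_model]
```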
Tokenizer type, vocabulary size, chat/prompt templates, and system prompt specifics are not reported.
Fine-tuning and Post-training
Supervised fine-tuning with the enhanced instruction data used the 20K CodeSeaXDataset as described and the expanded 130K combined dataset. Beyond this supervised instruction tuning, no preference-alignment or reinforcement-learning-from-human-feedback steps are described.
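The supervised objective is next-token prediction on the instruction responses. A common implementation detail in such pipelines, assumed here rather than confirmed by the source, is to mask prompt tokens out of the loss using the Hugging Face convention of label `-100`:

```python
IGNORE_INDEX = -100  # Hugging Face convention: positions with this label are skipped by the loss

def build_sft_labels(prompt_ids, response_ids):
    """Compute loss only on the response: prompt positions get
    IGNORE_INDEX, response positions keep their token ids as labels."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
```

This ensures the model is trained to produce answers, not to reproduce the instructions themselves.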
Evaluation: Benchmarks and Results
Evaluation highlights emphasize improved generalization across code-related tasks and multiple benchmark results are reported. Headline claims assert that WaveCoder models “significantly outperform other open-source models in terms of generalization ability” and achieve “state-of-the-art generalization performance on different code-related tasks,” while also noting they remain behind some SoTA models.
Selected reported benchmark outcomes (preserving reported numbers exactly), given as HumanEval / MBPP pass@1 with gains over the corresponding base model in parentheses; the headline result is 72.0% pass@1 on HumanEval:
- GPT-4: 85.4 / 67.0
- ChatGPT: 73.2 / 48.1
- StarCoder-15B: 33.6 / 43.3
- OctoCoder: 46.2 / 43.5
- WizardCoder: 57.3 / 51.8
- WaveCoder-SC-15B: 50.5 (+16.9) / 51.0 (+7.4)
- CodeLLaMa-7B: 33.5 / 41.4
- CodeLLaMa-instruct-7B: 34.8 / 44.4
- WaveCoder-CL-7B: 48.1 (+14.6) / 47.2 (+5.8)
- CodeLLaMa-13B: 36.0 / 47.0
- CodeLLaMa-instruct-13B: 42.5 / 49.4
- WaveCoder-CL-13B: 55.4 (+19.4) / 49.6 (+2.6)
- DeepseekCoder: 49.4 / 60.6
- Magicoder-DS: 66.5 / 60.4
- WaveCoder-DS-6.7B: 64.0 (+14.6) / 62.8 (+2.2)
- WaveCoder-Pro-6.7B: 72.0 (+22.6) / 63.6 (+3.0)
- DeepseekCoder-instruct: 73.8 / 62.8
- Magicoder-S-DS: 76.8 / 64.6
- WaveCoder-Ultra-6.7B: 78.6 (+29.2) / 64.4 (+3.8)
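The pass@1 figures above presumably follow the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021), where n samples are drawn per problem and c of them pass the tests; a minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n samples, c of which are correct,
    passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-draw with all-incorrect samples
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of correct samples, c / n.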
HumanEval variants:
- HumanEvalFix (pass@1): WaveCoder-DS-6.7B: 49.4%, GPT-4: 47.8%; comparators include WizardCoder: 25.7%, OctoCoder: 27.0%.
- HumanEvalExplain (pass@1): WaveCoder-DS-6.7B: 41.3%, WizardCoder: 27.5%, OctoCoder: 24.5%.
- HumanEvalExplain per-language pass@1 averages reported: 56.7 (Python), 50.0 (JavaScript), 54.3 (Java), 34.8 (Go), 51.2 (C++), 36.6 (Rust), 47.3 (avg).
Task distribution and base-model comparison:
- Task breakdown of the 19,915-instance CodeSeaXDataset (the four counts sum to 19,915, so this is the training-data distribution rather than a CodeXGLUE result):
  - Code Generation: 11,370 samples (57.1%)
  - Code Summarization: 3,165 samples (15.8%)
  - Code Repair: 3,144 samples (15.8%)
  - Code Translation: 2,236 samples (11.2%)
- WaveCoder-DS-6.7B vs. its base, DeepseekCoder-Base-6.7B (scores):
  - HumanEval: 64.0 vs. 49.4
  - HumanEvalFix (avg.): 49.4 (+20.4) vs. 29.0
  - HumanEvalExplain (avg.): 41.3 (+7.3) vs. 34.6
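The four task counts sum exactly to the 19,915-instance dataset total, which is a quick consistency check on the reported distribution:

```python
# Reported per-task sample counts for the instruction dataset.
counts = {
    "Code Generation": 11_370,
    "Code Summarization": 3_165,
    "Code Repair": 3_144,
    "Code Translation": 2_236,
}
total = sum(counts.values())  # 19915
shares = {task: round(100 * n / total, 1) for task, n in counts.items()}
```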
Reported evaluation conclusions highlight where WaveCoder “wins” and where it is weaker:
- Where it excels: outstanding generalization ability driven by widespread and versatile enhanced instruction tuning; claims of outperforming open-source models on code-related tasks and best performance when training includes all four code-related task types.
- Where it is weaker: still reportedly behind SoTA Code LLMs in some respects; omission of any task type from training can cause HumanEval score drops; limited improvements attributed to the relatively small training dataset.
Limitations and Caveats
Reported limitations and caveats include:
- Data leakage issues identified between HumanEval and the Magicoder-evol-codealpaca dataset.
- The training dataset includes only 19,915 instructions in the initial CodeSeaXDataset, which is noted as limited scale and a target for future expansion.
- Future directions suggested: broader coverage of more code-related task types and scaling to larger datasets.
An external numeric reference included in the report notes that CodeSearchNet contains 2 million (comment, code) pairs.
Summary of Methods and Practical Notes
The WaveCoder approach centers on high-quality instruction generation and targeted fine-tuning. Its core procedural elements are a curated 20K instruction dataset, an LLM-based generator-discriminator pipeline (GPT-3.5 generation, GPT-4 discrimination), KCenterGreedy sampling for diversity, and fine-tuning with concrete hyperparameters and distributed training strategies. Reported benchmark numbers suggest substantial gains in code generalization over many open-source baselines, with model variants evaluated across HumanEval, MBPP, CodeXGLUE, and the HumanEvalFix/HumanEvalExplain variants.