
WaveCoder Model Documentation

Overview

Model Names: WaveCoder-Pro-6.7B, WaveCoder-DS-6.7B, WaveCoder-Ultra-6.7B (fine-tuned from DeepseekCoder-Base-6.7B)
Category: Code Large Language Models, Code Generation

WaveCoder is a family of fine-tuned Code LLMs designed to improve performance on code-related tasks, particularly in complex multi-task scenarios. It addresses the limitations of existing instruction tuning methods by producing high-quality, diverse instruction data.

Problem Statement

Challenges Addressed

  • Performance Enhancement: WaveCoder significantly improves the performance of Code LLMs in complex multi-task scenarios.
  • Data Quality: It addresses the issue of low data quality and redundancy in existing instruction datasets, which often leads to poor model performance.
  • Generalization: The model enhances the generalization ability of pre-trained models for code-related tasks.

Limitations of Existing Methods

  • Current instruction tuning techniques primarily focus on traditional code generation tasks, often resulting in duplicate instruction instances.
  • Existing open-source Code LLMs do not achieve state-of-the-art generalization performance.

Key Contributions

  • CodeSeaXDataset: Introduces a dataset comprising 19,915 instruction instances spanning four code-related tasks (code generation, code summarization, code translation, and code repair); a hypothetical example instance is sketched after this list.
  • Enhanced Instruction Generation: Proposes a versatile method for generating diverse and high-quality instruction data tailored to specific task requirements.
  • Evaluation Performance: Demonstrates exceptional performance on benchmarks such as HumanEval and MBPP.
  • Multi-task Integration: Incorporates multiple code-related tasks into the training data, significantly enhancing model performance.
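
The released schema of CodeSeaXDataset is not reproduced here; the snippet below is a purely hypothetical illustration of what a single instruction instance for one of the four tasks might look like, with all field names assumed.

```python
# Hypothetical illustration only: the field names and content are assumptions,
# not the actual CodeSeaXDataset schema.
example_instance = {
    "task": "code repair",  # one of the four code-related tasks
    "instruction": "Fix the bug in the following function so it returns the sum of the list.",
    "input": "def total(xs):\n    s = 0\n    for x in xs:\n        s -= x\n    return s",
    "output": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s",
}
```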

Model Variants

WaveCoder Variants

  1. WaveCoder-Pro-6.7B:
     • Focus: Code generation.
     • Achievements: 72.0% pass@1 on HumanEval, 63.6% on MBPP.
     • Training: Utilizes the GPT-4 enhanced CodeSeaXDataset.

  2. WaveCoder-DS-6.7B:
     • Focus: Code generation and related tasks.
     • Improvements: Enhanced performance on code-related tasks through data refinement and diversification.

  3. WaveCoder-Ultra-6.7B:
     • Focus: General code-related tasks.
     • Achievements: State-of-the-art generalization capabilities across a wide range of code-related tasks.
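
As a usage sketch, the released variants can be loaded like any decoder-only checkpoint with the Hugging Face transformers library; the model ID below is an assumption and should be checked against the actual release.

```python
# Minimal inference sketch. The Hub ID is assumed, not confirmed here; any
# causal-LM checkpoint compatible with AutoModelForCausalLM loads the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/wavecoder-ultra-6.7b"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```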

Methodology

Algorithm Framework

  • Generator-Discriminator Model: Utilizes a framework that combines a generator for instruction creation and a discriminator for data quality assurance.
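
The paper's prompts and filtering rules are not reproduced here; the sketch below only illustrates the shape of such a generator-discriminator loop, with `call_llm` standing in for any chat-completion client and the prompts being placeholders.

```python
# Minimal sketch of a generator-discriminator data pipeline; prompts and the
# acceptance criterion are placeholders, not the paper's exact templates.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def generate_instance(raw_code: str) -> str:
    # Generator: turn a raw open-source code snippet into an instruction instance.
    return call_llm(f"Write a task instruction and solution based on this code:\n{raw_code}")

def accepted(instance: str) -> bool:
    # Discriminator: an LLM judges the generated instance against quality rules.
    verdict = call_llm(
        "Answer YES or NO: is this instruction instance correct, "
        f"self-contained, and non-trivial?\n{instance}"
    )
    return verdict.strip().upper().startswith("YES")

def build_dataset(code_snippets):
    # Keep only instances the discriminator accepts.
    return [inst for inst in (generate_instance(c) for c in code_snippets) if accepted(inst)]
```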

Techniques Employed

  • Enhanced Instruction Data Generation: Generates diverse instruction data from open-source code datasets, addressing poor performance in complex tasks.
  • KCenterGreedy Algorithm: Ensures data diversity by selecting representative samples from the dataset (see the sketch after this list).
  • LLM-based Discriminator: Analyzes and filters instruction data to maintain high quality.
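
KCenterGreedy (k-center greedy) is a standard coreset-selection algorithm; a minimal sketch over embedding vectors follows, with the embedding step itself omitted.

```python
# Sketch of k-center greedy selection over embedding vectors (a standard
# coreset algorithm); how the raw code is embedded is not shown here.
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Pick k indices whose points spread out over the embedding space."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # start from a random center
    # Distance of every point to its nearest selected center.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < k:
        next_idx = int(np.argmax(dists))  # farthest point becomes the next center
        selected.append(next_idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[next_idx], axis=1))
    return selected
```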

Evaluation Metrics

Benchmarks and Performance

  • HumanEval:
    • WaveCoder-Pro-6.7B: 72.0% pass@1
    • WaveCoder-DS-6.7B: 64.0% pass@1 when all four tasks are included in training
  • MBPP:
    • WaveCoder-Pro-6.7B: 63.6%
  • Comparative Performance: WaveCoder models consistently outperform other open-source models of similar scale, including WizardCoder and OctoCoder, across the evaluated code-related tasks.
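
For reference, pass@k on these benchmarks is usually computed with the unbiased estimator from the HumanEval benchmark: generate n samples per problem, count the c that pass the unit tests, and average 1 - C(n-c, k)/C(n, k) over problems.

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
# c of which pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64.0% pass@1 corresponds to 64 of 100 samples passing per problem on average.
print(pass_at_k(n=100, c=64, k=1))  # 0.64
```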

Limitations

  • Dataset Size: The training dataset contains only 19,915 instructions, which may limit further gains and the model's generalization ability.
  • Performance Gaps: While WaveCoder excels among open-source models, it still lags behind proprietary models and state-of-the-art open-source counterparts.

Conclusion

WaveCoder represents a significant advancement in code generation and multi-task learning for code-related tasks. By focusing on data quality and diversity, it addresses key limitations of existing models, paving the way for improved performance in complex coding scenarios.

Sources

https://arxiv.org/abs/2312.14187v5