Chinese-Vicuna

Overview and Positioning

Chinese-Vicuna is a family of instruction-tuned models derived from the LLaMA architecture and adapted to Chinese conversational and domain-specific tasks. The project aims to close the gap in Chinese instruction-following capability while remaining resource-efficient enough to run on consumer GPUs. Key motivations include reducing the resource intensity of existing models and addressing the lack of domain-specific adaptation for Chinese, particularly in healthcare and law. The approach emphasizes parameter-efficient fine-tuning and practical deployment workflows.

Key contributions reported include:

  • Fine-tuning the LLaMA architecture using LoRA (Low-Rank Adaptation) to enable parameter-efficient adaptation.
  • Support for domain-specific adaptation in healthcare and legal workflows.
  • Enabling cost-effective deployment on consumer GPUs through quantization and optimized training.
  • Utilization of diverse hybrid datasets (BELLE, Guanaco, ShareGPT, translated Alpaca) for instruction tuning in Chinese.
  • Enhanced multilingual and conversational capabilities with reported performance comparable to ChatGPT on Chinese tasks.
  • Open-source tools and documented implementations for quantization and deployment.

Model Variants and Specializations

Primary model sizes documented are Chinese-Vicuna 7B and Chinese-Vicuna 13B, both presented as instruction-tuned variants for Chinese. Several specialized or continued-fine-tune variants are reported, including names such as Chinese-Vicuna-medical7B, Chinese-Vicuna-continue-finetune-7epoch-cMedQA2, Chinese-Vicuna-7b-legal-lora, Chinese-Vicuna-Legal, and Ours-7b-chatv1. Some variants are explicitly targeted at medical and legal domains.

Training hyperparameters reported for the 7B/13B family (from the 7B Model / 13B Model entry) include: BATCH_SIZE: 128, EPOCHS: 3, LEARNING_RATE: 3 × 10^-4, CUTOFF_LEN: 256, LORA_R: 8, LORA_ALPHA: 16, LORA_DROPOUT: 0.05, TARGET_MODULES: {q_proj, v_proj}, USE_8bit: True. Language support is reported as Chinese.
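The reported LoRA settings (LORA_R: 8, TARGET_MODULES: {q_proj, v_proj}) imply a very small trainable footprint. The sketch below is an illustration, not project code; it assumes the standard LLaMA-7B shape (hidden size 4096, 32 decoder layers, square q/v projections) and counts the trainable LoRA parameters:

```python
# Hedged sketch: trainable-parameter count for the reported LoRA config.
# Assumed LLaMA-7B shape: hidden_size=4096, 32 layers; q_proj and v_proj
# treated as square 4096x4096 projections (an assumption for illustration).
HIDDEN, LAYERS, R = 4096, 32, 8
TARGET_MODULES = 2  # q_proj and v_proj, per the reported config

# Each LoRA adapter adds A (r x hidden) and B (hidden x r) to one projection.
params_per_adapter = R * HIDDEN + HIDDEN * R
trainable = params_per_adapter * TARGET_MODULES * LAYERS

full_matrix_params = HIDDEN * HIDDEN * TARGET_MODULES * LAYERS
print(trainable)                       # 4194304 trainable LoRA parameters
print(trainable / full_matrix_params)  # 0.00390625, i.e. ~0.4% of the targeted weights
```

Under these assumptions, LoRA trains roughly 4.2M parameters instead of the ~1B parameters in the targeted projection matrices alone, which is what makes consumer-GPU fine-tuning feasible.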

Architecture and Design Choices

The base architecture is LLaMA. Notable design and engineering choices focus on modular, parameter-efficient adaptation and quantization for deployment:

  • Low-Rank Adaptation (LoRA) is used for fine-tuning to reduce the number of trainable parameters and enable faster, cheaper adaptation to downstream tasks.
  • Support for 8-bit and 4-bit quantization is included: an 8-bit configuration is used for the 7B model, and a 4-bit configuration (referred to as QLoRA) is used for the 13B model.
  • The system is described as modular to support domain-specific adaptation (medical, legal), including continued fine-tuning on domain datasets such as cMedQA2.
  • Additional design elements mentioned include integration of RLHF for alignment and dynamic knowledge retrieval to mitigate temporal data gaps.
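The 8-bit scheme used for the 7B model can be illustrated with a generic symmetric round-to-nearest int8 quantizer. This NumPy sketch is an illustrative stand-in for the idea, not the optimized kernels (e.g., bitsandbytes) a real deployment would use:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row int8 quantization: w is approximated by q * scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step per row.
print(np.abs(w - w_hat).max())
```

The 4-bit path (QLoRA) follows the same quantize-and-freeze idea but with 4-bit codes for the base weights while the LoRA adapters stay in higher precision.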

Tokenization and Prompt Format

Tokenization specifics (tokenizer type and vocabulary size) are not reported. The conversational prompt format used in examples is a simple chat template:

User: input

Assistant: output

No system-prompt conventions or further tokenizer details are provided.
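A minimal helper matching this template might look as follows; the function name and the multi-turn handling are assumptions for illustration, since the source only documents the two labels:

```python
def build_prompt(turns, user_input):
    """Render a conversation in the simple 'User:/Assistant:' template.

    turns: list of (user, assistant) pairs from earlier turns (assumed
    multi-turn handling; the source only shows the two labels).
    """
    parts = []
    for user, assistant in turns:
        parts.append(f"User: {user}\n\nAssistant: {assistant}")
    # Trailing "Assistant: " cues the model to produce the next reply.
    parts.append(f"User: {user_input}\n\nAssistant: ")
    return "\n\n".join(parts)

print(build_prompt([], "你好"))
```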

Training: Data, Compute, and Fine-tuning

Instruction tuning (on top of the LLaMA base model) uses a hybrid mixture of instruction-style datasets. The reported data mixture includes BELLE, Guanaco, ShareGPT, and a Chinese translation of the Alpaca instruction dataset. Language coverage for the instruction tuning is reported as one language (Chinese).
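Mixing such heterogeneous sources typically requires normalizing each dataset to a common instruction/response schema before concatenation. The sketch below illustrates that step; the field names and records are assumptions for illustration, as BELLE, Guanaco, ShareGPT, and translated-Alpaca each ship their own schema in practice:

```python
def to_pairs(records, input_key, output_key):
    """Normalize one dataset's records to (instruction, response) pairs.

    Key names are illustrative assumptions, not the datasets' real schemas.
    """
    return [(r[input_key], r[output_key]) for r in records]

# Hypothetical records standing in for the real datasets.
belle = [{"instruction": "写一首诗", "output": "示例回答"}]
alpaca_zh = [{"instruction_zh": "解释LoRA", "output_zh": "示例回答"}]

mixture = (to_pairs(belle, "instruction", "output")
           + to_pairs(alpaca_zh, "instruction_zh", "output_zh"))
print(len(mixture))  # 2
```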

Important training and compute notes:

  • Quantization-specific configurations reported as important hyperparameters: 8-bit quantization for the 7B model and 4-bit quantization for the 13B model.
  • Fine-tuning is reported to be feasible on consumer GPUs (examples cite RTX-2080Ti hardware). Reported compute summaries include training on four 2080Ti GPUs, with the 7B model training taking approximately 2.5 days and the 13B model training taking approximately 4 days.
  • A separate reported claim, not obviously consistent with the per-model timings above, states: "Fine-tuning 7B/13B models on consumer GPUs takes approximately 30 hours on an RTX-2080Ti."

Post-training and specialization:

  • Supervised fine-tuning (SFT) was used for domain adaptation; SFT datasets include domain-specific collections from medical and legal fields and fine-tuning on cMedQA2.
  • Preference-alignment methods and data are not specified in the training details.

Evaluation, Capabilities, and Known Behavior

The headline assessment highlights competitive performance in translation, code generation, and domain-specific Q&A. Reported strengths and weaknesses are as follows.

Reported benchmark and behavior findings:

  • On cMedQA 2.0 the reported repetition penalty values are "1 for multiple questions" and "3 for one question."
  • Medical Q&A: reported to have stronger medical question-answering capability compared to models not fine-tuned on medical data.
  • Legal QA: reported to perform well across various legal question-answering tasks.
  • Where the approach is said to win: medical tasks, multi-turn dialogue coherence, real-time legal updates, and instruction-following ability in Chinese.
  • Where it is weaker: tendency to become repetitive under a low repetition penalty and limited ability in role-playing for general scenarios.
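The repetition-penalty values cited above refer to the standard logit-rescaling rule used by common generation libraries (the CTRL-style formulation); the sketch below is an illustrative implementation, not the project's code. With penalty 1 the logits are unchanged, which matches the repetitive behavior reported under a low penalty:

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Penalize tokens already generated: divide positive logits by the
    penalty and multiply negative logits by it (CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 have already been generated; token 2 has not.
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=3.0)
print(penalized)  # [0.666..., -3.0, 0.5]

unchanged = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.0)
print(unchanged)  # [2.0, -1.0, 0.5] -- penalty 1 is a no-op
```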

Notable Numbers, Implementation, and Resources

A selection of notable numeric claims and available resources:

  • "Fine-tuning 7B/13B models on consumer GPUs takes approximately 30 hours on an RTX-2080Ti" (reported verbatim).
  • Implementation resources and documentation are reported as available at: https://github.com/LZY-the-boys/CustomLLMFinetuningHandbook

Summary of Key Technical Elements

The development emphasizes parameter-efficient fine-tuning via LoRA, quantization strategies (QLoRA for 4-bit, 8-bit for smaller models), modular domain adapters for medical and legal use cases, and practical training setups that target consumer-level GPU hardware. The approach prioritizes Chinese instruction-following and domain specialization while enabling cost-effective local deployment.

Sources

https://arxiv.org/abs/2504.12737v1