Experience Summary on Building Vertical Domain-Specific Large Models

Development Achievements

Through pre-training and fine-tuning, our AI team has constructed a large model specialized in state-owned enterprise knowledge. From 2024 to the first half of 2025, we conducted two rounds of vertical domain training. Evaluation metrics have surpassed those of the base model and meet user requirements. The overall progress is as follows:

| Dimension | First Round Training | Second Round Training |
| --- | --- | --- |
| Base Model | Qwen1.5-7B | Qwen2.5-72B |
| Pre-training Corpus Size | ~19.5B tokens (GPT tokenizer) | ~1.7B tokens (Qwen2.5 tokenizer) |
| Domain Fine-tuning Corpus (Q&A Pairs) | 23,144 | 397,355 |
| Covered Domain Tasks | Domain Q&A, classification tasks | Domain Q&A, report generation |
| Model Context Length | 2048 | 8192 |
| Training Method | Pre-training + domain SFT | Pre-training + general SFT + domain SFT |
| Core Task Metrics (ROUGE) | Domain Q&A: +6% over base model; classification: +50% over base model | Domain Q&A: +14% over base model; report generation: +0.65% over base model |
| Existing Issues | Responses overly concise; inadequate for complex scenarios | Content repetition; responses still brief; weak understanding of ultra-long contexts |

Evaluation metrics on state-owned enterprise datasets have surpassed those of the base model, with domain Q&A capabilities outperforming DeepSeek and Doubao models. However, writing task capabilities need enhancement. Key current shortcomings include overly concise generated content and suboptimal long-text and contextual comprehension, which will be the focus of subsequent optimization.
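
The ROUGE figures above compare model outputs against reference answers on the evaluation set. As a minimal sketch of how such a comparison could be run, the following computes a character-level ROUGE-L F1, a simplified variant that works for unsegmented Chinese text; the file names and the JSONL "answer" field are assumptions for illustration, not the team's actual evaluation harness.

```python
# Character-level ROUGE-L sketch in pure Python (a simplified variant;
# not the team's actual evaluation harness). File names are hypothetical.
import json

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (rolling-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over characters, suitable for unsegmented Chinese text."""
    if not prediction or not reference:
        return 0.0
    lcs = lcs_length(prediction, reference)
    precision, recall = lcs / len(prediction), lcs / len(reference)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def load_answers(path: str) -> list[str]:
    """Load one JSON object per line and return its 'answer' field (assumed schema)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["answer"] for line in f]

if __name__ == "__main__":
    refs = load_answers("domain_qa_references.jsonl")       # hypothetical file
    base = load_answers("base_model_outputs.jsonl")          # hypothetical file
    tuned = load_answers("finetuned_model_outputs.jsonl")    # hypothetical file
    score = lambda preds: sum(rouge_l_f1(p, r) for p, r in zip(preds, refs)) / len(refs)
    base_f1, tuned_f1 = score(base), score(tuned)
    print(f"base ROUGE-L F1: {base_f1:.4f}  fine-tuned: {tuned_f1:.4f}")
    print(f"relative change: {(tuned_f1 - base_f1) / base_f1:+.2%}")
```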

Next Steps

Based on the issues above and an analysis of the available professional corpus volume, the directions for the next round of pre-training and fine-tuning are as follows:

  1. Corpus Requirements
    • Core corpus: 10GB+ internal materials
    • Auxiliary corpus: 5GB+ domain-related materials
    • Fine-tuning corpus (a record-format sketch follows this list):
      • Basic Q&A: 50k Q&A pairs
      • In-depth analysis: 150k Q&A pairs
      • Writing tasks: 250k Q&A pairs
  2. Base Model Selection
    Qwen3-32B

  3. Pre-training Method
    Parameter-efficient methods (LoRA); see the configuration sketch after this list
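
For the fine-tuning corpus targets in item 1, a common practice is to store instruction-style Q&A pairs as one JSON object per line and audit the split across task types before training. A minimal sketch follows; the field names ("task", "instruction", "input", "output") and the file name are illustrative assumptions, not the team's actual schema.

```python
# Sketch of a plausible JSONL record layout for the fine-tuning Q&A pairs
# and a basic validation pass. Field names are illustrative assumptions,
# not the team's actual schema.
import json
from collections import Counter

EXAMPLE_RECORD = {
    "task": "domain_qa",  # e.g. domain_qa / deep_analysis / writing
    "instruction": "Summarize the annual report preparation workflow for a state-owned enterprise.",
    "input": "",          # optional extra context for the question
    "output": "The workflow generally covers data collection, drafting, review, and finalization ...",
}

def audit_corpus(path: str) -> tuple[Counter, int]:
    """Count records per task type and flag records missing required fields."""
    counts, malformed = Counter(), 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if not record.get("instruction") or not record.get("output"):
                malformed += 1
                continue
            counts[record.get("task", "unknown")] += 1
    return counts, malformed

if __name__ == "__main__":
    counts, malformed = audit_corpus("domain_sft.jsonl")  # hypothetical file
    print("records per task:", dict(counts), "| malformed:", malformed)
```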
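
For the parameter-efficient approach in item 3, a minimal LoRA setup sketch using Hugging Face transformers and peft is shown below. The model id, target modules, and hyperparameters are assumptions for illustration rather than the team's actual configuration.

```python
# Minimal LoRA setup sketch with Hugging Face transformers + peft.
# Model id, target modules, and hyperparameters are illustrative
# assumptions, not the team's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

BASE_MODEL = "Qwen/Qwen3-32B"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # low-rank dimension (assumed)
    lora_alpha=32,       # scaling factor (assumed)
    lora_dropout=0.05,
    # Attention projection names typical for Qwen-style architectures.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The wrapped model can then be passed to a standard Trainer / SFT trainer
# for continued pre-training on domain text or fine-tuning on the Q&A pairs.
```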