Experience Summary on Building Vertical Domain-Specific Large Models
Development Achievements
Through pre-training and fine-tuning, our AI team has built a large model specialized in state-owned enterprise knowledge. From 2024 through the first half of 2025, we completed two rounds of vertical-domain training. Evaluation metrics now surpass those of the base model and meet user requirements. The overall progress is summarized below:
| Dimension | First Round Training | Second Round Training |
|---|---|---|
| Base Model | Qwen1.5-7B | Qwen2.5-72B |
| Pre-training Corpus Size | ~19.5B tokens (GPT tokenizer) | ~1.7B tokens (Qwen2.5 tokenizer) |
| Domain Fine-tuning Corpus (Q&A Pairs) | 23,144 | 397,355 |
| Covered Domain Tasks | Domain Q&A, Classification Tasks | Domain Q&A, Report Generation |
| Model Context Length (tokens) | 2048 | 8192 |
| Training Method | Pre-training + Domain SFT | Pre-training + General SFT + Domain SFT |
| Core Task Metrics (ROUGE) | Domain Q&A: +6% over base model; Classification: +50% over base model | Domain Q&A: +14% over base model; Report Generation: +0.65% over base model |
| Existing Issues | Overly concise responses; inadequate for complex scenarios. | Content repetition; responses still brief; weak comprehension of ultra-long contexts. |
Evaluation metrics on state-owned enterprise datasets surpass those of the base model, and domain Q&A capability outperforms the DeepSeek and Doubao models. Writing-task capability, however, still needs improvement: the main shortcomings are overly concise generated content and weak long-text and contextual comprehension, and these will be the focus of subsequent optimization.
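The ROUGE gains reported in the table imply a pairwise comparison between each model's answers and reference answers on a shared evaluation set. Below is a minimal sketch of such a comparison, assuming the open-source `rouge-score` package and the ROUGE-L F-measure; the report does not specify the exact ROUGE variant, tooling, or tokenization, and for Chinese answers a character-level or custom tokenizer would typically replace the package's default.

```python
# Minimal sketch of a ROUGE-L comparison between domain-model and base-model
# answers on a shared reference set. The tooling, ROUGE variant, and example
# strings are assumptions, not the team's actual evaluation pipeline.
from rouge_score import rouge_scorer


def mean_rouge_l(predictions, references):
    """Average ROUGE-L F1 over aligned (prediction, reference) pairs."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for pred, ref in zip(predictions, references)
    ]
    return sum(scores) / len(scores)


# Illustrative inputs only; a real run would load the held-out domain test set.
references = ["Fixed-asset disposal requires review by the asset management department and board approval."]
domain_preds = ["Disposal of fixed assets requires asset management department review and board approval."]
base_preds = ["Fixed assets can be disposed of after internal review."]

domain_score = mean_rouge_l(domain_preds, references)
base_score = mean_rouge_l(base_preds, references)
print(f"Relative improvement over base: {(domain_score - base_score) / base_score:.1%}")
```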
Next Steps
Based on the issues above and an analysis of the available professional corpus, the directions for the next stage of pre-training and fine-tuning are:
- Corpus Requirements
  • Core corpus: 10GB+ internal materials
  • Auxiliary corpus: 5GB+ domain-related materials
  • Fine-tuning corpus:
    - Basic Q&A: 50k Q&A pairs
    - In-depth analysis: 150k Q&A pairs
    - Writing tasks: 250k Q&A pairs
- Base Model Selection: Qwen3-32B
- Pre-training Method: parameter-efficient training with LoRA (see the sketch below)
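A minimal sketch of the planned parameter-efficient setup follows, assuming Qwen3-32B loaded from Hugging Face, the PEFT library's LoRA implementation, and a JSONL corpus of question/answer pairs. The model ID, file name, field names, LoRA rank, and training hyperparameters are illustrative assumptions rather than the team's final configuration.

```python
# Minimal sketch of LoRA-based domain training with Hugging Face PEFT.
# Model ID, file name, field names, and hyperparameters are assumptions;
# the same adapter setup applies to both continued pre-training on raw
# corpus and domain SFT on Q&A pairs.
import json

import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "Qwen/Qwen3-32B"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA adapter: rank/alpha/target modules are typical starting points,
# not values taken from the report.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable


def load_qa_pairs(path):
    """Read Q&A pairs from a JSONL file and render them with the chat template."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed fields: "question", "answer"
            text = tokenizer.apply_chat_template(
                [
                    {"role": "user", "content": item["question"]},
                    {"role": "assistant", "content": item["answer"]},
                ],
                tokenize=False,
            )
            records.append({"text": text})
    return Dataset.from_list(records)


dataset = load_qa_pairs("domain_qa.jsonl")  # illustrative file name
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
    remove_columns=["text"],
)

# Placeholder hyperparameters; real runs would tune these per training round.
args = TrainingArguments(
    output_dir="qwen3-32b-domain-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=1e-4,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("qwen3-32b-domain-lora-adapter")
```

In practice, a 32B model at this precision exceeds a single GPU's memory, so quantized adapters (QLoRA) or multi-GPU sharding would be layered on top of this sketch; the trained LoRA adapter can later be merged into the base weights for deployment.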