EMNLP 2025

November 06, 2025

Suzhou, China


Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis and achieve promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, and they generalize poorly across tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domain from corresponding unlabeled data, comprising Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703K examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while using just 17% of its production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Code is available at https://anonymous.4open.science/r/AQuilt-4C3B.
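As a rough illustration of the six components that give AQuilt its name, the sketch below models one synthesized training record plus the inspection-based filtering the abstract alludes to. All names here (AQuiltExample, keep, the pass/fail convention) are hypothetical assumptions for illustration only, not the paper's actual schema or API.

```python
from dataclasses import dataclass

# Hypothetical record for one AQuilt-style synthesized example.
# Field names mirror the six components named in the abstract;
# they are illustrative, not the authors' actual data format.
@dataclass
class AQuiltExample:
    unlabeled_text: str  # U: source passage drawn from the domain corpus
    task_type: str       # T: customizable task instruction, e.g. "QA"
    question: str        # Q: question/instruction grounded in the passage
    logic: str           # L: intermediate reasoning the model should produce
    answer: str          # A: final answer to the question
    inspection: str      # I: self-inspection verdict on the generated pair

def keep(example: AQuiltExample) -> bool:
    """Filter step (assumed): retain only examples whose
    self-inspection verdict indicates a pass."""
    return example.inspection.strip().lower().startswith("pass")

# Usage sketch: filter a batch of synthesized records before
# adding them to the instruction-tuning set.
if __name__ == "__main__":
    batch = [
        AQuiltExample("Aspirin inhibits COX enzymes...", "QA",
                      "What enzyme class does aspirin inhibit?",
                      "The passage states aspirin inhibits COX enzymes.",
                      "Cyclooxygenase (COX) enzymes.", "pass"),
        AQuiltExample("...", "QA", "?", "...", "...", "fail: unsupported"),
    ]
    kept = [ex for ex in batch if keep(ex)]
    print(f"kept {len(kept)} of {len(batch)} examples")
```

The point of pairing a logic field with an inspection field is that the synthesis model both shows its reasoning and judges its own output, so low-quality pairs can be discarded before training.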

