
Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.

EMNLP 2025

QA‑LIGN: Aligning LLMs through Constitutionally Decomposed QA

llm alignment

safety

decomposition

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Controlled paraphrase generation produces rephrasings that preserve meaning while enabling targeted stylistic and linguistic modifications. We introduce LingConv, an encoder-decoder framework supporting fine-grained control over 40 linguistic attributes in English. A novel inference-time quality control mechanism iteratively refines attribute embeddings, yielding paraphrases that closely match target properties without sacrificing semantic fidelity. LingConv reduces attribute error by up to 34% over baselines, with quality control providing an additional 14% improvement.

Linguistically-Controlled Paraphrase Generation

Large language models (LLMs) are ubiquitous in modern-day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential “it,” zero copula, and y'all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

We introduce a Neural-Symbolic Task Planning framework integrating Large Language Model (LLM) decomposition with category-theoretic verification for resource-aware, temporally consistent planning. Our approach represents states as objects and valid operations as morphisms in a categorical framework, ensuring constraint satisfaction through mathematical pullbacks. We employ bidirectional search that simultaneously expands from initial and goal states, guided by a learned planning distance function that efficiently prunes infeasible paths. Empirical evaluations across three planning domains demonstrate that our method outperforms existing baselines by up to 26.1% in completion rates while reducing relative resource violation rate by up to 77%. These results highlight the synergy between LLM-based operator generation and category-theoretic verification for reliable planning in domains requiring both resource-awareness and temporal consistency.

A Category-Theoretic Approach to Neural-Symbolic Task Planning with Bidirectional Search
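The bidirectional search mentioned in the abstract above can be illustrated with a minimal sketch. This is a plain, unguided bidirectional breadth-first search over an undirected graph; it does not include the paper's learned planning distance function or its category-theoretic verification, and the function name `bidirectional_search` and the adjacency-dict graph format are assumptions made for illustration only.

```python
from collections import deque


def bidirectional_search(graph, start, goal):
    """Return a path from start to goal in an undirected graph, or None.

    `graph` is a hypothetical adjacency dict: node -> iterable of neighbours.
    """
    if start == goal:
        return [start]

    # Parent maps double as visited sets, one per search direction.
    parents_fwd = {start: None}
    parents_bwd = {goal: None}
    frontier_fwd = deque([start])
    frontier_bwd = deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one full level; return a meeting node if the two searches touch.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in graph.get(node, ()):
                if nxt in parents:
                    continue
                parents[nxt] = node
                if nxt in other_parents:
                    return nxt
                frontier.append(nxt)
        return None

    while frontier_fwd and frontier_bwd:
        # Alternate: one level from the start side, then one from the goal side.
        meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        if meet is None:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            # Walk back to the start, then forward to the goal.
            path = []
            node = meet
            while node is not None:
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


if __name__ == "__main__":
    demo = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(bidirectional_search(demo, "a", "d"))
```

A guided variant would presumably replace the FIFO frontiers with priority queues ordered by an estimated distance, which is where a learned distance function could be plugged in.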

Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene–task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40 times higher than base prompts. Evaluating 11 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene–task inconsistencies, highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system and can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati, an unwritten language, that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by speakers of unwritten languages worldwide.

Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati
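For readers unfamiliar with the metric reported in the abstract above, a Top-k retrieval rate is commonly computed as the fraction of queries whose relevant document appears among the first k ranked results. The sketch below assumes that common definition and one relevant document per query; the paper's exact evaluation protocol may differ, and the function name and data layout are hypothetical.

```python
def top_k_retrieval_rate(ranked_results, relevant, k=5):
    """Fraction of queries whose relevant document is in the top-k ranking.

    ranked_results: dict mapping query id -> list of document ids, best first.
    relevant: dict mapping query id -> the single relevant document id
              (assumption: exactly one relevant document per query).
    """
    if not ranked_results:
        raise ValueError("ranked_results must contain at least one query")
    hits = sum(
        1 for query, ranking in ranked_results.items()
        if relevant[query] in ranking[:k]
    )
    return hits / len(ranked_results)


if __name__ == "__main__":
    rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
    gold = {"q1": "d1", "q2": "d5"}
    print(top_k_retrieval_rate(rankings, gold, k=2))
```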

Like most languages, sign languages evolve over time. It is important that sign language dictionaries' vocabularies are updated over time to reflect these changes, such as by adding new signs. However, most dictionary retrieval methods based upon machine learning models only work with fixed vocabularies, and it is unclear how they might support dictionary expansion without retraining. In this work, we explore the feasibility of dictionary expansion for sign language dictionaries using a simple representation-based method. We explore a variety of dictionary expansion scenarios, e.g., varying the number of signs added as well as the amount of data available for these newly added signs. Through our results, we show how performance varies significantly across different scenarios, many of which are reflective of real-world data challenges. Our findings offer implications for the development and maintenance of video-based sign language dictionaries, and highlight directions for future research on dictionary expansion.

Investigating Dictionary Expansion for Video-based Sign Language Dictionaries

Vision-Language Models (VLMs) have shown impressive capabilities in image understanding tasks, but their ability to interpret diverse data visualizations with varying information density remains underexplored. This paper presents a systematic evaluation of VLMs on process visualizations, a class of visualization requiring both visual literacy and sequential reasoning. We evaluate models across three process visualization techniques on multiple tasks. We evaluate extraction performance on charts, and construct a benchmark dataset of expert-validated QA pairs. Our findings reveal that while frontier models achieve strong performance on extraction and single-hop reasoning (>80% accuracy), they struggle on multi-hop reasoning tasks. Open-source models struggle with all reasoning tasks. We also perform an analysis of chart information density, showing that model performance decreases as information complexity increases.

ProcVQA: Benchmarking the Effects of Structural Properties in Mined Process Visualizations on Vision–Language Model Performance

Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to the best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing the larger Llama3 405B, GPT-4o, and Claude 3.5 Sonnet.

FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline

Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. Their application to high-stakes tasks like hate speech detection therefore requires critical scrutiny. In this work, we investigate the robustness of hate speech classification using LLMs, particularly when explicit and implicit markers of the speaker's ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker's linguistic identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs, 1 LM, and 5 linguistic identities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.

Who Speaks Matters: Analysing the Influence of the Speaker's Linguistic Identity on Hate Classification
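The abstract above measures brittleness by how often model outputs flip once identity markers are injected. One plausible way to compute such a flip rate is sketched below, assuming it is the fraction of inputs whose predicted label changes between the original and the marker-injected version; the paper's exact definition is not given here, and the function name and label strings are hypothetical.

```python
def flip_rate(original_preds, perturbed_preds):
    """Fraction of examples whose predicted label changes after perturbation.

    Both arguments are equal-length sequences of labels, aligned by example:
    original_preds[i] is the prediction on the unmodified input, and
    perturbed_preds[i] is the prediction once a marker has been injected.
    """
    if len(original_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must be aligned and equal in length")
    if not original_preds:
        raise ValueError("at least one prediction pair is required")
    flips = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return flips / len(original_preds)


if __name__ == "__main__":
    before = ["hate", "not_hate", "hate", "not_hate"]
    after = ["hate", "hate", "not_hate", "not_hate"]
    print(flip_rate(before, after))
```

Computing this rate separately per marker type and per linguistic identity would allow the kind of comparison the abstract describes.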

Large Language Models' (LLMs) ability to converse naturally is empowered by their ability to empathetically understand and respond to their users. However, emotional experiences are shaped by demographic and cultural contexts. This raises an important question: Can LLMs demonstrate equitable empathy across diverse user groups? We propose a framework to investigate how LLMs’ cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender, across four LLMs. Results show that attributes profoundly shape a model's empathetic responses. Interestingly, we see that adding multiple attributes at once can attenuate and reverse expected empathy patterns. We show that model responses broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture. We complement our quantitative findings with qualitative insights to uncover model behavior patterns across different demographic groups. Our findings highlight the importance of designing empathy-aware LLMs that account for demographic diversity to promote more inclusive and equitable model behavior.
