Coreference Resolution: The Overlooked Problem That Still Trips Up Language Models

"The city council denied the marchers a permit because they feared violence." Who feared violence — the city council or the marchers? Human readers resolve "they" instantly using world knowledge and contextual inference. Language models — even large ones — frequently get this wrong. Coreference resolution, the task of determining which mentions in a text refer to the same real-world entity, is deceptively difficult and critically important for almost every downstream NLP task that requires understanding who does what to whom.

What Is Coreference Resolution?

Given the text: "Sarah told her boss she was leaving. He didn't take it well." — coreference resolution must determine that "her" and "she" both refer to Sarah, and "He" refers to Sarah's boss. The output is a set of coreference chains — clusters of mentions that all refer to the same entity.
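As a rough illustration, the chains for this example could be stored as lists of token spans. This is a minimal sketch, not the format of any particular library: the `Mention` class, the tokenization, and the inclusive span convention are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)


def mention_text(tokens, mention):
    """Return the surface string covered by a mention."""
    return " ".join(tokens[mention.start:mention.end + 1])


# Hypothetical tokenization of the example sentence pair.
tokens = ["Sarah", "told", "her", "boss", "she", "was", "leaving", ".",
          "He", "did", "n't", "take", "it", "well", "."]

# Each chain is a list of mentions that refer to the same entity.
chains = [
    [Mention(0, 0), Mention(2, 2), Mention(4, 4)],  # Sarah, her, she
    [Mention(2, 3), Mention(8, 8)],                 # her boss, He
]

for chain in chains:
    print([mention_text(tokens, m) for m in chain])
```

Note that "her" (a mention of Sarah) is nested inside "her boss" (a mention of the boss), which is why spans rather than single tokens are the natural unit.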

Mentions can be:

Pronouns: he, she, it, they, his, her
Definite noun phrases: "the CEO," "the previous model"
Named entities and their aliases: "Barack Obama" and "Obama" in the same document, often alongside a definite description such as "the president"

Why It Matters

Coreference errors propagate into every NLP task that depends on entity tracking:

Information extraction: "Apple announced a new product. The company said it would ship in spring." — without resolving "The company" → Apple and "it" → new product, you cannot extract the full event correctly.
Question answering: "What did the president say about inflation? He stated that..." — answering requires resolving "He" → the president.
Summarization: models that cannot track coreference produce summaries that repeat entity names unnecessarily or lose track of who did what.
Dialogue systems: in a multi-turn conversation, "What is its price?" requires resolving "its" to the entity discussed in a previous turn.
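To make the information extraction case concrete, the sketch below rewrites pronouns using already-resolved chains so that a downstream extractor sees explicit entity names. It assumes chains are given as lists of inclusive `(start, end)` token spans. The function name `substitute_pronouns` and the small pronoun list are illustrative, not taken from any library.

```python
# Subject and object pronouns only. Possessives such as "her" would need
# "'s" added to the replacement, which this sketch does not attempt.
PRONOUNS = {"he", "she", "it", "they", "him", "them"}


def is_pronoun(tokens, span):
    start, end = span
    return start == end and tokens[start].lower() in PRONOUNS


def substitute_pronouns(tokens, chains):
    """Replace pronoun mentions with the first non-pronoun mention of their chain."""
    replacement = {}
    for chain in chains:
        heads = [span for span in chain if not is_pronoun(tokens, span)]
        if not heads:
            continue  # nothing informative to substitute
        head_start, head_end = heads[0]
        head_text = " ".join(tokens[head_start:head_end + 1])
        for span in chain:
            if is_pronoun(tokens, span):
                replacement[span[0]] = head_text
    return " ".join(replacement.get(i, tok) for i, tok in enumerate(tokens))


tokens = ("Apple announced a new product . "
          "The company said it would ship in spring .").split()
chains = [
    [(0, 0), (6, 7)],  # Apple, The company
    [(2, 4), (9, 9)],  # a new product, it
]
print(substitute_pronouns(tokens, chains))
```

If the chains are wrong, the rewritten text is wrong in the same way, which is how coreference errors leak into extraction results.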

Why It Is Hard

Several factors make coreference resolution particularly challenging:

1. World knowledge requirements: in "The trophy wouldn't fit in the suitcase because it was too big," the pronoun "it" refers to the trophy; change "big" to "small" and it refers to the suitcase. Nothing in the syntax distinguishes the two readings. The resolution depends on world knowledge about what it means for one object to fit inside another.

2. Long-range dependencies: in long documents, coreference chains can span hundreds of sentences. The pronoun "she" in paragraph 20 may refer to an entity introduced in paragraph 2.

3. Ambiguous antecedents: "After the negotiation, the delegates met with their advisors. They agreed to a deal." — who agreed? The delegates, the advisors, or both?

4. Bridging anaphora: "I bought a car yesterday. The engine is surprisingly quiet." Here "The engine" is not strictly coreferent with "a car"; it refers to a part of the car. Recovering this implicit part-whole link requires knowing that cars have engines.
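The trophy sentence in item 1 has a well-known counterpart in which "big" is replaced by "small" and the answer flips to the suitcase. A small harness over such minimal pairs is one way to check whether a resolver relies on surface cues. The sketch below is illustrative only: `Schema`, `nearest_candidate`, and `accuracy` are hypothetical names, and the baseline is a deliberately naive heuristic that picks the candidate closest to the pronoun.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Schema:
    sentence: str
    pronoun: str
    candidates: List[str]
    answer: str


PAIR = [
    Schema("The trophy wouldn't fit in the suitcase because it was too big.",
           "it", ["the trophy", "the suitcase"], "the trophy"),
    Schema("The trophy wouldn't fit in the suitcase because it was too small.",
           "it", ["the trophy", "the suitcase"], "the suitcase"),
]


def nearest_candidate(schema: Schema) -> str:
    """Naive baseline: choose the candidate that starts closest before the pronoun."""
    lowered = schema.sentence.lower()
    pronoun_pos = lowered.index(f" {schema.pronoun} ")
    preceding = [c for c in schema.candidates if 0 <= lowered.find(c) < pronoun_pos]
    return max(preceding, key=lowered.find)


def accuracy(resolver: Callable[[Schema], str], schemas: List[Schema]) -> float:
    return sum(resolver(s) == s.answer for s in schemas) / len(schemas)


print(accuracy(nearest_candidate, PAIR))  # 0.5
```

Because the two sentences differ by a single word, any resolver that ignores that word gives the same answer for both and cannot score above 50% on the pair.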

Current Approaches

Neural End-to-End Models

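The end-to-end approach introduced by Lee et al. (2017), and refined with coarse-to-fine inference in Lee et al. (2018), treats coreference as a span-pair scoring problem: enumerate candidate mention spans up to a maximum width, then learn to score which pairs corefer. Later systems kept this architecture and swapped in stronger pretrained encoders such as SpanBERT. This approach significantly outperformed earlier pipeline systems.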
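The span-pair idea can be sketched in a few lines. In the sketch below the scoring functions are hand-written toy stand-ins for what would be learned neural networks in a real system, and all names (`enumerate_spans`, `best_antecedents`, `toy_mention_score`, `toy_pair_score`) are assumptions made for illustration. Each span either links to its best-scoring earlier span or, if no pair scores above zero, to no antecedent at all.

```python
def enumerate_spans(n_tokens, max_width):
    """All (start, end) spans, end inclusive, of at most max_width tokens."""
    return [(i, j)
            for i in range(n_tokens)
            for j in range(i, min(i + max_width, n_tokens))]


def best_antecedents(spans, mention_score, pair_score):
    """Map each span to its best earlier span, or None (the 'no antecedent' option)."""
    links = {}
    for idx, span in enumerate(spans):
        best, best_score = None, 0.0  # the empty antecedent is fixed at score 0
        for candidate in spans[:idx]:
            score = (mention_score(span) + mention_score(candidate)
                     + pair_score(candidate, span))
            if score > best_score:
                best, best_score = candidate, score
        links[span] = best
    return links


# Toy scorers over a four-token sentence; a real model learns these.
TOKENS = ["Sarah", "said", "she", "left"]
LIKELY_MENTIONS = {"Sarah", "she"}


def toy_mention_score(span):
    start, end = span
    return 1.0 if start == end and TOKENS[start] in LIKELY_MENTIONS else -5.0


def toy_pair_score(antecedent, span):
    is_name = antecedent[0] == antecedent[1] and TOKENS[antecedent[0]][0].isupper()
    is_pronoun = span[0] == span[1] and TOKENS[span[0]] == "she"
    return 1.0 if is_name and is_pronoun else -1.0


spans = enumerate_spans(len(TOKENS), max_width=2)
links = best_antecedents(spans, toy_mention_score, toy_pair_score)
print(links[(2, 2)])  # (0, 0): "she" links back to "Sarah"
```

The nested loop makes the cost visible: the number of pairs grows quadratically with the number of candidate spans, which is why practical systems prune low-scoring spans before pairing them.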

Longformer and Long-Document Models

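Standard BERT-style encoders accept at most 512 tokens per input, so BERT-based coreference systems must encode long documents in separate segments, which limits the context available for long-range coreference. Longformer-based coref models extend the input to 4,096 tokens by combining sliding-window attention with a small number of globally attending tokens, which can substantially improve performance on long documents.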
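A common workaround for a fixed input limit is to split the document into overlapping windows so that nearby mentions share at least one segment. The sketch below shows only the windowing arithmetic. The function name and the default `stride` are assumptions, and a real pipeline would also reserve room for special tokens such as [CLS] and [SEP].

```python
def split_into_segments(token_ids, max_len=512, stride=256):
    """Return (start, end) offsets, end exclusive, of overlapping windows."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be between 1 and max_len")
    if not token_ids:
        return []
    segments = []
    start = 0
    while True:
        end = min(start + max_len, len(token_ids))
        segments.append((start, end))
        if end == len(token_ids):
            break
        start += stride
    return segments


document = list(range(1200))  # stand-in for 1,200 token ids
print(split_into_segments(document))
# [(0, 512), (256, 768), (512, 1024), (768, 1200)]
```

Two mentions more than `max_len` tokens apart never appear in the same window under this scheme, which is the limitation that longer-context encoders are meant to address.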

LLM-Based Approaches

Recent work has explored using instruction-tuned LLMs for coreference via prompting. LLMs can handle simple cases reasonably well zero-shot, but their output format is not always reliable enough for production pipelines, and they can hallucinate antecedents for ambiguous mentions.
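One way to guard against both problems is to validate the model's reply before using it. The sketch below assumes the LLM was prompted to return JSON of the form {"clusters": [[mention, ...], ...]}. That schema and the function name `validate_clusters` are assumptions for this example, not a standard. The check rejects malformed replies and drops any mention string that does not occur verbatim in the source text.

```python
import json


def validate_clusters(text, raw_output):
    """Return (clusters, problems) for a raw LLM reply."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]

    clusters = data.get("clusters") if isinstance(data, dict) else None
    if not isinstance(clusters, list):
        return [], ["missing 'clusters' list"]

    valid, problems = [], []
    for cluster in clusters:
        if not isinstance(cluster, list):
            problems.append(f"cluster is not a list: {cluster!r}")
            continue
        kept = []
        for mention in cluster:
            if isinstance(mention, str) and mention in text:
                kept.append(mention)
            else:
                problems.append(f"mention not found in text: {mention!r}")
        if len(kept) >= 2:  # a chain needs at least two mentions
            valid.append(kept)
    return valid, problems


text = "Sarah told her boss she was leaving. He didn't take it well."
reply = '{"clusters": [["Sarah", "her", "she"], ["her boss", "He", "the manager"]]}'
clusters, problems = validate_clusters(text, reply)
print(clusters)
print(problems)
```

Substring matching is a weak check, since it cannot tell two occurrences of "she" apart. Asking the model for character offsets and verifying them against the text would be stricter.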

Practical Implications for NLP Engineers

Do not assume coreference is solved: if your pipeline's output depends on entity tracking (who did what), evaluate coreference resolution explicitly on your domain.
Use domain-appropriate models: general coreference models trained on OntoNotes degrade on biomedical, legal, or conversational text, so domain-specific fine-tuning matters.
Budget for coreference computation: end-to-end neural coref models score pairs of candidate spans, so they tend to cost more than token-level components such as taggers; include them in latency estimates.
Consider pronoun resolution as a fairness issue: biased training data can cause models to systematically misresolve gendered pronouns for certain occupations.
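For the first point, even a small hand-annotated sample from your own domain can be scored. The sketch below computes a simplified B-cubed-style precision, recall, and F1 over mention clusters. It assumes mentions are hashable identifiers and treats a mention missing from one side as a singleton there. That handling is a simplification chosen for this example, so use an established scorer for reportable numbers.

```python
def b_cubed(gold_clusters, predicted_clusters):
    """Simplified B-cubed: per-mention cluster overlap, averaged over mentions."""
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    pred_of = {m: frozenset(c) for c in predicted_clusters for m in c}

    def average_overlap(source, other):
        # For each mention, the share of its cluster in `source` that is also
        # in its cluster in `other` (a singleton if the mention is missing).
        if not source:
            return 0.0
        total = 0.0
        for mention, cluster in source.items():
            other_cluster = other.get(mention, frozenset([mention]))
            total += len(cluster & other_cluster) / len(cluster)
        return total / len(source)

    precision = average_overlap(pred_of, gold_of)
    recall = average_overlap(gold_of, pred_of)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [{"Sarah", "her", "she"}, {"her boss", "He"}]
predicted = [{"Sarah", "her"}, {"she", "her boss", "He"}]  # "she" is misattached
precision, recall, f1 = b_cubed(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Running the same scoring separately on subsets, for example by pronoun gender or by document length, is a simple way to surface the fairness and long-document gaps mentioned above.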

Conclusion

Coreference resolution sits at the heart of language understanding. The field has made substantial progress since the shift to neural models, but the gap between human and model performance remains significant, especially on world-knowledge-dependent cases and long documents. For NLP practitioners, treating coreference as a solved problem is a mistake; treating it as a known-hard subproblem that requires explicit attention is the path to better systems.

Keywords: coreference resolution, NLP, pronoun resolution, entity tracking, SpanBERT, natural language understanding, anaphora resolution, information extraction