I’m happy to share my latest research on visual storytelling. This work addresses a key challenge for large vision-language models: teaching them to distinguish appropriate from inappropriate entity connections across image sequences, preventing false cross-frame relationships that lead to narrative inconsistencies.
What We Studied
Visual storytelling systems face a fundamental problem: they often connect entities across frames when they shouldn’t. Current approaches train exclusively on coherent image sequences, learning when to establish cross-frame connections but never learning when to avoid them. This leads to models that create false relationships between visually similar but contextually distinct entities: imagine an AI incorrectly identifying two different red cars as the same vehicle simply because they share visual features.
We developed a contrastive reinforcement learning framework that extends the Story Reasoning dataset with synthetic negative examples. Our approach creates 4,178 synthetic stories by algorithmically sampling images from different movies, ensuring visual incoherence while maintaining the same structural requirements as real stories.
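To make the negative-example construction concrete, here is a minimal Python sketch of deterministic cross-movie sampling. It is illustrative rather than the actual pipeline: the function name, the seed-per-story scheme, and the frame-path layout are placeholders. The essentials it preserves are that every frame of a synthetic story comes from a different movie, that sampling is deterministic, and that 4,178 stories are generated.

```python
import random
from typing import Dict, List

def sample_negative_story(frames_by_movie: Dict[str, List[str]],
                          num_frames: int, seed: int) -> List[str]:
    # A seeded RNG makes every story reproducible (deterministic sampling).
    rng = random.Random(seed)
    # Draw each frame from a *different* movie, so the resulting
    # sequence cannot be visually coherent.
    movies = rng.sample(sorted(frames_by_movie), k=num_frames)
    return [rng.choice(frames_by_movie[movie]) for movie in movies]

# Hypothetical frame index: movie id -> list of frame paths.
frames_by_movie = {
    f"movie_{i:03d}": [f"movie_{i:03d}/frame_{j:04d}.jpg" for j in range(100)]
    for i in range(25)
}

# One seed per story yields a reproducible set of 4,178 negatives.
negative_stories = [
    sample_negative_story(frames_by_movie, num_frames=5, seed=s)
    for s in range(4178)
]
```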
Our method incorporates three key innovations:
- Synthetic negative story generation using deterministic sampling to create visually incoherent sequences
- Dual-component reward function that promotes entity re-identification in real stories while penalizing it in synthetic ones (a schematic version is sketched after this list)
- Direct Preference Optimization (DPO) training that teaches models to discriminate between coherent and incoherent sequences (the standard DPO loss is shown after this list)
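For intuition, here is a schematic of the dual-component reward. The real reward also has to account for grounding quality and story structure, so treat this as a sign-flip caricature; the function name, arguments, and weight are hypothetical.

```python
def dual_component_reward(num_cross_frame_links: int,
                          is_real_story: bool,
                          weight: float = 1.0) -> float:
    """Sign-flipped reward for cross-frame entity re-identification:
    positive on coherent (real) stories, negative on synthetic
    negatives, where any cross-movie entity link is spurious by
    construction."""
    sign = 1.0 if is_real_story else -1.0
    return sign * weight * num_cross_frame_links
```

The key design choice is the asymmetry: on a synthetic sequence assembled from unrelated movies, any cross-frame entity link is spurious by construction, so penalizing all of them is safe.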
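The preference-training objective itself is standard Direct Preference Optimization. The sketch below assumes sequence-level log-probabilities under the policy and a frozen reference model have already been computed; how preference pairs are formed is simplified here (presumably the story that scores higher under the reward above is "chosen" and the other "rejected").

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: widen the policy's preference margin for
    the chosen story over the rejected one, relative to a frozen
    reference model. Each input is the summed token log-probability of
    one story per batch element."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log(sigmoid(beta * margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```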
Key Findings
Our experimental results demonstrate substantial improvements in entity re-identification and grounding performance:
Significant grounding improvements: Fine-tuning Qwen Storyteller using our contrastive approach improved mean Average Precision (mAP) from 0.27 to 0.31 (+14.8%) and F1 score from 0.35 to 0.41 (+17.1%), indicating better alignment between textual references and visual entities.
Enhanced entity persistence: Cross-frame character and object consistency increased across all frame counts, with the share of entities appearing in 5 or more frames rising from 29.3% to 33.3% (+13.7%), demonstrating improved long-term entity tracking.
Pronoun grounding accuracy: Pronoun-entity alignment improved dramatically, with “he” accuracy increasing from 90.1% to 99.1%, “she” from 91.1% to 98.6%, and possessive pronouns showing similar gains, reducing ambiguous references throughout the generated narratives.
Story structure quality: The proportion of well-structured stories, containing proper chain-of-thought reasoning and grounded narratives, increased from 79.1% to 97.5% (+23.3%), indicating better overall narrative coherence and format compliance.
The results establish that contrastive training with synthetic negative examples significantly improves model discrimination capabilities, teaching systems not only when to establish entity connections but crucially when to avoid them. This approach has important implications for reliable visual storytelling, content generation, and any application requiring consistent entity tracking across visual sequences.