I’m happy to share my latest research on visual storytelling. This work addresses a critical challenge in AI systems: generating coherent stories from visual content while maintaining consistent character identities and avoiding referential hallucinations.
What We Studied
Visual storytelling systems face significant challenges when processing multiple images to create narrative content. Current approaches struggle with two fundamental problems: maintaining character identity across frames and linking actions to the correct subjects. Failures here produce referential hallucinations, where the system attributes an action to the wrong character or refers to the same character inconsistently.
We developed StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images. The dataset pairs structured scene analyses with grounded stories that keep characters and objects consistent across frames by explicitly modeling multi-frame relationships in structured tabular representations.
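To make the idea of a structured tabular representation concrete, here is a minimal sketch of what one row per detected entity per frame might look like. The field names and schema are my own illustration, not the dataset's actual format:

```python
from dataclasses import dataclass

@dataclass
class EntityRecord:
    """One row of a per-frame entity table (hypothetical schema)."""
    entity_id: str    # stable ID shared across frames, e.g. "char1"
    frame: int        # frame index within the story
    label: str        # detector class, e.g. "person"
    bbox: tuple       # (x1, y1, x2, y2) in pixel coordinates
    description: str  # short attribute summary usable in narration

def frames_per_entity(table):
    """Group rows by entity_id to see which entities persist across frames."""
    seen = {}
    for row in table:
        seen.setdefault(row.entity_id, set()).add(row.frame)
    return seen

table = [
    EntityRecord("char1", 0, "person", (10, 20, 110, 220), "man in a red coat"),
    EntityRecord("char1", 1, "person", (40, 25, 140, 230), "man in a red coat"),
    EntityRecord("obj1", 1, "car", (200, 80, 380, 200), "black sedan"),
]
print(frames_per_entity(table))  # {'char1': {0, 1}, 'obj1': {1}}
```

Because every mention of an entity shares one stable ID, a downstream story generator can refer back to the same row instead of re-describing (and potentially confusing) characters frame by frame.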
Our approach incorporates three key innovations:
- Cross-frame object re-identification using visual similarity and face recognition
- Chain-of-thought reasoning for explicit narrative modeling
- Multi-frame grounding scheme that links textual elements to visual entities consistently across multiple images
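As a rough illustration of the first innovation, cross-frame re-identification can be reduced to matching new detections against known entities by embedding similarity. The greedy threshold-based matcher below is a simplified sketch under my own assumptions (real systems typically use learned visual and face embeddings and more robust assignment):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reidentify(known, detections, threshold=0.8):
    """Greedy matching: give each new detection the ID of the most similar
    known entity if similarity clears the threshold; otherwise mint a new ID."""
    assignments = {}
    next_id = len(known)
    for det_name, det_emb in detections.items():
        best_id, best_sim = None, threshold
        for ent_id, ent_emb in known.items():
            sim = cosine(det_emb, ent_emb)
            if sim > best_sim:
                best_id, best_sim = ent_id, sim
        if best_id is None:  # no match: this is a new entity
            best_id = f"char{next_id}"
            next_id += 1
        assignments[det_name] = best_id
    return assignments

known = {"char0": [1.0, 0.0], "char1": [0.0, 1.0]}
detections = {
    "det_a": [0.9, 0.1],    # close to char0
    "det_b": [0.1, 0.9],    # close to char1
    "det_c": [0.7, -0.7],   # matches nobody
}
print(reidentify(known, detections))
# {'det_a': 'char0', 'det_b': 'char1', 'det_c': 'char2'}
```

The same skeleton accommodates a face-recognition signal by combining a face-embedding similarity with the visual similarity before thresholding.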
Key Findings
Our experimental results demonstrate substantial improvements in story generation quality:
Significant reduction in hallucinations: Fine-tuning Qwen2.5-VL 7B on StoryReasoning produced “Qwen Storyteller,” which cut the average number of hallucinations per story from 4.06 to 3.56, a 12.3% reduction relative to the non-fine-tuned model.
End-to-end performance: The fine-tuned model performs comprehensive scene understanding, including object detection, re-identification, and landmark detection, while maintaining consistent object references throughout multi-frame stories.
Structured reasoning capabilities: The chain-of-thought approach enables explicit narrative modeling, allowing the system to reason about character relationships and story progression in a more systematic way.
Cross-frame consistency: Our re-identification system successfully tracks characters and objects across different frames, enabling coherent storytelling that maintains visual and narrative continuity.
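Conceptually, grounding links each textual mention back to a tracked visual entity. The inline tag format below is purely illustrative (an assumption, not necessarily the dataset's actual markup scheme), but it shows how stable entity IDs from re-identification can be threaded through the generated story and checked programmatically:

```python
import re

# Each grounded mention carries the stable entity ID assigned during
# re-identification; the <gdo ...> tag syntax here is hypothetical.
story = ("<gdo char1>The detective</gdo> entered the room, and "
         "<gdo char2>the suspect</gdo> looked away.")

def extract_mentions(text):
    """Return (entity_id, surface_text) pairs from grounded markup."""
    return re.findall(r"<gdo (\w+)>(.*?)</gdo>", text)

print(extract_mentions(story))
# [('char1', 'The detective'), ('char2', 'the suspect')]
```

Grounding of this kind is what makes hallucinations measurable: a mention whose entity ID matches no tracked entity, or that drifts between IDs across frames, can be flagged automatically.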
The results establish that combining structured scene analysis with grounded story generation significantly improves the quality and reliability of AI-generated visual narratives, with important implications for content creation, educational applications, and accessibility tools.