StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

I’m happy to share our latest research on visual storytelling. This work tackles a critical limitation in current storytelling models: semantic hallucination. While existing systems can correctly identify visual entities in images, they often fabricate relationships, misattribute dialogue, and invent emotional states that have no basis in the source material. By aligning visual stories with authentic movie scripts and subtitles, we provide models with the narrative ground truth needed to generate semantically faithful stories.

What We Studied

Visual storytelling models are typically trained on image sequences without access to the original narrative context. This means they must guess at character names, dialogue, and relationships based solely on visual cues — leading to plausible-sounding but factually incorrect narratives. Our work bridges this gap by creating a dataset of 1,757 stories aligned with their corresponding movie scripts and subtitles.

We developed a Longest Common Subsequence (LCS) alignment pipeline that synchronizes screenplay dialogue with subtitle timestamps. The system tokenizes both screenplay and subtitle text, applies LCS matching to identify corresponding dialogue blocks, extends matches bidirectionally until speaker changes occur, and assigns subtitle timestamps to aligned screenplay segments. This enables precise character identification and dialogue attribution grounded in the actual film.
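The alignment idea above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the function name and data shapes are hypothetical, `difflib.SequenceMatcher` stands in for a dedicated LCS matcher over token streams, and the bidirectional extension to speaker boundaries is omitted for brevity.

```python
from difflib import SequenceMatcher

def align_dialogue(script_lines, subtitles):
    """Map screenplay dialogue lines to subtitle time spans via token matching.

    script_lines: list of (speaker, text) pairs from the screenplay.
    subtitles:    list of (start_sec, end_sec, text) triples from the subtitle file.
    Returns {script line index: (earliest start, latest end)} for matched lines.
    """
    # Flatten both sources into token streams, remembering each token's origin.
    script_tokens, script_owner = [], []
    for i, (_, text) in enumerate(script_lines):
        for tok in text.lower().split():
            script_tokens.append(tok)
            script_owner.append(i)

    sub_tokens, sub_owner = [], []
    for j, (_, _, text) in enumerate(subtitles):
        for tok in text.lower().split():
            sub_tokens.append(tok)
            sub_owner.append(j)

    # SequenceMatcher's matching blocks play the role of the LCS here:
    # each block is a run of tokens shared by screenplay and subtitles.
    matcher = SequenceMatcher(a=script_tokens, b=sub_tokens, autojunk=False)
    aligned = {}
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            line = script_owner[a + k]
            start, end, _ = subtitles[sub_owner[b + k]]
            lo, hi = aligned.get(line, (start, end))
            # Widen the line's span to cover every subtitle it matched.
            aligned[line] = (min(lo, start), max(hi, end))
    return aligned
```

Because the returned spans carry both a speaker (from the screenplay) and timestamps (from the subtitles), downstream code can attribute each piece of dialogue to a named character at a known point in the film.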

Our approach builds on a progressive three-stage training pipeline:

  • Stage 1 — Qwen Storyteller: Establishes basic visual grounding with Chain-of-Thought reasoning for entity identification
  • Stage 2 — Qwen Storyteller2: Enhances entity re-identification through contrastive reinforcement learning with synthetic negative examples
  • Stage 3 — Qwen Storyteller3: Integrates semantic alignment with movie scripts and subtitles, teaching the model authentic character names, dialogue, and relationships

Key Findings

Our experimental results demonstrate that grounding stories in authentic screenplay data dramatically improves narrative quality across all dimensions:

Subtitle alignment dominance: Qwen Storyteller3 achieved an 89.9% win rate against the base Qwen2.5-VL 7B model on subtitle alignment, demonstrating that our fine-tuned model generates dialogue far more consistent with what characters actually said in the film.

Synopsis and description alignment: The model achieved an 87.6% win rate on synopsis alignment and 63.4% on description alignment versus the base model, showing improved narrative coherence with the original screenplay across both high-level plot and scene-level details.

Dramatic QA accuracy improvements: On factual question-answering, our model reached 93.9% overall accuracy compared to 63.2% for the baseline. The most striking gains were in character relationships (94.7% vs. 55.3%), character actions (97.4% vs. 72.0%), and emotional states (89.5% vs. 52.2%).

Dialogue attribution improvement: Qwen Storyteller3 demonstrated 48.5% accuracy on the subtitle alignment evaluation versus 38.0% for the original Qwen Storyteller model, confirming that each progressive training stage contributes meaningful improvements to the final system.

These results establish that access to authentic narrative sources — scripts and subtitles — is essential for generating semantically grounded visual stories. Our progressive training approach, culminating in script-subtitle alignment, produces models that tell stories reflecting what actually happened rather than what merely looks plausible.

Resources
