Story Generation from Visual Inputs: A Comprehensive Survey of Techniques and Challenges

I’m happy to share my survey on visual story generation. This work provides an extensive analysis of methodologies, datasets, and evaluation approaches for generating engaging narratives from visual data.

What We Studied

Visual Story Generation (VSG) represents a significant advancement beyond traditional image captioning, requiring systems to understand complex relationships, temporal progression, and implicit context within visual sequences. Rather than simply describing what appears in images, VSG systems must craft coherent narratives that capture the underlying story and engage audiences.

Our survey examines the evolution from early rule-based systems to modern deep learning approaches. We analyze fundamental story elements – character, conflict, theme, setting, plot, and mode – that serve as building blocks for narrative generation. The work covers the progression from template-based methods to sophisticated transformer architectures and Large Language Models.

We systematically review benchmark datasets, including VIST (50,000 stories), Writing Prompts (300,000 text stories), and Visual Writing Prompts (12,000 visually grounded stories), alongside evaluation metrics ranging from BLEU and METEOR to newer approaches such as BERTScore and emerging LLM-based evaluation.
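As a purely illustrative example of how these reference-based metrics are applied in practice (this snippet is mine, not from the survey), the sketch below scores one generated story against one human-written reference using NLTK's BLEU and METEOR implementations and the bert-score package; the two example sentences are invented placeholders.

```python
# Illustrative only: scoring one generated story against one reference story.
# Requires: pip install nltk bert-score  (plus the NLTK WordNet data for METEOR).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet synonym matching
nltk.download("omw-1.4", quiet=True)

reference = "The family packed the car and drove to the lake for a long weekend."
generated = "They loaded up the car and headed to the lake to spend the weekend."

ref_tokens, gen_tokens = reference.split(), generated.split()

# BLEU: n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu([ref_tokens], gen_tokens,
                     smoothing_function=SmoothingFunction().method1)

# METEOR: unigram matching with stemming and WordNet synonyms.
meteor = meteor_score([ref_tokens], gen_tokens)

# BERTScore: token-level similarity in contextual embedding space.
_, _, f1 = bert_score([generated], [reference], lang="en")

print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  BERTScore-F1={f1.item():.3f}")
```

In published evaluations these scores are averaged over an entire test split rather than computed for a single story, but the mechanics are the same.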

Key Findings

Our analysis reveals significant insights about the current state and future directions of visual storytelling:

Evolution of architectures: The field has progressed from CNN-RNN encoder-decoder approaches to more sophisticated systems incorporating object detection and transformer architectures.

Evaluation challenges: Current automatic metrics correlate poorly with human judgment; METEOR, for example, reaches only a 0.22 Pearson correlation with human assessments. This highlights the need for evaluation approaches that better capture creativity, coherence, and emotional engagement (a short sketch of how such metric-to-human correlations are computed appears after these findings).

Technical limitations: Early rule-based and grammar-based systems struggled with language variation and the “show, don’t tell” principle, while modern deep learning approaches excel at fluent narrative generation but may suffer from hallucinations or factual inconsistencies.

Promising directions: Recent work incorporating explicit character modeling (CharGrid) and hypergraph-enhanced reasoning (HEGR) demonstrates improved performance, suggesting that explicitly modeling story elements leads to better narrative quality.

Underexplored potential: Large Language Models remain significantly underutilized in VSG, with most work relying on older models like GPT-2 rather than leveraging the capabilities of modern LLMs.
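To make the evaluation-challenges point concrete: correlations such as the reported 0.22 for METEOR are typically obtained by scoring a set of generated stories with the metric, collecting human ratings for the same stories, and computing the Pearson coefficient between the two. The sketch below shows that computation with SciPy; the scores and ratings are made-up placeholders, not data from the survey.

```python
# Illustrative only: correlating automatic metric scores with human ratings.
from scipy.stats import pearsonr

# Hypothetical per-story values; real studies use hundreds of rated stories.
metric_scores = [0.31, 0.12, 0.45, 0.27, 0.38, 0.19, 0.52, 0.23]
human_ratings = [3.0, 2.5, 3.5, 4.0, 3.0, 2.0, 3.5, 4.5]  # e.g. 1-5 Likert scale

r, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A low r (around 0.2) means the metric explains little of the variation
# in human judgments of story quality.
```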

The survey identifies critical research gaps, including the need for robust automatic evaluation metrics, better integration of LLMs with visual understanding, and explicit modeling of narrative elements like character development and thematic coherence.
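One hedged illustration of the "LLMs plus visual understanding" direction, my own sketch rather than a method from the survey: caption each image in a sequence with an off-the-shelf vision-language model, then prompt a text LLM to weave the captions into a single narrative. The checkpoints and file paths below are placeholders chosen only so the example is concrete.

```python
# Illustrative caption-then-narrate pipeline; not a method proposed in the survey.
# Requires: pip install transformers torch pillow
from transformers import pipeline

# 1) Describe each frame with an image-captioning model (checkpoint is illustrative).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
image_paths = ["frame1.jpg", "frame2.jpg", "frame3.jpg"]  # placeholder paths
captions = [captioner(path)[0]["generated_text"] for path in image_paths]

# 2) Ask a text LLM to connect the captions into a story. "gpt2" is used here
#    only so the sketch runs without gated weights; a modern instruction-tuned
#    LLM would replace it in practice.
storyteller = pipeline("text-generation", model="gpt2")
prompt = ("Write a short story that connects these scenes:\n- "
          + "\n- ".join(captions) + "\nStory:")
story = storyteller(prompt, max_new_tokens=150, do_sample=True,
                    return_full_text=False)[0]["generated_text"]
print(story)
```

The weakness of this two-stage pattern, and a reason the survey calls for tighter integration, is that any detail lost at the captioning stage is unavailable to the language model when it writes the story.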

Resources
