I’m excited to share my latest research on visually grounded image captioning. This work introduces GroundCap, a novel dataset that addresses a critical limitation in current image captioning systems: the inability to link descriptive text to specific visual elements.
What We Studied
Current image captioning models can describe what they see, but they cannot point to specific objects as they describe them. When humans talk about visual scenes, we naturally point at specific elements to resolve ambiguity. GroundCap bridges this gap by providing captions that are explicitly grounded in visual elements of the image.
We created a dataset of 52,016 images from 77 movies with grounded captions that maintain object identity across references. Our approach uses a novel ID-based grounding system with three types of tags:
- Objects (`<gdo>`): specific items in the scene
- Actions (`<gda>`): what is happening, linked to the objects performing them
- Locations (`<gdl>`): where things are occurring
The system assigns each object a unique identifier (e.g., person-0, person-1) and tracks it consistently throughout the caption, enabling complex references such as linking actions to specific performers.
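To make the tagging scheme concrete, here is a minimal sketch of how such a grounded caption could be parsed. The attribute syntax (`id="person-0"`), the example caption, and the helper name are assumptions for illustration; the dataset's actual markup may differ.

```python
import re
from collections import defaultdict

# Hypothetical example caption; the exact tag attribute syntax is an assumption.
caption = (
    '<gdo id="person-0">A man</gdo> '
    '<gda id="person-0">walks</gda> toward '
    '<gdo id="person-1">a woman</gdo> standing in '
    '<gdl id="location-0">a dimly lit hallway</gdl>.'
)

TAG_PATTERN = re.compile(r'<(gdo|gda|gdl) id="([^"]+)">(.*?)</\1>')

def parse_groundings(text):
    """Group every grounded span by the object ID it refers to."""
    groundings = defaultdict(list)
    for tag, obj_id, span in TAG_PATTERN.findall(text):
        groundings[obj_id].append({"tag": tag, "text": span})
    return dict(groundings)

print(parse_groundings(caption))
# {'person-0': [{'tag': 'gdo', 'text': 'A man'}, {'tag': 'gda', 'text': 'walks'}],
#  'person-1': [{'tag': 'gdo', 'text': 'a woman'}],
#  'location-0': [{'tag': 'gdl', 'text': 'a dimly lit hallway'}]}
```

Grouping spans by ID is what lets a consumer of the dataset verify that every mention of person-0, whether an object or an action tag, points back to the same detected entity.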
Key Findings
Our experimental results demonstrate several important insights:
Strong baseline performance: Fine-tuning Pixtral-12B on GroundCap achieved 96% grounding recall, meaning the model referenced nearly every detected object in its captions, alongside a METEOR score of 0.23 for language quality.
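Grounding recall here means the fraction of detected objects that the caption references at least once. A minimal sketch of that computation, assuming detections and caption references are both available as sets of object IDs (the function name and inputs are illustrative, not the project's code):

```python
def grounding_recall(detected_ids, referenced_ids):
    """Fraction of detected objects that the caption references at least once."""
    detected = set(detected_ids)
    if not detected:
        return 1.0  # nothing to ground
    return len(detected & set(referenced_ids)) / len(detected)

# Example: four detected objects, three of them referenced in the caption.
detected = {"person-0", "person-1", "car-0", "location-0"}
referenced = {"person-0", "person-1", "location-0"}
print(grounding_recall(detected, referenced))  # 0.75
```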
Human evaluation validation: Human-refined captions received high ratings across all criteria (4.34/5.0 overall quality), with particular strength in grounding precision and comprehensive object coverage.
Novel evaluation approach: We introduced gMETEOR, a metric combining caption quality with grounding accuracy. Additionally, ChatGPT-4o evaluations showed strong correlation with human judgments (0.78 Pearson correlation), suggesting potential for scalable automated assessment.
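As an illustration of how caption quality and grounding accuracy might be folded into a single score, the sketch below combines METEOR with a grounding F1 via a harmonic mean; treat the combination rule and the precision value as assumptions, since the exact gMETEOR formula is not spelled out here.

```python
def harmonic_mean(a, b):
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def g_meteor(meteor_score, grounding_precision, grounding_recall):
    """Combine language quality (METEOR) with grounding accuracy (F1)."""
    grounding_f1 = harmonic_mean(grounding_precision, grounding_recall)
    return harmonic_mean(meteor_score, grounding_f1)

# Example with the numbers reported above (grounding precision is illustrative).
print(round(g_meteor(0.23, 0.90, 0.96), 3))  # ~0.369
```

A harmonic mean is a natural choice for this kind of combined metric because a caption that reads well but grounds poorly (or vice versa) cannot score highly.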
Annotation efficiency: Our refinement-based approach, where annotators improve machine-generated captions rather than writing from scratch, reduced annotation time while maintaining quality and consistency.
The results establish that visually grounded captioning can achieve both accurate object reference and natural language quality, which matters for applications that require verifiable image descriptions.