Guided Querying over Videos using Autocompletion Suggestions

The Ohio State University


To appear at HILDA 2024

A system that autocompletes queries for retrieval of video content based on zero-shot video understanding enhances user interactivity.

Abstract

A critical challenge with querying video data is that the user is often unaware of the contents of the video, its structure, and the exact terminology to use in the query. While these problems exist in exploratory querying settings over traditional structured data, they are exacerbated for video data, where the information is sourced from human-annotated metadata or from computer vision models running over the video. In the absence of any guidance, the user is at a loss for where to begin the query session or how to construct the query. Here, autocompletion-based user interfaces have become a popular and pervasive approach to interactive, keystroke-level query guidance. To guide the user through the query construction process, we develop methods that combine Vision Language Models and Large Language Models for generating query suggestions that are amenable to autocompletion-based user interfaces. Through quantitative assessments over real-world datasets, we demonstrate that our approach provides a meaningful benefit to query construction for video queries.

The Video Query Suggestion System

Traditional Video Database Management Systems (VDBMS) vs. Ours

Traditional VDBMS

  • Limited Information Utilization: Traditional VDBMS rely solely on the information provided by deep learning models. They can only analyze videos based on the features extracted by these models.
  • Simple Analyses: These systems are capable of basic analyses, such as identifying object locations and directions within videos.
  • Inability to Handle Complex Queries: When faced with complex queries (e.g., finding a video of “a man unloading a truck”), traditional VDBMS may struggle to retrieve the relevant content.

Autocompletion System using Vision Language Models

  • Multimodal Capabilities: By integrating Vision Language Models (VLMs), a class of multimodal large language models, the system can process data from multiple modalities (e.g., text, images, videos), allowing for more comprehensive analysis.
  • Behavior and State Analysis: The proposed approach analyzes the behavior and state of objects within videos, providing a more detailed understanding of the video content.
  • Refined Search Queries: Integrating VLMs with the VDBMS enables the system to refine and automatically complete user search queries, improving the overall search experience; a rough sketch of this pipeline appears below.
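The sketch below illustrates, at a high level, how a VLM and an LLM could be chained to build a pool of candidate query phrases for autocompletion. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names (vlm_caption, llm_expand, build_suggestion_pool), the sample frame paths, and the placeholder captions and phrases are all hypothetical.

from typing import List


def vlm_caption(frame_path: str) -> str:
    # Placeholder for a Vision Language Model call that captions one sampled frame.
    return "a man unloading boxes from a truck"


def llm_expand(caption: str, n: int = 5) -> List[str]:
    # Placeholder for an LLM call that rewrites a caption into short,
    # search-style phrases describing objects, their states, and behaviors.
    return [
        "man unloading a truck",
        "boxes being carried from a truck",
        "person working near a parked truck",
    ][:n]


def build_suggestion_pool(frame_paths: List[str]) -> List[str]:
    # Turn sampled video frames into a deduplicated pool of candidate query phrases.
    pool = []
    for path in frame_paths:
        pool.extend(llm_expand(vlm_caption(path)))
    return list(dict.fromkeys(pool))  # deduplicate, preserving order


if __name__ == "__main__":
    print(build_suggestion_pool(["frame_000.jpg", "frame_030.jpg"]))

The resulting phrase pool is what the autocompletion interface draws from as the user types.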

Example

Sample prompt for autocompletion

In the example above, once the user has typed part of a query, the system suggests completions based on the context of the query, helping the user construct more accurate and relevant queries.
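To make the interaction concrete, here is a minimal sketch of how a typed prefix could be matched against the candidate phrase pool. The function name (suggest) and the ranking scheme (exact prefix matches first, then substring matches) are illustrative assumptions, not the paper's exact scoring.

from typing import List


def suggest(prefix: str, candidates: List[str], k: int = 10) -> List[str]:
    # Return up to k candidate phrases consistent with what the user has typed so far.
    prefix = prefix.lower().strip()
    starts = [c for c in candidates if c.lower().startswith(prefix)]
    contains = [c for c in candidates if prefix in c.lower() and c not in starts]
    return (starts + contains)[:k]


if __name__ == "__main__":
    pool = ["man unloading a truck", "man opening a car door", "truck parked on a street"]
    print(suggest("man un", pool, k=10))  # -> ["man unloading a truck"]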


Experiment

We conducted an experiment to evaluate how much our segmented search phrases improve the search experience, measuring the average minimal keystrokes (MKS) required to complete a search query using our system.
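The sketch below shows one way MKS can be simulated: the user types the target query one character at a time and stops as soon as it appears among the top-k suggestions. This is an assumption about the accounting; the paper's exact definition (e.g., whether selecting a suggestion costs an extra keystroke) may differ, and the helper names (minimal_keystrokes, average_mks, top_k) are hypothetical.

from typing import Callable, List


def minimal_keystrokes(target: str,
                       suggest: Callable[[str, int], List[str]],
                       k: int = 10) -> int:
    # Simulate a user typing the target one character at a time, stopping
    # as soon as the target query appears among the top-k suggestions.
    for i in range(len(target) + 1):
        prefix = target[:i]
        if target in suggest(prefix, k):
            return i  # the suggestion surfaced after i keystrokes
    return len(target)  # the suggestion never surfaced; the full query was typed


def average_mks(queries: List[str],
                suggest: Callable[[str, int], List[str]],
                k: int = 10) -> float:
    # Average MKS over a workload of target queries.
    return sum(minimal_keystrokes(q, suggest, k) for q in queries) / len(queries)


if __name__ == "__main__":
    pool = ["car entering a garage", "man unloading a truck", "man opening a door"]
    top_k = lambda prefix, k: [c for c in pool if c.startswith(prefix)][:k]
    print(minimal_keystrokes("man unloading a truck", top_k, k=1))  # -> 1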

V2V Performance

Distribution of MKS when k = 10

When suggesting up to 10 queries (k = 10) based on the user's input, our system requires on average 10 fewer keystrokes to complete the desired search query.

BibTeX

@inproceedings{yoo2024guided,
        title={Guided Querying over Videos using Autocompletion Suggestions},
        author={Yoo, Hojin and Nandi, Arnab},
        booktitle={Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics},
        pages={1--7},
        year={2024}
}