A critical challenge with querying video data is that the user is often unaware of the video's contents, its structure, and the exact terminology to use in the query. While these problems exist in exploratory querying over traditional structured data, they are exacerbated for video data, where the information is sourced from human-annotated metadata or from computer vision models running over the video. In the absence of any guidance, the user is at a loss as to where to begin the query session or how to construct the query. Autocompletion-based user interfaces have become a popular and pervasive approach to interactive, keystroke-level query guidance. To guide the user through query construction, we develop methods that combine Vision Language Models and Large Language Models to generate query suggestions amenable to autocompletion-based user interfaces. Through quantitative assessments over real-world datasets, we demonstrate that our approach provides a meaningful benefit to query construction for video queries.
Sample prompt for autocompletion
In the example above, once the user has typed part of a query, the system suggests completions grounded in the video's content and the context of what has been typed so far. This feature helps users construct more accurate and relevant queries.
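As a rough illustration, the sketch below shows one way prefix-based completion over VLM/LLM-generated candidate phrases could work. The function name `suggest_completions`, the example candidate phrases, and the length-based ranking heuristic are illustrative assumptions, not the system's actual implementation.

```python
# Minimal sketch of prefix-based query autocompletion, assuming candidate
# query phrases were generated offline by a VLM/LLM pipeline over the video.
# All names and data here are illustrative, not taken from the paper.

def suggest_completions(prefix: str, candidates: list[str], k: int = 10) -> list[str]:
    """Return up to k candidate queries that extend the user's typed prefix."""
    prefix = prefix.lower().strip()
    matches = [c for c in candidates if c.lower().startswith(prefix)]
    # Prefer shorter completions so the most concise queries surface first.
    return sorted(matches, key=len)[:k]

# Hypothetical candidate phrases describing events detected in a traffic video.
candidates = [
    "a red car turns left at the intersection",
    "a red car stops at the traffic light",
    "a pedestrian crosses the street",
]
print(suggest_completions("a red car", candidates, k=10))
```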
We conducted an experiment to evaluate how effectively our segmented search phrases improve the search experience, measuring the average minimal keystrokes (MKS) a user needs to complete a search query with our system.
Distribution of MKS when k = 10
When suggesting up to k = 10 queries based on the user's input, our system requires on average 10 fewer keystrokes to complete the desired search query.
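To make the metric concrete, here is a minimal sketch of how MKS could be computed, reusing `suggest_completions` and `candidates` from the sketch above. The counting convention (for example, whether selecting a suggestion costs an extra keystroke) is an assumption and may differ from the paper's exact definition.

```python
# Sketch: MKS as the number of characters typed before the target query
# first appears in the top-k suggestions. Reuses suggest_completions and
# candidates from the earlier sketch; the workload below is hypothetical.

def minimal_keystrokes(target: str, candidates: list[str], k: int = 10) -> int:
    """Keystrokes typed before the target query shows up among the suggestions."""
    for typed in range(len(target) + 1):
        if target in suggest_completions(target[:typed], candidates, k):
            return typed
    return len(target)  # never suggested; the whole query had to be typed

queries = [
    "a red car turns left at the intersection",
    "a pedestrian crosses the street",
]
avg_mks = sum(minimal_keystrokes(q, candidates, k=10) for q in queries) / len(queries)
print(f"average MKS: {avg_mks:.1f}")
```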
@inproceedings{yoo2024guided,
  title     = {Guided Querying over Videos using Autocompletion Suggestions},
  author    = {Yoo, Hojin and Nandi, Arnab},
  booktitle = {Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics},
  pages     = {1--7},
  year      = {2024}
}