Effective video search is increasingly challenging due to the inherent complexity and richness of video content, which traditional full-text query systems and text-based autocompletion methods struggle to capture. In this work, we propose an autocompletion system that integrates visual cues, specifically representative emojis, into the query formulation process to enhance video search efficiency. Our approach leverages Vision-Language Models (VLMs) to generate detailed scene descriptions from videos and employs Large Language Models (LLMs) to distill these descriptions into succinct, segmented search phrases augmented with context-specific emojis. A controlled user study, conducted with 11 university students using the MSVD dataset, demonstrates that emoji-enhanced autocompletion reduces the average query completion time by 2.27 seconds (a 14.6% decrease) compared to traditional text-based methods, while qualitative feedback indicates mixed but generally positive user perceptions. These results highlight the potential of combining linguistic and visual modalities to redefine interactive video search experiences.
Generation Procedure of Emoji-Enhanced Autocompletion
Our system uses a two-stage pipeline. First, a VLM generates a detailed textual description of the video's content. An LLM then distills this description into concise, segmented search phrases, assigning a representative emoji and an importance score to each segment. The result is a compact, semantically rich query suggestion.
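As a concrete illustration, the following minimal sketch wires the two stages together in Python. The stubs describe_video and distill_phrases are hypothetical placeholders for the actual VLM and LLM calls; only the final parsing step, which turns the model's string output into a Python list, is shown in full.

import ast

def describe_video(video_path: str) -> str:
    # Stage 1 (placeholder): query a VLM for a detailed textual
    # description of the video's content.
    raise NotImplementedError("substitute a VLM call here")

def distill_phrases(description: str, prompt: str) -> str:
    # Stage 2 (placeholder): ask an LLM to distill the description into
    # segmented, emoji-annotated search phrases, returned as a string
    # containing a Python list literal (see the prompt below).
    raise NotImplementedError("substitute an LLM call here")

def build_suggestions(video_path: str, prompt: str) -> list[dict]:
    description = describe_video(video_path)
    raw = distill_phrases(description, prompt)
    # The prompt constrains the model to output only a Python list
    # literal, so ast.literal_eval (which accepts literals only,
    # unlike eval) can parse the response directly.
    return ast.literal_eval(raw)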
To generate the emoji-enhanced search phrases, we provide a detailed prompt to the LLMs. This prompt includes specific instructions and examples to guide the model in creating concise, relevant, and visually annotated phrases from a given video description. Below is a sample of the prompt structure we use.
Extract at most 10 search phrases and emojis from the video description paragraph provided above that can be used to find the video.
Requirements:
- Each search phrase must include the objects' actions, characters, and background directly from the video description.
- Accurately capture the relationships or interactions between objects/characters when applicable.
- Every search phrase must be concise and intuitive, between 5 and 10 words.
- Use diverse vocabulary for each phrase, avoiding repetitive or overly similar phrases.
- The emojis should help users understand the search phrase visually, as in the examples below.
- Assign an emoji to each segment that carries a specific meaning within the whole phrase, as in the examples below.
- Do NOT generate search phrases related to feelings, atmosphere, or emotions.
- Do NOT include phrases describing scene cuts or camera motions.
Example:
- 🚦intersection 🚗red car ➡️turning right 🚧cautiously
- 🚚truck 💥crash 🏍️with motorcycle 😱in front of ego vehicle
- 🚴cyclist ⬅️makes left turn 🚦at intersection
- 🚗car 🛑stops 🚦at red light ⬇️slowing down its speed
- 👕white shirt 👩woman 🚶at crosswalk 👖wearing black pants
Please generate a Python list as a string. Each list item should be a dictionary with the following keys: "phrase", "split", "emojis", and "importance".
- The value of "phrase" should be a string representing a search phrase.
- The value of "split" should be a list of strings, where each item represents a meaningful segment of the search phrase.
- The value of "emojis" should be a list of emojis corresponding to each item in "split". The lengths of "split" and "emojis" must be the same.
- The value of "importance" should be a list of floating-point numbers representing the relative significance of each phrase segment in "split", where higher values indicate greater importance, all values sum to exactly 1.0, and the list length matches that of "split".
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only return the Python list as a string.
For example, your response should look like this:
[
    {
        "phrase": "intersection red car turning right cautiously",
        "split": ["intersection", "red car", "turning right", "cautiously"],
        "emojis": ["🚦", "🚗", "➡️", "🚧"],
        "importance": [0.2, 0.4, 0.3, 0.1]
    },
    {
        "phrase": "truck crash with motorcycle in front of ego vehicle",
        "split": ["truck", "crash", "with motorcycle", "in front of ego vehicle"],
        "emojis": ["🚚", "💥", "🏍️", "😱"],
        "importance": [0.3, 0.3, 0.3, 0.1]
    }
]
A Structured Prompt Design for Phrase and Emoji Generation
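Because the model is constrained to return only a Python list literal, its output can be validated mechanically against the schema above. The sketch below is one possible consumer; the field names come from the prompt, while the floating-point tolerance and the decision to raise on malformed items are our own assumptions.

import ast
import math

def parse_suggestions(raw: str) -> list[dict]:
    suggestions = ast.literal_eval(raw)
    for item in suggestions:
        split = item["split"]
        emojis = item["emojis"]
        importance = item["importance"]
        # "split", "emojis", and "importance" must be parallel lists
        # of equal length.
        if not (len(split) == len(emojis) == len(importance)):
            raise ValueError(f"mismatched segment lists: {item['phrase']!r}")
        # Importance weights must sum to exactly 1.0 per the prompt;
        # a small tolerance absorbs floating-point rounding.
        if not math.isclose(sum(importance), 1.0, abs_tol=1e-6):
            raise ValueError(f"importance does not sum to 1.0: {item['phrase']!r}")
    return suggestions

In practice, a consumer might drop individual items that fail these checks rather than rejecting the entire response.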
We conducted a user study with 11 university students to evaluate the usability of the emoji-enhanced autocompletion system for video search.
Comparison of Average Query Completion Time
The results show that the mean Query Completion Time (QCT) for the text-only autocompletion was 15.55 seconds, whereas the emoji-text autocompletion reduced the average QCT to 13.28 seconds, a decrease of 2.27 seconds (14.6%).
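As a quick arithmetic check, the 14.6% figure quoted in the abstract follows directly from these two means:

text_only_qct = 15.55   # mean QCT, text-only autocompletion (seconds)
emoji_text_qct = 13.28  # mean QCT, emoji-text autocompletion (seconds)

reduction = text_only_qct - emoji_text_qct   # 2.27 seconds
percent = 100 * reduction / text_only_qct    # ~14.6% decrease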