Effective video search is increasingly challenging due to the inherent complexity and richness of video content, which traditional full-text query systems and text-based autocompletion methods struggle to capture. In this work, we propose an autocompletion system that integrates visual cues, specifically representative emojis, into the query formulation process to enhance video search efficiency. Our approach leverages Vision-Language Models (VLMs) to generate detailed scene descriptions from videos and employs Large Language Models (LLMs) to distill these descriptions into succinct, segmented search phrases augmented with context-specific emojis. A controlled user study, conducted with 11 university students using the MSVD dataset, demonstrates that emoji-enhanced autocompletion reduces the average query completion time by 2.27 seconds (a 14.6% decrease) compared to traditional text-based methods, while qualitative feedback indicates mixed but generally positive user perceptions. These results highlight the potential of combining linguistic and visual modalities to redefine interactive video search experiences.
Generation Procedure of Emoji-Enhanced Autocompletion
Our system uses a two-stage pipeline. First, a VLM generates a detailed textual description of the video's content. An LLM then distills this description into concise, segmented search phrases, assigning a representative emoji and an importance score to each segment. The result is a compact, semantically rich query suggestion.
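As a concrete illustration, the following minimal sketch wires the two stages together in Python. The stubs describe_video and distill_phrases are hypothetical placeholders for the actual VLM and LLM calls; only the final parsing step, which turns the model's string output into a Python list, is shown in full.

import ast

def describe_video(video_path: str) -> str:
    # Stage 1 (placeholder): query a VLM for a detailed textual
    # description of the video's content.
    raise NotImplementedError("substitute a VLM call here")

def distill_phrases(description: str, prompt: str) -> str:
    # Stage 2 (placeholder): ask an LLM to distill the description into
    # segmented, emoji-annotated search phrases, returned as a string
    # containing a Python list literal (see the prompt below).
    raise NotImplementedError("substitute an LLM call here")

def build_suggestions(video_path: str, prompt: str) -> list[dict]:
    description = describe_video(video_path)
    raw = distill_phrases(description, prompt)
    # The prompt constrains the model to output only a Python list
    # literal, so ast.literal_eval (which accepts literals only,
    # unlike eval) can parse the response directly.
    return ast.literal_eval(raw)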
To generate the emoji-enhanced search phrases, we provide a detailed prompt to the LLMs. This prompt includes specific instructions and examples to guide the model in creating concise, relevant, and visually annotated phrases from a given video description. Below is a sample of the prompt structure we use.
Extract at most 10 search phrases and emojis from the video description paragraph provided above that can be used to find the video.
Requirements:
- Each search phrase must include the objects' actions, characters, and background directly from the video description.
- Accurately capture the relationships or interactions between objects/characters when applicable.
- Every search phrase must be concise and intuitive, between 5 and 10 words.
- Use diverse vocabulary for each phrase, avoiding repetitive or overly similar phrases.
- The emojis should help users understand the search phrase visually, as in the examples below.
- Assign an emoji to each segment that carries a specific meaning within the whole phrase, as in the examples below.
- Do NOT generate search phrases related to feelings, atmosphere, or emotions.
- Do NOT include phrases describing scene cuts or camera motions.
Example:
- 🚦intersection 🚗red car ➡️turning right 🚧cautiously
- 🚚truck 💥crash 🏍️with motorcycle 😱in front of ego vehicle
- 🚴cyclist ⬅️makes left turn 🚦at intersection
- 🚗car 🛑stops 🚦at red light ⬇️slowing down its speed
- 👕white shirt 👩woman 🚶at crosswalk 👖wearing black pants
Please generate a Python list as a string. Each list item should be a dictionary with the following keys: "phrase", "split", "emojis", and "importance".
- The value of "phrase" should be a string representing a search phrase.
- The value of "split" should be a list of strings, where each item represents a meaningful segment of the search phrase.
- The value of "emojis" should be a list of emojis corresponding to each item in "split". The lengths of "split" and "emojis" must be the same.
- The value of "importance" should be a list of floating-point numbers representing the relative significance of each phrase segment in "split", where higher values indicate greater importance, all values sum to exactly 1.0, and the list length matches that of "split".
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only return the Python list as a string.
For example, your response should look like this:
[
    {
        "phrase": "intersection red car turning right cautiously",
        "split": ["intersection", "red car", "turning right", "cautiously"],
        "emojis": ["🚦", "🚗", "➡️", "🚧"],
        "importance": [0.2, 0.4, 0.3, 0.1]
    },
    {
        "phrase": "truck crash with motorcycle in front of ego vehicle",
        "split": ["truck", "crash", "with motorcycle", "in front of ego vehicle"],
        "emojis": ["🚚", "💥", "🏍️", "😱"],
        "importance": [0.3, 0.3, 0.3, 0.1]
    }
]
A Structured Prompt Design for Phrase and Emoji Generation
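Because the model is constrained to return only a Python list literal, its output can be validated mechanically against the schema above. The sketch below is one possible consumer; the field names come from the prompt, while the floating-point tolerance and the decision to raise on malformed items are our own assumptions.

import ast
import math

def parse_suggestions(raw: str) -> list[dict]:
    suggestions = ast.literal_eval(raw)
    for item in suggestions:
        split = item["split"]
        emojis = item["emojis"]
        importance = item["importance"]
        # "split", "emojis", and "importance" must be parallel lists
        # of equal length.
        if not (len(split) == len(emojis) == len(importance)):
            raise ValueError(f"mismatched segment lists: {item['phrase']!r}")
        # Importance weights must sum to exactly 1.0 per the prompt;
        # a small tolerance absorbs floating-point rounding.
        if not math.isclose(sum(importance), 1.0, abs_tol=1e-6):
            raise ValueError(f"importance does not sum to 1.0: {item['phrase']!r}")
    return suggestions

In practice, a consumer might drop individual items that fail these checks rather than rejecting the entire response.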
We conducted a user study with 11 university students to evaluate the usability of the emoji-enhanced autocompletion system for video search.
Comparison of Average Query Completion Time
The results show that the mean Query Completion Time (QCT) for the text-only autocompletion was 15.55 seconds, whereas the emoji-text autocompletion reduced the average QCT to 13.28 seconds, a decrease of 2.27 seconds (14.6%).
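As a quick arithmetic check, the 14.6% figure quoted in the abstract follows directly from these two means:

text_only_qct = 15.55   # mean QCT, text-only autocompletion (seconds)
emoji_text_qct = 13.28  # mean QCT, emoji-text autocompletion (seconds)

reduction = text_only_qct - emoji_text_qct   # 2.27 seconds
percent = 100 * reduction / text_only_qct    # ~14.6% decrease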