vidformer - Video Data Transformation


A research project providing infrastructure for video interfaces and pipelines. Developed by the OSU Interactive Data Systems Lab.

🎯 Why vidformer

Vidformer efficiently transforms video data, enabling faster annotation, editing, and processing without having to focus on performance.

It uses a declarative specification format to represent transformations. This enables:

  • ⚡ Transparent Optimization: Vidformer optimizes the execution of declarative specifications just like a relational database optimizes relational queries.

  • ⏳ Lazy/Deferred Execution: Video results can be retrieved on-demand, allowing for practically instantaneous playback of video results.

  • 🔄 Transpilation: Vidformer specifications can be created from existing code (like cv2).

πŸš€ Quick Start


The easiest way to get started is using vidformer's cv2 frontend, which allows most Python OpenCV visualization scripts to replace import cv2 with import vidformer.cv2 as cv2:

import vidformer.cv2 as cv2

cap = cv2.VideoCapture("my_input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

out = cv2.VideoWriter("my_output.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                        fps, (width, height))
while True:
    ret, frame = cap.read()
    if not ret:
        break

    cv2.putText(frame, "Hello, World!", (100, 100), cv2.FONT_HERSHEY_SIMPLEX,
                1, (255, 0, 0), 1)
    out.write(frame)

cap.release()
out.release()

You can find details on this in our Getting Started Guide.

📘 Documentation

πŸ” About the project

Vidformer is a highly modular suite of tools that work together; these are detailed here.

❌ vidformer is NOT:

  • A conventional video editor (like Premiere Pro or Final Cut)
  • A video database/VDBMS
  • A natural language query interface for video
  • A computer vision library (like OpenCV)
  • A computer vision AI model (like CLIP or YOLO)

However, vidformer is highly complementary to each of these. If you're working on any of the latter four, vidformer may be for you.


License: Vidformer is open source under Apache-2.0. Contributions welcome.

Getting Started

Install

Using vidformer requires the Python client library, vidformer-py, and a yrden server which is distributed through vidformer-cli.

vidformer-py

pip install vidformer

vidformer-cli

🐳 Docker:

docker pull dominikwinecki/vidformer:latest
docker run --rm -it -p 8000:8000 dominikwinecki/vidformer:latest yrden --print-url

This launches a vidformer yrden server, which is our reference server implementation for local usage, on port 8000. If you want to read or save video files locally, add -v /my/local/dir:/data and then reference them as /data in the code.
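
For example, a full command with a local directory mounted might look like this (replace /my/local/dir with your own path):

docker run --rm -it -p 8000:8000 -v /my/local/dir:/data \
    dominikwinecki/vidformer:latest yrden --print-url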

To use:

import vidformer as vf
server = vf.YrdenServer(domain="localhost", port=8000)

# or for cv2
import vidformer.cv2 as cv2
cv2.set_cv2_server(server)

Precompiled binary:

Precompiled binaries are available for vidformer releases.

For example:

wget https://github.com/ixlab/vidformer/releases/download/<version>/vidformer-cli-ubuntu22.04-amd64
sudo mv vidformer-cli-ubuntu22.04-amd64 /usr/local/bin/vidformer-cli
sudo chmod +x /usr/local/bin/vidformer-cli
sudo apt install -y libopencv-dev libfdk-aac-dev

To use:

import vidformer as vf
server = vf.YrdenServer(bin="vidformer-cli")

or

export VIDFORMER_BIN='vidformer-cli'
import vidformer as vf
server = vf.YrdenServer()

Build from Sources

vidformer-cli can be compiled from our git repo with a standard cargo build.

This depends on the core vidformer library which itself requires linking to FFmpeg and OpenCV. Details are available here.

Getting Started - cv2

This is a walkthrough of getting started with vidformer's OpenCV cv2 compatibility layer.

⚠️ Adding cv2 functions is a work in progress. See the cv2 filters page for which functions have been implemented.

Installation

See the Installation guide.

Or you can Open In Colab.

⚠️ Due to how Colab networking works, vidformer can't stream/play results in Colab, only save them to disk. cv2.vidplay() will not work!

Hello, world!

Copy in your video, or use ours:

curl -O https://f.dominik.win/data/dve2/tos_720p.mp4

Then just replace import cv2 with import vidformer.cv2 as cv2. Here's our example script:

import vidformer.cv2 as cv2

cap = cv2.VideoCapture("tos_720p.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

out = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                        fps, (width, height))
while True:
    ret, frame = cap.read()
    if not ret:
        break

    cv2.putText(frame, "Hello, World!", (100, 100), cv2.FONT_HERSHEY_SIMPLEX,
                1, (255, 0, 0), 1)
    out.write(frame)

cap.release()
out.release()

Stream the Results

Saving videos to disk works, but we can also display them in the notebook. Since we stream the results and only render them on demand, this can start practically instantly!

First, replace "output.mp4" with None to skip writing the video to disk. Then you can use cv2.vidplay() to play the video!

import vidformer.cv2 as cv2

cap = cv2.VideoCapture("tos_720p.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

out = cv2.VideoWriter(None, cv2.VideoWriter_fourcc(*"mp4v"),
                        fps, (width, height))
while True:
    ret, frame = cap.read()
    if not ret:
        break

    cv2.putText(frame, "Hello, World!", (100, 100), cv2.FONT_HERSHEY_SIMPLEX,
                1, (255, 0, 0), 1)
    out.write(frame)

cap.release()
out.release()

cv2.vidplay(out)

⚠️ By default cv2.vidplay() will return a video which plays in a Jupyter notebook. If running outside a Jupyter notebook, you can pass method="link" to return a link instead.
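
For example, a minimal sketch of both modes, assuming out is the VideoWriter from the script above:

# in a Jupyter notebook: embed a playable video
cv2.vidplay(out)

# outside a notebook: get a link to the stream instead
url = cv2.vidplay(out, method="link")
print(url)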

Getting Started - DSL

This is a walkthrough of getting started with the vidformer-py core DSL.

Installation

See the Installation guide.

Hello, world!

⚠️ We assume this is in a Jupyter notebook. If not, .play() won't work and you'll have to use .save() instead.

We start by connecting to a server and registering a source:

import vidformer as vf
from fractions import Fraction

server = vf.YrdenServer(domain='localhost', port=8000)

tos = vf.Source(
    server,
    "tos_720p",     # name (for pretty printing)
    "https://f.dominik.win/data/dve2/tos_720p.mp4",
    stream=0,       # index of the video stream we want to use
)

print(tos.ts())
print(tos.fmt())

This will print the timestamps of all the frames in the video, and then the format information. This may take a few seconds the first time, but frame times are cached afterwards.

> [Fraction(0, 1), Fraction(1, 24), Fraction(1, 12), Fraction(1, 8), ...]
> {'width': 1280, 'height': 720, 'pix_fmt': 'yuv420p'}

Now let's create a 30-second clip starting at the 5-minute mark. The source video is at a constant 24 FPS, so let's create a 24 FPS output as well:

domain = [Fraction(i, 24) for i in range(24 * 30)]

Now we need to render each of these frames, so we define a render function.

def render(t: Fraction, i: int):
    clip_start_point = Fraction(5 * 60, 1) # start at 5 * 60 seconds
    return tos[t + clip_start_point]

We used timestamp-based indexing here, but you can also use integer indexing (tos.iloc[i + 5 * 60 * 24]).
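
For example, a sketch of the same render function using integer indexing instead:

def render(t: Fraction, i: int):
    # 5 minutes * 60 seconds * 24 frames/second = index of the clip's first frame
    return tos.iloc[i + 5 * 60 * 24]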

Now we can create a spec and play it in the browser. We create a spec from the resulting video's frame timestamps (domain), a function to construct each output frame (render), and the output video's format (matching tos.fmt()).

spec = vf.Spec(domain, render, tos.fmt())
spec.play(server)

This plays this result:

Some Jupyter environments are weird (e.g., VS Code), so .play() might not work. Using .play(..., method="iframe") may help.

It's worth noting that we are playing frames in order here and outputting video at the same framerate we received, but that doesn't need to be the case. Here are some other things you can now try (the first is sketched after this list):

  • Reversing the video
  • Doubling the speed of the video
    • Either double the framerate or sample every other frame
  • Shuffling the frames into a random order
  • Combining frames from multiple videos
  • Creating a variable frame rate video
    • Note: .play() will not work with VFR, but .save() will.
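
For example, a minimal sketch of reversing the video, reusing domain, tos, and server from the example above and only changing the render function:

clip_len = 24 * 30  # number of frames in the 30-second clip

def render(t: Fraction, i: int):
    clip_start_point = Fraction(5 * 60, 1)
    # output frame i is the clip's frame (clip_len - 1 - i), so the clip plays backwards
    return tos[clip_start_point + Fraction(clip_len - 1 - i, 24)]

spec = vf.Spec(domain, render, tos.fmt())
spec.play(server)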

Bounding Boxes

Now let's overlay some bounding boxes over the entire video:

# Load some data
import urllib.request, json 
with urllib.request.urlopen("https://f.dominik.win/data/dve2/tos_720p-objects.json") as r:
    detections_per_frame = json.load(r)

bbox = vf.Filter("BoundingBox") # load the built-in BoundingBox filter

domain = tos.ts() # output should have the same frame timestamps as the source video

def render(t, i):
    return bbox(
        tos[t],
        bounds=detections_per_frame[i])

spec = vf.Spec(domain, render, tos.fmt())
spec.play(server)

This plays this result (video is just a sample clip):

Composition

We can place frames next to each other with the HStack and VStack filters. For example, HStack(left_frame, middle_frame, right_frame, width=1280, height=720, format="yuv420p") will place three frames side-by-side.

As a larger example, we can view a window function over frames as a 5x5 grid:

hstack = vf.Filter("HStack")
vstack = vf.Filter("VStack")

w, h = 1920, 1080

def create_grid(tos, i, N, width, height, fmt="yuv420p"):
    grid = []
    for row in range(N):
        columns = []
        for col in range(N):
            index = row * N + col
            columns.append(tos.iloc[i + index])
        grid.append(hstack(*columns, width=width, height=height//N, format=fmt))
    final_grid = vstack(*grid, width=width, height=height, format=fmt)
    return final_grid

domain = [Fraction(i, 24) for i in range(0, 5000)]

def render(t, i):
    return create_grid(tos, i, 5, w, h)

fmt = {'width': w, 'height': h, 'pix_fmt': 'yuv420p'}

spec = vf.Spec(domain, render, fmt)
spec.play(server)

This plays this result (video is just a sample clip):

Viewing Telemetry (and User-Defined Filters)

This notebook shows how to build custom filters to overlay data.

This plays this result (video is just a sample clip):

Concepts & Data Model

vidformer builds on the data model introduced in the V2V paper.

  • Frames are single images. A frame is represented by its resolution and pixel format (the type and layout of pixels in memory, such as rgb24, gray8, or yuv420p).

  • Videos are sequences of frames represented as an array. We index these arrays by rational numbers corresponding to their timestamp.

  • Filters are functions which construct a frame. Filters can take inputs, such as frames or data. For example, DrawText may draw some text on a frame.

  • Specs declaratively represent a video synthesis task. They represent the construction of a result video, which is itself modeled as an array.

    • Specs primarily contain domain and render functions.
      • A spec's domain function returns the timestamps of the output frames.
      • A spec's render function returns a composition of filters used to construct a frame at a specific timestamp.
  • Data Arrays allow using data in specs symbolically, as opposed to inserting constants directly into the spec. These allow for deduplication and loading large data blobs efficiently.

    • Data Arrays can be backed by external data sources, such as SQL databases.

The vidformer Tools

vidformer is a highly modular suite of tools that work together:

  • vidformer-py: A Python 🐍 client for declarative video synthesis

    • Provides an easy-to-use library for symbolically representing transformed videos
    • Acts as a client for a VoD server (i.e., for yrden)
    • Using vidformer-py is the best place to get started
  • libvidformer: The core data-oriented declarative video editing library

    • An embedded video processing execution engine with low-level interfaces
    • Systems code, written in Rust 🦀
    • You should use if: You are building a VDBMS or other multimodal data-system infrastructure.
    • You should not use if: You just want to use vidformer in your workflows or projects.
  • yrden: A vidformer Video-on-Demand server

    • Provides vidformer services over a REST-style API
    • Allows for client libraries to be written in any language
    • Serves video results via HLS streams
    • Designed for local single-tenant use
    • You should use if: You want to create faster video results in your workflows or projects.
    • Note that yrden servers may be spun up transparently by client libraries, so you might use yrden without realizing it.
  • igni: A planned scale-out Video-on-Demand server

    • Will allow for scalable and secure public-facing VOD endpoints

Client libraries in other languages: Writing a vidformer client library for other languages is simple. It's a few hundred lines of code, and you just have to construct some JSON. Contributions or suggestions for other languages are welcome.

Other VoD servers: We provide yrden as a simple reference VoD server implementation. If you need scale-out deployments, multi-tenant deployments, or deep integration with a specific system, writing another VoD server is needed. (In-progress work.)

vidformer-py


vidformer-py is a Python 🐍 interface for vidformer. Our getting started guide explains how to use it.


Publish:

export FLIT_USERNAME='__token__' FLIT_PASSWORD='<token>'
flit publish

vidformer - Video Data Transformation Library


(lib)vidformer is a core video synthesis/transformation library. It handles the movement, control flow, and processing of video and conventional (non-video) data.


About

  • It's written in Rust 🦀
    • So it does some fancy parallel processing and does so safely
  • Uses the FFmpeg libav libraries for multimedia stuff
    • So it should work with nearly every video file ever made
  • Uses Apache OpenDAL for I/O
    • So it can access videos in a bunch of storage services
  • Implements filters using OpenCV

Building

This crate requires linking with FFmpeg, as detailed in the rusty_ffmpeg crate. We currently target FFmpeg 7.0.

Filters

Built-in Filters

While most applications will use user-defined filters, vidformer ships with a handful of built-in filters to get you started:

DrawText

DrawText does exactly what it sounds like: draw text on a frame.

For example:

DrawText(frame, text="Hello, world!", x=100, y=100, size=48, color="white")

BoundingBox

BoundingBox draws bounding boxes on a frame.

For example:

BoundingBox(frame, bounds=obj)

Where obj is JSON with this schema:

[
  {
    "class": "person",
    "confidence": 0.916827917098999,
    "x1": 683.0721842447916,
    "y1": 100.92174338626751,
    "x2": 1006.863525390625,
    "y2": 720
  },
  {
    "class": "dog",
    "confidence": 0.902531921863556,
    "x1": 360.8750813802083,
    "y1": 47.983140622720974,
    "x2": 606.76171875,
    "y2": 717.9591837897462
  }
]

Scale

The Scale filter transforms one frame type to another. It changes both resolution and pixel format. This is the most important filter and is essential for building with vidformer.

Arguments:

Scale(
    frame: Frame,
    width: int = None,
    height: int = None,
    pix_fmt: str = None)

By default, missing width, height, and pix_fmt values are set to match frame. pix_fmt must match FFmpeg's name for a pixel format.

For example:

frame = Scale(frame, width=1280, height=720, pix_fmt="rgb24")
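
Since unspecified arguments default to the input frame's values, a pixel-format-only conversion might look like this:

frame_rgb = Scale(frame, pix_fmt="rgb24")  # width and height default to frame's own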

IPC

IPC allows calling User-Defined Filters (UDFs) running on the same system. It is an infrastructure-level filter used to implement other filters. It is configured with a socket path and a func (the filter's name), both strings.

The IPC filter cannot be invoked directly; rather, IPC filters are constructed by a server upon request. This can be difficult, but vidformer-py handles it for you. As of right now, IPC only supports rgb24 frames.

HStack & VStack

HStack & VStack compose multiple frames together, stacking them either horizontally or vertically. They try to automatically find a reasonable layout.

Arguments:

HStack(
    *frames: list[Frame],
    width: int,
    height: int,
    format: str)

At least one frame is required, along with a width, height and format.

For example:

compilation = HStack(left_frame, right_frame, width=1280, height=720, format="rgb24")

OpenCV/cv2 Functions

See vidformer.cv2 API docs.

⚠️ The cv2 module is a work in progress. If you find a bug or need a missing feature implemented feel free to file an issue or contribute yourself!

Legend:

  • ✅ - Supported
  • 🔸 - Supported via OpenCV cv2
  • ❌ - Not yet implemented

Vidformer-specific Functions

  • cv2.vidplay(video2) - Play a VideoWriter, Spec, or Source
  • VideoWriter.spec() - Return the Spec of an output video
  • Frame.numpy() - Return the frame as a numpy array
  • cv2.setTo - The OpenCV Mat.setTo function (not in cv2)
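
For example, a short sketch using a few of these alongside the cv2 script from the getting started guide (assuming cap and out from that script):

ret, frame = cap.read()
arr = frame.numpy()    # materialize the lazy frame as a numpy array

out.write(frame)
spec = out.spec()      # the declarative Spec behind the output video
cv2.vidplay(out)       # play the output (works on a VideoWriter, Spec, or Source)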

opencv

Class                Status
VideoCapture         ✅
VideoWriter          ✅
VideoWriter_fourcc   ✅

Function             Status
imread               ✅
imwrite              ✅

opencv.imgproc

Drawing Functions:

Function                 Status
arrowedLine              ✅
circle                   ✅
clipLine                 ❌
drawContours             ❌
drawMarker               ❌
ellipse                  ❌
ellipse2Poly             ❌
fillConvexPoly           ❌
fillPoly                 ❌
getFontScaleFromHeight   🔸
getTextSize              🔸
line                     ✅
polylines                ❌
putText                  ✅
rectangle                ✅

opencv.core

Function      Status
addWeighted   ✅

User-Defined Filters

To implement a new user-defined filter (UDF), you need to host a filter server over a UNIX domain socket. The vidformer-py library makes this easy.

Filters take some combination of frames and data (string, int, bool) and return a single frame result. The vidformer project uses Python-style arguments, allowing ordered and named arguments (*args and **kwargs style).

To do this we define a new filter class and host it:

import vidformer as vf
import cv2

class MyFilter(vf.UDF):

    def filter(self, frame: vf.UDFFrame, name: str):
        """Return the result frame."""

        text = f"Hello, {name}!"

        image = frame.data().copy()
        cv2.putText(
            image,
            text,
            (100, 100),
            cv2.FONT_HERSHEY_SIMPLEX,
            1,
            (255, 0, 0),
            1,
        )
        return vf.UDFFrame(image, frame.frame_type())

    def filter_type(self, frame: vf.UDFFrameType, _name: str):
        """Returns the type of the output frame."""
        return frame

mf_udf = MyFilter("MyFilter") # name used for pretty printing

my_filter = mf_udf.into_filter() # host the UDF in a subprocess, returns a vf.Filter

Now we can use our newly-created filter in specs: my_filter(some_frame, "vidformer").

There is a catch: UDFs currently only support the rgb24 pixel format, so invoking my_filter requires converting to and from rgb24:

scale = vf.Filter('Scale')

def render(t, i):
    f = scale(tos[t], pix_fmt="rgb24", width=1280, height=720)
    f = my_filter(f, "world")
    f = scale(f, pix_fmt="yuv420p", width=1280, height=720)
    return f

Roadmap

An unordered list of potential future features:

  • Igni - A scale-out multi-tenant vidformer server in the cloud

  • Supervision Integration

  • Full GPU Acceleration

  • WebAssembly Builds

  • WebAssembly user-defined filters & specs

FAQ

What video formats does vidformer support?

In short, essentially everything. vidformer uses the FFmpeg/libav* libraries internally, so any media FFmpeg works with should work in vidformer as well. We support many container formats (e.g., mp4, mov) and codecs (e.g., H.264, VP8).

A full list of supported codecs enabled in a vidformer build can be found by running:

vidformer-cli codecs

Can I access remote videos on the internet?

Yes, vidformer uses Apache OpenDAL for I/O, so most common data/storage access protocols are supported. However, not all storage services are enabled in distributed binaries. We guarantee that HTTP, S3, and the local filesystem always work.
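
For example, with the cv2 frontend shown earlier, a local file and a video served over HTTP open the same way (the paths are illustrative):

import vidformer.cv2 as cv2

cap_local = cv2.VideoCapture("/data/my_input.mp4")  # local filesystem
cap_http = cv2.VideoCapture("https://f.dominik.win/data/dve2/tos_720p.mp4")  # HTTP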

How does vidformer compare to FFmpeg?

vidformer is far more expressive than the FFmpeg filter interface. Mainly, vidformer is designed for working with data, so edits are created programmatically and can reference data. Also, vidformer enables serving result videos on demand.

vidformer uses the FFmpeg/libav* libraries internally, so any media FFmpeg works with should also work in vidformer.

How does vidformer compare to OpenCV/cv2?

vidformer orchestrates data movement in video synthesis tasks, but does not implement image processing directly. Most use cases will still use OpenCV for this.