Machine Video Understanding

Eluvio Tech
Jul 17, 2020


The Eluvio Content Fabric is a new global substrate for managing and distributing premium video content. In recent months it has gained attention because it works without traditional services and designs, including CDNs, cloud stacks, transcoding services, and databases, serving video and experiences directly from source, just in time, with low-latency streaming and without making or copying files.

A recent innovation in the Content Fabric is a universal content tagging service and API for video and audio understanding. The service consists of native tagging of audio and video content during ingest using original ML models, plus a distributed search capable of querying content by the resulting tags, including people, places, OCR/text, objects, kinetic activities, and brands. The API allows Content Providers to create dynamic, personalized content for linear sequences/channels or VoD programming.
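
As an illustration only, the sketch below shows what a client-side tag query might look like. The endpoint path, query parameters, and authorization scheme here are assumptions made for the example, not the Fabric's documented API.

```python
import os
import requests

# Hypothetical endpoint; the real route, query syntax, and
# authorization scheme are defined by the Content Fabric API.
FABRIC_SEARCH_URL = "https://fabric-node.example/q/search"

def search_tags(query: str, fields=("celebrity", "activity", "brand")):
    """Query content objects by their ML-generated tags, e.g.
    people, places, OCR/text, objects, kinetic activities, brands."""
    resp = requests.get(
        FABRIC_SEARCH_URL,
        params={"terms": query, "select": ",".join(fields)},
        headers={"Authorization": f"Bearer {os.environ['FABRIC_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # matching objects with frame-level tag hits

# e.g. results = search_tags("goal celebration")
```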

Through its just-in-time programmatic capabilities, the Content Fabric can stream frame-accurate, dynamic VoD and live content “stitched” in real time. The selection of what is stitched can be based on any metadata (itself part of the content object), including the ML tags. For example, a broadcast channel could “switch” between two feeds, such as prime programming and a sports feed, triggered when the tagging detects the game resuming, or when a particular player appears, for personalization.
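
To make the switching rule concrete, here is a minimal sketch of the kind of logic an operator might attach to the incoming tag stream. The TagEvent structure, label strings, and confidence threshold are hypothetical stand-ins for whatever schema the tagger actually emits.

```python
from dataclasses import dataclass

@dataclass
class TagEvent:
    """One ML tag emitted for a frame of a live feed (illustrative
    structure, not the Fabric's actual tag schema)."""
    feed: str          # e.g. "prime" or "sport"
    label: str         # e.g. "game_resumed" or "player:Jane Doe"
    frame: int
    confidence: float

def choose_feed(current_feed: str, event: TagEvent,
                favorite_player: str) -> str:
    """Decide which feed the stitched channel plays next: switch to
    the sport feed when play resumes or when the viewer's favorite
    player is detected; otherwise stay on the current feed."""
    if event.feed == "sport" and event.confidence > 0.8:
        if event.label in ("game_resumed", f"player:{favorite_player}"):
            return "sport"
    return current_feed
```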

The Tagging API could also be used to programmatically automate many manual media curation tasks, including:

- automatic identification of credits via OCR (a sketch follows below),
- automatic creation of “posters” from key frames,
- automatic identification of scene transition points,
- identification of key brands or activities for advertising insertion, and
- automatic identification of localization requirements (e.g. banned scenes).

For example, in the content below the tags respond to the query “find all scenes with Reid Scott driving a car”.
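
As a sketch of the first item, automatic credits identification: one simple heuristic is to find the point where OCR'd text lines become dense and stay dense. The (frame, text) tag format and the thresholds below are assumptions for the example, not the Tagging API's actual output schema.

```python
def find_credits_start(ocr_tags, fps=24, window_s=10, min_lines=8):
    """Return the first frame of a window whose OCR text density
    suggests rolling credits. `ocr_tags` is assumed to be a list of
    (frame, text) pairs from the ingest tagger, sorted by frame."""
    if not ocr_tags:
        return None
    window = int(window_s * fps)
    last_frame = ocr_tags[-1][0]
    for start in range(0, last_frame + 1, window):
        lines = [t for f, t in ocr_tags if start <= f < start + window]
        if len(lines) >= min_lines:
            return start  # first dense text window: candidate credits
    return None
```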

These key points detected in the media can be persistently saved to the content itself via the Content Fabric’s APIs, and then used to drive dynamic video operations such as ad insertion, automatic metadata enrichment, and automatic clipping, editing, or track substitution, offline or in real time.
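
A minimal sketch of persisting tags back to a content object follows. The /meta/video_tags path, write-token handling, and parameter names are illustrative assumptions; the Fabric's actual metadata-write API may differ.

```python
import json
import requests

def save_tags(node_url, library_id, write_token, auth_token, tags):
    """Persist detected tags into the content object's metadata so
    that downstream operations (ad insertion, clipping, track
    substitution) can read them. Paths shown are hypothetical."""
    url = f"{node_url}/qlibs/{library_id}/q/{write_token}/meta/video_tags"
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {auth_token}",
                 "Content-Type": "application/json"},
        data=json.dumps(tags),
        timeout=30,
    )
    resp.raise_for_status()
```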

The Tagger API was built from the ground up by our team and trained on both large-scale public data sets and customer-specific data sets. It is now deployed in the production Content Fabric for automatic tagging of content on ingest, with the opportunity for iterative, continuous training.

Figure: “Mix of Experts” Tagger API network
Figure: Tagging pipeline in the Content Fabric

Our early work has shown that domain-specific training greatly improves accuracy. For example, we have partnered with a large public broadcaster to further train our mix-of-experts approach, achieving very accurate retrieval of news clips with a video-text retrieval (VTR) pipeline fine-tuned on 60,000 shot lists.
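
For readers unfamiliar with the technique, the sketch below shows the core scoring step of mixture-of-experts video-text retrieval: each expert (appearance, motion, audio, OCR, and so on) embeds the video in its own space, and text-conditioned gating weights combine the per-expert similarities. This is a generic reconstruction of the approach, not Eluvio's actual model.

```python
import numpy as np

def moe_retrieval_scores(text_embs, expert_video_embs, gate_weights):
    """Score videos against one text query under a mix of experts.

    text_embs:         dict expert_name -> (d,) query embedding
    expert_video_embs: dict expert_name -> (n_videos, d) video embeddings
    gate_weights:      dict expert_name -> scalar weight, summing to 1
                       (in practice predicted from the text query)
    """
    n_videos = next(iter(expert_video_embs.values())).shape[0]
    scores = np.zeros(n_videos)
    for name, vids in expert_video_embs.items():
        txt = text_embs[name]
        sims = (vids @ txt) / (
            np.linalg.norm(vids, axis=1) * np.linalg.norm(txt) + 1e-8
        )
        scores += gate_weights[name] * sims  # weighted cosine similarity
    return scores  # rank videos by descending score
```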

Figure: state-of-the-art performance of Mix of Experts VTR models trained on a generic data set (6,000 YouTube clips)
Figure: fine-grained training on a well-labeled news clip archive (60,000 shot lists) significantly improves the base VTR model trained on the generic data set (0.8 vs. 0.5)

We are integrating the per-content-object tags (at the frame and object level) into our Content Fabric search API to allow wider testing and use of the capability across the media libraries in the Content Fabric. We welcome partnerships with media companies that would like to collaborate on fine-grained, domain-specific training. For more information, please email us at info@eluv.io.
