#### AI

# Video Scene Detection Using Optimal Sequential Grouping

September 13, 2018 | Written by: IBM Research Editorial Staff

Categorized: AI | IBM Research-Haifa

Share this post:

Our team at IBM Research recently developed a new approach for automated video scene detection. Videos are used today for everything from entertainment and marketing to knowledge-sharing, news, and social journaling. Automated scene detection can help consumers and enterprises utilize this video content in new ways.

Video scene detection is the task of temporally dividing a video into semantic scenes. A scene is defined as a series of consecutive shots which depicts some high-level concept or story (where each shot is a series of frames taken from the same camera at the same time). In Hollywood films, for example, different scenes may consist of a car chase scene or a relaxed picnic on a beach. In a talk show or news broadcast, a scene might be defined by a specific topic that was discussed.

Once a video has been split into scenes, each scene can be tagged and enriched with metadata—based on what is said, where it occurs, what activity is taking place, the setting, the background—in order to augment the video with consumable information. Dividing a video into scenes and using AI to label these segments enables the creation of an inverted index for video content that is searchable.

We formulated the scene detection task as a generic optimization problem with a novel normalized cost function, aimed at optimal grouping of consecutive shots into scenes. We described the technique in our paper, Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection (written by Daniel Rotman, Dror Porat, Gal Ashour and Udi Barzelay), at the 2018 ACM International Conference on Multimedia Retrieval. We also published our dataset.

# Unsupervised video scene detection

A common approach for video scene detection is to equate the problem to a standard clustering task. But such an approach neglects an important element, which is the inherent temporal structure of the video. Clustering shots independently of their location and order in the video can lead to marking a scene as a non-contiguous segment, in contrast to the definition of a scene as a temporally continuous entity.

# Unsupervised scene detection as an optimization process

To overcome these issues, we formulate the problem of video scene detection as an optimization process of sequential grouping. We choose optimal locations for division based on the pairwise distances between shot features: after applying shot boundary detection, we create a feature vector for each shot using both visual and audio features. Using a distance metric, we measure the distance between each pair of shots, resulting in a distance matrix.

The intuition is that shots within the same scene will have a low distance value, and shots from different scenes will result in a large distance value. Choosing the features correctly will likely result in a diagonal-block-like structure where each block corresponds to a scene. Obviously, for real videos the distance matrix is not as ideal, but the block structure can still be observed.

Essentially the problem definition can be simplified to finding the optimal divisions along the diagonal. The approach can then be to define a cost function that can measure the quality of a given division and then optimize this cost.

# A new normalized cost function for measuring the quality of the division

When attempting to decide on the cost function, one intuitive way to measure the quality of a division is with an *additive cost function *(*H _{add}*) that sums up all the values in the blocks on the diagonal.

In the example, we would sum the values in blocks 1-5.

Another option is using an *average cost function* (*H _{avg}*) which sums up the average value of each block on the diagonal.

The above cost functions (*H _{add}*

_{, }

*H*) are biased due to inherent mathematical properties (see next paragraph). Therefore, in our paper we introduce a

_{avg}*normalized cost function*(

*H*) which provides the desired results by measuring the average value of the entire blocks on the diagonal.

_{nrm}Let’s examine an experiment that highlights the aforementioned mathematical bias. Suppose we have a synthetic distance matrix with values that are uniformly distributed and are interested in a division into two. In such a matrix all points of division should be equally acceptable and so we would expect the cost function to have a constant response over the point of division. In practice we see in Figure 7 (which denotes the values of the cost function when dividing the matrix into two as a function of the point of division) that only *H _{nrm}* results in the expected behavior.

Keeping the cost function mathematically unbiased is especially important for situations where the block structure is less evident. Figure 8 shows a synthetic distance matrix that has a slight inclination for the division of two scenes (highlighted by the yellow rectangles) and only *H _{nrm}* allows the identification of the subtle inclination for the division.

# Optimal division using an efficient optimization scheme

Given a cost function, we use dynamic programming to find the division that globally minimizes the cost. Dynamic programming is a method to solve complex problems by breaking them down into subproblems and conserving their solutions. Using dynamic programming with our normalized cost function is especially difficult since the function examines the entire block diagonal even when inspecting the division of a sub-problem (because of the denominator). To overcome this, we developed a unique scheme for solving the optimization problem, which is described in the paper.

# IBM Watson Media Video Enrichment

Although the task of video scene detection has been studied for several years, IBM is one of the first to bring a solution to market, as part of the IBM Watson Media Video Enrichment product. In Video Enrichment, a key component for adding insights and searchable metadata for videos is the video scene detection technology which divides the video into semantic sections. For these scenes, the visual and audio concepts can be aggregated to create an electronic table of contents for the video.

Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection

Daniel Rotman, Dror Porat, Gal Ashour and Udi Barzelay

### Linear-Complexity Earth Mover’s Distance Approximations for Efficient Similarity Search

Measuring distances between entities is a fundamental function that powers many mission-critical applications across various sectors. Search, data mining, data discovery, clustering and classification all rely on discriminative distance measures between data objects. In addition to being accurate, today’s vast databases demand distance measures that are also scalable, i.e. can be computed efficiently even across […]

### AI Enables Foreign Language Study Abroad, No Travel Required

A student learning to speak Mandarin wanders into a marketplace on the streets of China on a sunny summer afternoon. Before long, two vendors approach and begin hawking products, trying to outbid one another. The student must now grasp what’s being said and formulate an appropriate response using proper pronunciation to avoid being misunderstood. It’s […]

### Text2Scene: Generating Compositional Scenes from Textual Descriptions

At CVPR 2019, IBM researchers introduce techniques to interpret visually descriptive language to generate compositional scene representations from textual descriptions.