In the wake of COVID, video streaming is no longer just a fun diversion. Organizations are depending on it to keep their workforce moving… and parents are counting on it to keep their kids from going into all-out rebellion mode during lockdowns. We’re all familiar with the hiccups that come with streaming platforms, so the application of Deep Learning to video encoding and streaming promises to be an interesting frontier.
Convolutional Neural Networks (CNNs) are a form of Deep Learning that is commonly used in image recognition. Deep Learning is machine learning loosely modeled on the human brain: it stacks multiple layers of ‘neurons’ joined by weighted connections. Each neuron combines features from the dataset and is activated for prediction through functions such as the sigmoid, threshold and rectifier functions.
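To make the three activation functions a little more concrete, here is a minimal Python sketch. It is an illustration only, not code from the paper: the `neuron` helper, the input values, and the weights are all made up for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def threshold(x: float) -> float:
    """Step function: output 1.0 only when the input is non-negative."""
    return 1.0 if x >= 0 else 0.0


def rectifier(x: float) -> float:
    """ReLU: pass positive inputs through unchanged, clip negatives to zero."""
    return max(0.0, x)


def neuron(inputs, weights, bias, activation):
    """Hypothetical single neuron: weighted sum plus bias, then activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)


# Made-up feature values and weights; the weighted sum comes to about -0.4.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

sigmoid_out = neuron(inputs, weights, bias, sigmoid)
threshold_out = neuron(inputs, weights, bias, threshold)
rectifier_out = neuron(inputs, weights, bias, rectifier)
```

With a negative weighted sum, the threshold and rectifier functions both output zero, while the sigmoid returns a value a little below 0.5. The choice of activation function decides how strongly a neuron "fires" for the same input.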
According to researchers from streaming technology company Bitmovin and Athena Christian Doppler Pilot Laboratory (associated with the University of Klagenfurt), CNNs may offer a solution to the strained bandwidth and other performance issues that currently create problems for viewers and streaming companies alike.
In a recent paper presented at the IEEE International Conference on Visual Communications and Image Processing (VCIP), Christian Timmerer, Ekrem Cetinkaya, Hadi Amirpour, and Mohammed Ghanbari proposed using CNNs to speed up the encoding of multiple representations. Videos are stored in several versions, or ‘representations’, of differing sizes and qualities, which lets the player choose the most suitable one for the current network conditions.
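As a rough illustration of how a player might pick among representations, here is a small Python sketch. Everything in it is an assumption made for the example: the ladder of resolutions and bitrates, the `choose_representation` function, and the 0.8 safety margin. Real players use more elaborate adaptation logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # average encoded bitrate


# Hypothetical set of representations, lowest to highest quality.
LADDER = [
    Representation(360, 800),
    Representation(720, 2500),
    Representation(1080, 5000),
    Representation(2160, 16000),
]


def choose_representation(ladder, throughput_kbps, safety=0.8):
    """Pick the highest-bitrate representation that fits the measured
    throughput, keeping a safety margin (an assumed value) in reserve."""
    budget = throughput_kbps * safety
    affordable = [r for r in ladder if r.bitrate_kbps <= budget]
    if not affordable:
        # Nothing fits: fall back to the cheapest representation.
        return min(ladder, key=lambda r: r.bitrate_kbps)
    return max(affordable, key=lambda r: r.bitrate_kbps)


choice = choose_representation(LADDER, throughput_kbps=4000)
```

With 4,000 kbps of measured throughput and the assumed margin, the 720-pixel version is the best fit. The key point for the encoder is that every entry in the ladder has to be encoded from the same source, which is where the cost multiplies.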
According to the paper, the most common approach for delivering video over the Internet today – HTTP Adaptive Streaming (HAS) – requires the same content to be encoded at many different quality levels, which is costly for content providers and can mean a poor experience for viewers. Fast multirate encoding approaches leveraging CNNs, they say, have the potential to accelerate this process.
Waiting for the Slowest Person at the Table to Finish Their Dinner
According to Timmerer et al., most existing methods cannot accelerate the encoding process during parallel encoding because they tend to use the highest quality representation as the reference encoding (the encoding whose decisions are reused to guide the others). The other encodes are therefore held up until the highest quality representation is completed, which creates many of the encoding bottlenecks we experience.
Essentially, it’s like you’re telling everyone at the dinner table that no one can leave until the slowest eater finishes their plate. How do you fix the problem? According to Timmerer et al., you time the dinner party based on the first person to finish, rather than the last.
Their proposed method, FaME-ML, uses CNNs to predict the split decisions on the subdivisions of frames – square blocks referred to as coding tree units (CTUs) – for multirate encoding. Since the lowest quality representation has the minimum time-complexity, it is chosen as the reference encoding. This stands in contrast to current techniques, in which the representation with maximum time-complexity is chosen as the reference encoding.
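The effect of the reference choice on parallel encoding can be sketched with a toy timing model in Python. This is not the paper's model: the encode times, the `parallel_wall_clock` function, and the assumption that every dependent encode shrinks to the same fraction of its standalone time are all hypothetical, chosen only to show why waiting on the slowest encode hurts.

```python
def parallel_wall_clock(encode_times, reference_index, remaining_fraction):
    """Toy model: the reference encode runs first, then all other encodes
    run in parallel, each sped up by reusing the reference's decisions."""
    reference_time = encode_times[reference_index]
    dependent_times = [
        t * remaining_fraction
        for i, t in enumerate(encode_times)
        if i != reference_index
    ]
    return reference_time + max(dependent_times, default=0.0)


# Hypothetical standalone encode times, ordered lowest to highest quality.
ENCODE_TIMES = [20, 35, 60, 100]
# Assumed: a guided encode takes 60% of its standalone time in both cases.
REMAINING_FRACTION = 0.6

# No reference at all: everything runs in parallel, the slowest one dominates.
baseline = max(ENCODE_TIMES)

highest_as_reference = parallel_wall_clock(
    ENCODE_TIMES, len(ENCODE_TIMES) - 1, REMAINING_FRACTION
)
lowest_as_reference = parallel_wall_clock(ENCODE_TIMES, 0, REMAINING_FRACTION)
```

Under these made-up numbers, using the highest quality encode as the reference takes about 136 time units, worse than the unguided baseline of 100, while using the lowest quality encode takes about 80. In practice the speed-up is unlikely to be identical in both cases, so the figures only illustrate the direction of the effect.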
According to the paper, the CNN-based FaME-ML achieved ROC-AUC scores of 0.79, 0.81, and 0.77 for depth 0 and depth 1 classifications. Since a ROC-AUC of 0.5 corresponds to random guessing and 1.0 to a perfect classifier, these scores indicate that the model predicts split decisions considerably better than chance. Additionally, FaME-ML achieved a reduction of around 41% in overall time-complexity in parallel encoding.
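For readers unfamiliar with ROC-AUC, the score can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The Python sketch below computes it that way. The labels and scores are invented for illustration (1 standing for "split this block", 0 for "do not split") and have nothing to do with the paper's data.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = "split", 0 = "do not split".
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]

auc = roc_auc(labels, scores)
```

Here five of the six positive/negative pairs are ranked correctly, giving a score of about 0.83. A classifier that assigns every example the same score would land at exactly 0.5.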