MoViNet

Mobile Video Network, or MoViNet, is a computation- and memory-efficient video network designed for online inference on streaming video. It combines three main elements that improve efficiency while lowering the peak memory usage of 3D Convolutional Neural Networks (CNNs).

Neural Architecture Search

The first step in developing MoViNet was to define a video network search space and apply neural architecture search to it. The goal was to generate efficient and diverse 3D CNN architectures. The search space covers choices of convolutions, pooling, activation functions, and the number of filter channels. A high-performance computing cluster was used to search this space for efficient architectures.
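To make the idea of a search space concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the option lists in `SEARCH_SPACE`, the helper names, and the cost proxy are hypothetical, and plain random sampling stands in for the actual search algorithm, which this article does not describe. A real search would also score each candidate for accuracy; this sketch only filters candidates by a toy compute budget.

```python
import random
from typing import Dict, List

# Hypothetical per-layer choices (not the actual MoViNet search space).
SEARCH_SPACE: Dict[str, list] = {
    "conv_kernel": [(1, 3, 3), (3, 3, 3), (5, 3, 3)],  # (time, height, width)
    "pooling": ["none", "avg", "max"],
    "activation": ["relu", "swish"],
    "channels": [16, 24, 48],
}


def sample_architecture(rng: random.Random, num_layers: int) -> List[dict]:
    """Pick one option per dimension for every layer."""
    return [
        {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        for _ in range(num_layers)
    ]


def estimate_cost(arch: List[dict], in_channels: int = 3) -> int:
    """Toy cost proxy: convolution multiply-accumulates per output position."""
    cost = 0
    c_in = in_channels
    for layer in arch:
        kt, kh, kw = layer["conv_kernel"]
        cost += kt * kh * kw * c_in * layer["channels"]
        c_in = layer["channels"]  # this layer's output feeds the next layer
    return cost


def random_search(num_candidates: int, num_layers: int, budget: int,
                  seed: int = 0) -> List[List[dict]]:
    """Sample candidates and keep those whose estimated cost fits the budget."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng, num_layers)
                  for _ in range(num_candidates)]
    return [arch for arch in candidates if estimate_cost(arch) <= budget]
```

Calling `random_search(50, 3, 40_000)` returns the sampled three-layer candidates whose estimated cost is within the budget. Varying the budget is one simple way to obtain a range of architectures of different sizes.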

Stream Buffer Technique

The second element is a stream buffer technique that decouples memory from video clip duration. It allows 3D CNNs to process streaming video sequences of varying length for both training and inference. With this technique, a streaming video of arbitrary length can be embedded using a small, constant memory footprint.
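The buffering idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the MoViNet implementation: each frame is reduced to a single scalar feature, the temporal convolution is causal (it looks only at the current and earlier frames), and the names `causal_temporal_conv` and `stream` are hypothetical. The buffer carries the last `len(kernel) - 1` frames across clip boundaries, so its size does not depend on the length of the video.

```python
from typing import List, Sequence, Tuple


def causal_temporal_conv(
    frames: Sequence[float],
    kernel: Sequence[float],
    buffer: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Convolve one clip over time, using `buffer` as left-side context.

    `buffer` holds the last len(kernel) - 1 frames seen before this clip.
    Returns the clip's outputs and the buffer to pass to the next clip.
    """
    k = len(kernel)
    padded = list(buffer) + list(frames)
    outputs = [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]
    # Keep only the most recent k - 1 frames: constant memory per layer.
    new_buffer = padded[len(padded) - (k - 1):]
    return outputs, new_buffer


def stream(video: Sequence[float], kernel: Sequence[float],
           clip_len: int) -> List[float]:
    """Process a video clip by clip, carrying the buffer between clips."""
    buffer = [0.0] * (len(kernel) - 1)  # zero context before the first frame
    outputs: List[float] = []
    for start in range(0, len(video), clip_len):
        clip = video[start:start + clip_len]
        clip_out, buffer = causal_temporal_conv(clip, kernel, buffer)
        outputs.extend(clip_out)
    return outputs
```

For any `video` and `kernel`, `stream(video, kernel, 2)` and `stream(video, kernel, len(video))` return the same list: processing two frames at a time gives the same result as processing the whole video at once, while only `len(kernel) - 1` past frames are ever held in the buffer.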

Ensembling Technique

Lastly, an ensembling technique is used to improve accuracy further without sacrificing efficiency. The technique is simple: the predictions of multiple independently trained models are combined into a single prediction, which yields a more robust and accurate system than any one model alone.
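One common way to combine classifiers is to average their raw class scores (logits) and then apply a softmax. The sketch below shows that scheme. It is an assumption about how the predictions are merged, since this article does not specify the exact rule, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average the logits of several models class by class, then softmax."""
    n_models = len(logits_per_model)
    mean_logits = [sum(col) / n_models for col in zip(*logits_per_model)]
    return softmax(mean_logits)
```

For example, `ensemble_predict([[3.0, 0.0], [1.0, 0.0]])` averages the logits to `[2.0, 0.0]` before the softmax, so the ensemble favors the first class with less confidence than the first model alone and more than the second.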

Together, these three techniques improve efficiency while maintaining high prediction quality, which makes MoViNet well suited to video applications such as video classification and manipulation, action recognition, and more. Its efficiency is an advantage wherever computational resources are limited, including Internet of Things devices, edge devices, and mobile platforms.

In summary, MoViNet is a computation- and memory-efficient video network that enables streaming video analysis, video-based manipulation, and action recognition. It was developed by combining neural architecture search, a stream buffer technique, and an ensembling technique. It is suited to applications that require robust prediction quality with limited computational resources.