Netflix, the video-on-demand streaming company, is seeing its user base grow every day, along with the viewership of its TV shows. It is constantly striving to provide an enriching experience to its viewers. To keep pace with ever-increasing expectations of user experience, Netflix is introducing a collection of tools and algorithms to make its content more relevant to its audience.
AVA (Aesthetic Visual Analysis) analyses large volumes of images obtained from the video frames of a TV show in order to pick a title image for that show. Netflix understands that a visually appealing title image plays an important role in helping viewers find new shows and movies to watch.
How title images are normally selected
Usually, content editors have to go through tens of thousands of video frames for a show to select a good title image. To give you an idea of the effort required: a single one-hour episode of ‘Stranger Things’ consists of nearly 86,000 static video frames. Imagine painstakingly sifting through each of these frames to find the perfect title image, one that will not only connect with viewers but also give them a gist of the storyline.
To top it off, the number of frames can go up to a million, depending on the number of episodes in a show. Screening the frames manually is labor-intensive and, at this scale, close to impossible.
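To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a frame rate of 24 frames per second, which the article does not state but which is consistent with the "nearly 86,000" figure above; `ASSUMED_FPS`, `frame_count` and `episodes_to_reach` are illustrative names, not part of any Netflix tool.

```python
import math

ASSUMED_FPS = 24  # assumption: the article gives no frame rate


def frame_count(minutes: float, fps: int = ASSUMED_FPS) -> int:
    """Number of still frames in a video of the given length."""
    return round(minutes * 60 * fps)


def episodes_to_reach(target_frames: int, minutes_per_episode: float = 60) -> int:
    """Smallest number of episodes whose combined frames reach target_frames."""
    return math.ceil(target_frames / frame_count(minutes_per_episode))


print(frame_count(60))               # frames in a one-hour episode
print(episodes_to_reach(1_000_000))  # episodes needed to pass a million frames
```

Under this assumption a one-hour episode has 86,400 frames, and a show passes the million-frame mark at around 12 one-hour episodes.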
Additionally, the editors choosing the image stills need in-depth knowledge of the source content that the stills are meant to represent. With Netflix's catalog of shows growing rapidly, expecting editors to surface meaningful images from every video becomes a very challenging ask.
Enter AVA, which uses image classification algorithms to surface the right image at the right time.
What is AVA?
The ever-growing number of images on the internet has made them increasingly difficult to process and classify. To address this concern, a research team from the Universitat Autònoma de Barcelona, Spain, in collaboration with Xerox Corporation, developed a method called Aesthetic Visual Analysis (AVA) as a research project. The project contains a vast database of over 250,000 images combined with metadata such as
- aesthetic scores for images
- semantic labels covering more than 60 image categories, along with many other characteristics.
AVA rates images using statistics such as the mean score, variance and standard deviation of their ratings. Based on the distributions of these statistics, the researchers assess the semantic challenges and choose the right images for the database. AVA primarily addresses the need for extensive benchmarking and makes more images available for training. This, in turn, helps in identifying images with better aesthetic appeal. Computation can also be optimised significantly to reduce the load on hardware.
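As a minimal sketch of the statistics mentioned above, the snippet below computes the mean, variance and standard deviation of one image's ratings. It assumes the ratings are stored as a histogram of votes on a 1 to 10 scale; the storage format, the scale and the function name `score_statistics` are assumptions made for illustration and are not taken from the article.

```python
from math import sqrt


def score_statistics(vote_counts):
    """Return (mean, variance, std_dev) for a histogram of ratings.

    vote_counts[i] is the number of votes for the score i + 1
    (assumed layout, for illustration only).
    """
    total = sum(vote_counts)
    if total == 0:
        raise ValueError("an image needs at least one vote")
    mean = sum((i + 1) * n for i, n in enumerate(vote_counts)) / total
    # Population variance: average squared distance of each vote from the mean.
    variance = sum(n * ((i + 1) - mean) ** 2 for i, n in enumerate(vote_counts)) / total
    return mean, variance, sqrt(variance)


# Hypothetical image: ten votes, clustered around a score of 5.
votes = [0, 0, 1, 2, 4, 2, 1, 0, 0, 0]
mean, variance, std_dev = score_statistics(votes)
print(mean, variance, std_dev)
```

A low variance suggests that raters broadly agree on an image, while a high variance flags an image that divides opinion.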
For more detail, you can read the research paper.
The ‘AVA’ approach used at Netflix
The process takes place in three steps:
- AVA starts by analysing images obtained through the process of frame annotation. This involves processing and annotating many different variables on every individual frame of video, to best derive what the frame contains and to understand its importance to the story. To keep pace with the growing catalog of content, Netflix uses the Archer framework to process videos more efficiently. Archer splits the video into small chunks so that they can be processed in parallel.
- After the frames are obtained, they are run through a series of image recognition algorithms to build metadata. The metadata is further classified as visual, contextual and composition metadata. In brief:
- Visual metadata: captures brightness, sharpness and color.
- Contextual metadata: a combination of elements used to derive meaning from the actions or movement of the actors, objects and camera in the frame, e.g. face detection, motion estimation, object detection and camera shot identification.
- Composition metadata: captures intricate image details based on core principles of photography, cinematography and visual aesthetic design, such as depth of field and symmetry.
- Choosing the right Picture!
The ‘best’ image is chosen by considering three important aspects: the lead actors, visual range and sensitivity filters.
- Emphasis is given first to the lead actors of the show, since they make a visual impact. To identify the key character for a given episode, AVA uses a combination of face clustering and actor recognition to separate main characters from secondary characters or extras.
- Next comes the diversity of the images in the video frames, which covers camera positions and image details such as brightness, color and contrast, to name a few. With these attributes in hand, frames can be grouped by similarity. This helps in developing image support vectors. The vectors primarily assist in designing an image diversity index, with which all the relevant images collected for an episode, or even a movie, can be scored on visual appeal.
- Frames containing sensitive elements such as violence, nudity or advertisements are given low priority in the image vectors, so that they are screened out during selection.
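The three steps above can be sketched as a toy pipeline. This is an illustration under stated assumptions, not Netflix's implementation: the `Frame` fields stand in for the output of real detectors, `score` is a made-up appeal formula, and keeping the best frame per camera shot is a simple substitute for the image diversity index described above. It does not use the Archer framework; `chunks` only mimics the idea of splitting work into independent pieces.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Frame:
    index: int
    pixels: List[float]   # grayscale values in [0, 1]; stand-in for a decoded image
    has_lead_actor: bool  # assumed output of face clustering / actor recognition
    is_sensitive: bool    # assumed output of a violence / nudity / advertisement detector
    shot_id: int          # assumed output of camera shot identification


def chunks(frames, size):
    """Split frames into fixed-size chunks that could be processed independently."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def visual_metadata(frame):
    """Brightness is the mean pixel value; contrast is the spread of pixel values."""
    return {"brightness": mean(frame.pixels), "contrast": pstdev(frame.pixels)}


def score(frame):
    """Hypothetical appeal score: favour mid-range brightness and high contrast."""
    meta = visual_metadata(frame)
    return (1 - abs(meta["brightness"] - 0.5) * 2) + meta["contrast"]


def pick_candidates(frames, chunk_size=2):
    """Return frame indices ranked by score, at most one per camera shot."""
    best_per_shot = {}
    for chunk in chunks(frames, chunk_size):
        for frame in chunk:
            # Screen out sensitive frames and frames without a lead actor.
            if frame.is_sensitive or not frame.has_lead_actor:
                continue
            s = score(frame)
            current = best_per_shot.get(frame.shot_id)
            # Diversity stand-in: keep only the best-scoring frame of each shot.
            if current is None or s > current[0]:
                best_per_shot[frame.shot_id] = (s, frame)
    ranked = sorted(best_per_shot.values(), key=lambda pair: pair[0], reverse=True)
    return [frame.index for _, frame in ranked]


frames = [
    Frame(0, [0.1, 0.1, 0.1, 0.1], True, False, 0),   # dark and flat
    Frame(1, [0.2, 0.8, 0.2, 0.8], True, False, 0),   # balanced, high contrast
    Frame(2, [0.0, 1.0, 0.0, 1.0], True, True, 1),    # sensitive content
    Frame(3, [0.4, 0.6, 0.4, 0.6], False, False, 1),  # no lead actor
    Frame(4, [0.3, 0.7, 0.3, 0.7], True, False, 2),   # balanced, medium contrast
]
print(pick_candidates(frames))
```

With this sample data the sketch keeps frames 1 and 4: frame 0 loses to frame 1 within the same shot, while frames 2 and 3 are dropped by the sensitivity and lead-actor checks.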
Source: Netflix Blog
What’s in this for Netflix and its users?
Netflix’s decision to use AVA will not only save manual labour, but also reduce the cost of having people sift through millions of images to find that one perfect shot.
This approach will help in obtaining meaningful images from video, freeing creative teams to invest their time in designing stunning artwork.
For users, a good title image helps establish a deeper connection to a show’s characters and storyline, improving their overall experience.
To understand the intricate workings of AVA, you can read the Netflix engineering team’s original post on this topic on Medium.