Netflix uses a variety of tools to do data analysis. One of the big ways that data scientists and engineers at Netflix interact with their data is through Jupyter notebooks. In addition to providing execution environments to users, Netflix invests in various parts of the Jupyter ecosystem and tooling. They are “reimagining what a notebook can be, who can use it, and what they can do with it.”
Netflix aims to provide personalized content to their 130 million viewers. For this every day more than 1 trillion events are written into a streaming ingestion pipeline. To support this, they’ve built an industry-leading data platform which is flexible, powerful, and complex. There are so many diverse users of this platform, such as analytics engineers, data engineers, and data scientists, requiring different sets of tools and languages. To help the platform scale, they wanted to minimize the number of tools and the solution to this was the open-source tool: Jupyter notebooks.
Why Jupyter notebook is so compelling for Netflix?
These are the functionalities provided by notebook that benefits Netflix’s data scientists and engineers:
- Standard messaging API: The Jupyter protocol provides a standard messaging API with the kernels that act as computational engines. It separates where the content is written and where the content is executed. This makes it language agnostic.
- Editable file format: It provides an editable file format that stores the code and results together.
- Web-based UI: It is web-based which helps interactively writing and running code as well as visualizing outputs.
How Netflix uses Jupyter Notebooks?
The following are some of the use cases they use Jupyter notebooks for:
Data access: Notebooks were first introduced for workflows and their adoption grew among the data scientists. Seeing this, Netflix decided to leverage its versatility and architecture for general data access. Notebooks provide an user-friendly interface for interactively running code, exploring the outputs, and visualizing data all from a single cloud-based development environment.
Notebook Templates: They introduced parameterized notebooks, which allow the use of parameters in the code and take values as input at runtime. These templates help:
- Data scientists to run an experiment with different coefficients and summarize the results
- Data engineers to execute data quality audits
- Data analysts to share prepared queries and visualizations
- Software engineers to email the results of a troubleshooting script
Scheduling notebooks: Next they are using notebooks for creating a unifying layer for scheduling workflows. Notebooks are used for interactive work and allows smooth move to scheduling that work to run recurrently. Many users create an entire workflow in a notebook and just copy/paste it into separate files for scheduling when they’re ready to deploy it.
Notebook infrastructure: The three fundamental components of the infrastructure are: storage, compute, and interface.
- Storage: The Netflix Data Platform is made of Amazon S3 and EFS for cloud storage, which notebooks treat as virtual filesystems. Each user has a home directory on EFS containing a personal workspace for notebooks. This workspace is for storing any notebook created or uploaded by a user. When a user launches a notebook interactively, all the reading and writing happens at the workspace.
- Compute: All the jobs on the data platform run on containers including queries, pipelines and notebooks. A container with reasonable default resources is provisioned when a user launches a notebook. Users can request more resources if they find that the provided resources are not enough. A unified execution environment with a prepared container image is provided, which has common libraries and an array of default kernels preinstalled. The orchestration and environments are managed with Titus, their container management platform.
- Interface: They are using nteract, a React-based frontend for Jupyter notebooks, which emphasizes simplicity and composability as core design principles.They’re also introducing native support for parameterization, which makes it easier to schedule notebooks and create reusable templates.
Netflix is planning to make investments in both the frontend and backend to improve the overall notebook experience. This year they are also sponsoring JupyterCon.
To read more about how Jupyter is offering value to Netflix read Netflix’s original post at Medium.