Yesterday, the Netflix team announced that it has open-sourced Metaflow, a Python library that helps scientists and engineers build and manage real-life data science projects.
The Netflix team writes, “Over the past two years, Metaflow has been used internally at Netflix to build and manage hundreds of data-science projects from natural language processing to operations research.”
Metaflow was developed by Netflix to boost the productivity of data scientists who work on a wide variety of projects, from classical statistics to deep learning. It provides a unified API to the infrastructure stack required to execute data science projects, from prototype to production.
Metaflow integrates with Netflix’s data science infrastructure stack
Models are only a small part of an end-to-end data science project. Production-grade projects rely on a thick stack of infrastructure. At the minimum, projects need data and a way to perform computation on it. In a business environment like Netflix’s, a typical data science project touches all the layers of the stack depicted below:
Source: Netflix website
Data is accessed from a data warehouse, which can be a folder of files, a database, or a multi-petabyte data lake. The modeling code that crunches the data is executed in a compute environment, and a job scheduler orchestrates multiple units of work.
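As a rough illustration of what "orchestrating multiple units of work" means, the sketch below runs three toy steps in dependency order using only the Python standard library. The step names (`load_data`, `train_model`, `report`) and the `run_pipeline` helper are hypothetical and chosen for this example; this is not how Metaflow or Netflix's scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load_data(results):
    # Stand-in for reading from a data warehouse.
    return [3, 1, 2]


def train_model(results):
    # Stand-in for modeling code: the "model" is just the mean of the data.
    data = results["load_data"]
    return sum(data) / len(data)


def report(results):
    return f"mean={results['train_model']:.1f}"


# Hypothetical pipeline: each step name maps to the steps it depends on.
STEPS = {"load_data": load_data, "train_model": train_model, "report": report}
DEPENDS_ON = {
    "load_data": set(),
    "train_model": {"load_data"},
    "report": {"train_model"},
}


def run_pipeline(steps, depends_on):
    """Run every step after its dependencies and collect the results."""
    results = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        results[name] = steps[name](results)
    return order, results


order, results = run_pipeline(STEPS, DEPENDS_ON)
print(order)
print(results["report"])
```

A production scheduler would also handle retries, resource allocation, and running independent steps in parallel, none of which this sketch attempts.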
Then the team architects the code to be executed by structuring it as an object hierarchy, Python modules, or packages. They version the code, the input data, and the ML models produced. After the model has been deployed to production, the team faces pertinent questions about model operations, for example:
- How to keep the code running reliably in production?
- How to monitor its performance?
- How to deploy new versions of the code to run in parallel with the previous version?
Additionally, at the very top of the stack, there are other questions, such as how to produce features for your models or how to develop models in the first place using off-the-shelf libraries.
Metaflow provides a unified approach to navigating this stack. It is more prescriptive about the lower levels of the stack but less opinionated about the actual data science at the top. Developers can use Metaflow with their favorite machine learning or data science libraries, such as PyTorch, TensorFlow, or scikit-learn.
Metaflow allows you to write models and business logic as idiomatic Python code. Internally, Metaflow leverages existing infrastructure when feasible. Its core value proposition is an integrated, full-stack, human-centric API rather than a reinvention of the stack itself.
Metaflow on Amazon Web Services
Metaflow is a cloud-native framework that leverages the elasticity of the cloud by design, for both compute and storage. Netflix is one of the largest users of Amazon Web Services (AWS) and has accumulated plenty of operational experience and expertise in dealing with the cloud. For this open-source release, Netflix partnered with AWS to provide a seamless integration between Metaflow and various AWS services.
Metaflow comes with a built-in capability to snapshot all code and data in Amazon S3 automatically, a key value proposition of the internal Metaflow setup. This gives data science teams a comprehensive solution for versioning and experiment tracking without any user intervention, which is at the core of any production-grade machine learning infrastructure. In addition, Metaflow comes bundled with a high-performance S3 client, which can load data at up to 10 Gbps.
Additionally, Metaflow provides a first-class local development experience. It allows data scientists to develop and test code quickly on laptops, just like any other Python script. If the workflow supports parallelism, Metaflow takes advantage of all CPU cores available on the development machine.
How is Metaflow different from existing Python frameworks?
On Hacker News, developers are discussing how Metaflow differs from existing tools and workflows. One of them comments, “I don’t like to criticise new frameworks / tools without first understanding them, but I like to know what some key differences are without the marketing/PR fluff before giving one a go.
For instance, this tutorial example here does not look substantially different to what I could achieve just as easily in R, or other Python data wrangling frameworks.
Is the main feature the fact I can quickly put my workflows into the cloud?”
Someone from the Metaflow team responds in the thread:
“Here are some key features:
– Metaflow snapshots your code, data, and dependencies automatically in a content-addressed datastore, which is typically backed by S3, although local filesystem is supported too. This allows you to resume workflows, reproduce past results, and inspect anything about the workflow e.g. in a notebook. This is a core feature of Metaflow.
– Metaflow is designed to work well with a cloud backend. We support AWS today but technically other clouds could be supported too. There’s quite a bit of engineering that has gone into building this integration. For instance, using the Metaflow’s built-in S3 client, you can pull over 10Gbps, which is more than you can get with e.g. aws CLI today easily.
– We have spent time and effort in keeping the API surface area clean and highly usable. YMMV but it has been an appealing feature to many users this far.”
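The "content-addressed datastore" in the first point of the reply can be illustrated with a minimal sketch. It assumes an in-memory dictionary in place of S3 or a local filesystem, JSON-serializable artifacts, and SHA-256 as the hash. The `ContentAddressedStore` class and `snapshot_run` helper are hypothetical names invented for this example and are not part of Metaflow's API or a description of its internals.

```python
import hashlib
import json


class ContentAddressedStore:
    """Toy in-memory store; a real backend would be S3 or a local directory."""

    def __init__(self):
        self._blobs = {}

    def put(self, obj):
        # Serialize deterministically so equal content always yields equal bytes.
        blob = json.dumps(obj, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(blob).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(key, blob)
        return key

    def get(self, key):
        return json.loads(self._blobs[key].decode("utf-8"))

    def __len__(self):
        return len(self._blobs)


def snapshot_run(store, artifacts):
    """Record a run as a mapping from artifact name to content key."""
    return {name: store.put(value) for name, value in artifacts.items()}


store = ContentAddressedStore()
run_1 = snapshot_run(store, {"train_rows": [1, 2, 3], "learning_rate": 0.1})
run_2 = snapshot_run(store, {"train_rows": [1, 2, 3], "learning_rate": 0.2})

# Both runs share the same training data, so only three blobs are stored.
print(len(store))
print(store.get(run_1["train_rows"]))
```

Because each run keeps only a small map of names to keys, unchanged artifacts are shared between runs, and any past run can be inspected later by looking up its keys.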