Today the TensorFlow team announced the launch of TensorFlow Data Validation (TFDV), an open-source library that enables developers to understand, validate, and monitor their machine learning data at scale.
Why was TensorFlow Data Validation introduced?
When building machine learning algorithms, a lot of attention is paid to improving their performance. However, if the input data is wrong, all this optimization effort goes to waste. Understanding and validating a small amount of data is easy, and you can even do it manually. In the real world, however, this is rarely the case. Data in production is huge and often arrives continuously and in big chunks. This is why it is necessary to automate and scale the tasks of data analysis, validation, and monitoring.
What are some features of TFDV?
TFDV is part of the TensorFlow Extended (TFX) platform, a TensorFlow-based general-purpose machine learning platform. It is already being used by Google every day to analyze and validate petabytes of data.
TFDV provides the following features, among others:
- It can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.
- It includes tools such as Facets Overview, which provides a visualization of the computed statistics for easy browsing.
- It can automatically generate a data schema that describes expectations about the data, such as required values, ranges, and vocabularies. Since writing a schema by hand can be tedious for datasets with many features, TFDV provides a method to generate an initial version of the schema from the descriptive statistics.
- You can inspect the schema with the schema viewer.
- You can identify anomalies such as missing features, out-of-range values, or wrong feature types with anomaly detection.
- It provides an anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
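
To make the workflow in the list above more concrete, here is a minimal sketch in plain Python of the three steps it describes: computing descriptive statistics, inferring an initial schema from them, and checking new data against that schema. This is a toy illustration and not TFDV's implementation or API; the function names `compute_stats`, `infer_schema`, and `validate`, the dictionary layout, and the sample records are all assumptions made for this example.

```python
from collections import Counter


def compute_stats(rows):
    """Collect per-feature counts, types, numeric ranges, and string values."""
    features = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # treat None as a missing value
            s = features.setdefault(
                name,
                {"count": 0, "types": Counter(), "min": None, "max": None,
                 "values": set()},
            )
            s["count"] += 1
            s["types"][type(value).__name__] += 1
            if isinstance(value, (int, float)):
                s["min"] = value if s["min"] is None else min(s["min"], value)
                s["max"] = value if s["max"] is None else max(s["max"], value)
            else:
                s["values"].add(value)
    return {"num_examples": len(rows), "features": features}


def infer_schema(stats):
    """Derive an initial schema (type, presence, range or vocabulary)."""
    schema = {}
    for name, s in stats["features"].items():
        entry = {
            "type": s["types"].most_common(1)[0][0],
            "required": s["count"] == stats["num_examples"],
        }
        if s["min"] is not None:
            entry["range"] = (s["min"], s["max"])
        else:
            entry["vocabulary"] = set(s["values"])
        schema[name] = entry
    return schema


def validate(stats, schema):
    """Return {feature: [problems]} for statistics that violate the schema."""
    anomalies = {}
    for name, expected in schema.items():
        s = stats["features"].get(name)
        problems = []
        if s is None:
            if expected["required"]:
                problems.append("feature is missing")
        else:
            actual_type = s["types"].most_common(1)[0][0]
            if actual_type != expected["type"]:
                problems.append(
                    f"expected type {expected['type']}, got {actual_type}")
            if "range" in expected and s["min"] is not None:
                lo, hi = expected["range"]
                if s["min"] < lo or s["max"] > hi:
                    problems.append(
                        f"values outside expected range [{lo}, {hi}]")
            if "vocabulary" in expected:
                unseen = s["values"] - expected["vocabulary"]
                if unseen:
                    problems.append(f"unexpected values: {sorted(unseen)}")
        if problems:
            anomalies[name] = problems
    return anomalies


# Hypothetical sample data: the schema is inferred from training records,
# then serving records are checked against it.
train = [
    {"age": 34, "country": "US"},
    {"age": 51, "country": "DE"},
    {"age": 27, "country": "US"},
]
serving = [
    {"age": 29, "country": "US"},
    {"age": 140, "country": "FR"},
]

schema = infer_schema(compute_stats(train))
anomalies = validate(compute_stats(serving), schema)
for feature, problems in anomalies.items():
    print(feature, problems)
```

Here the serving data is flagged for an out-of-range `age` and a `country` value that never appeared in training. In TFDV itself, the corresponding steps are exposed through functions such as `tfdv.generate_statistics_from_csv`, `tfdv.infer_schema`, and `tfdv.validate_statistics`; consult the library's documentation for their exact signatures and options.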