Salesforce Einstein team open sources TransmogrifAI, their automated machine learning library

Salesforce has open sourced TransmogrifAI, their end-to-end automated machine learning library for structured data. This library is currently used in production to help power Salesforce Einstein AI platform. TransmogrifAI enables data scientists at Salesforce to transform customer data into meaningful, actionable predictions. Now, they have open-sourced this project to enable other developers and data scientists to build machine learning solutions at scale, fast.

TransmogrifAI is built on Scala and SparkML that automates data cleansing, feature engineering, and model selection to arrive at a performant model. It encapsulates five main components of the machine learning process:

salesforce-open-sources-transmogrifai-automated-machine-learning-library-img-0

Source: Salesforce Engineering

Feature Inference:

TransmogrifAI allows users to specify a schema for their data to automatically extract the raw predictor and response signals as “Features”. In addition to allowing for user-specified types, TransmogrifAI also does inference of its own. The strongly-typed features allow developers to catch a majority of errors at compile-time rather than run-time.

Transmogrification or automated feature engineering:

TransmogrifAI comes with a myriad of techniques for all the supported feature types ranging from phone numbers, email addresses, geo-location to text data. It also optimizes the transformations to make it easier for machine learning algorithms to learn from the data.

Automated Feature Validation:

TransgmogrifAI has algorithms that perform automatic feature validation to remove features with little to no predictive power. These algorithms are useful when working with high dimensional and unknown data. They apply statistical tests based on feature types, and additionally, make use of feature lineage to detect and discard bias.

Automated Model Selection:

The TransmogrifAI Model Selector runs several different machine learning algorithms on the data and uses the average validation error to automatically choose the best one. It also automatically deals with the problem of imbalanced data by appropriately sampling the data and recalibrating predictions to match true priors.