Gasp! Salesforce project transmogrifAIs data into predictions

Cloudy CRM outfit Salesforce has open-sourced its automated machine learning library, TransmogrifAI, to assist others in building customer-specific ML models.

The library is pitched at developers who want to train machine learning models with little manual intervention, or to build machine learning workflows. It was designed to build and deploy large numbers of personalised models from heterogeneous structured data – most useful to those working in large organisations with divergent but organised customer data.

TransmogrifAI is written in Scala and built on top of the cluster-computing framework Apache Spark, because Spark offers primitives to join and aggregate distributed records, which is necessary to handle the wide variation in the amount of data available from Salesforce’s individual customers. Spark also includes a streaming component, which allows TransmogrifAI to serve finished models in batch or streaming mode. The library uses the transformer and estimator abstractions of SparkML pipelines and adds a third: features.

A feature in TransmogrifAI is described as a type-safe pointer to a column in a DataFrame, which contains the column’s name, the type of data stored there, and information on how it was derived. It is shareable and serves as the main primitive developers interact with, so that defining and manipulating features feels much like working with variables in ordinary code.

From data to model in four steps

The purpose of TransmogrifAI is to automate basic machine learning steps like data cleansing, feature engineering, and model selection with as few lines of code as possible – an example can be found in an introductory blog post. The library’s workflow consists of five elements: four sequential steps, plus an optimisation layer that underpins them. First comes feature inference, which basically means preparing the available data, extracting information useful for making predictions, and putting the predictors along with the response signal into strongly typed features. The typing part is important because it allows checks at every step to find errors early, and increases transparency around inputs and outputs.
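As a rough illustration of what inferring typed features from raw data can look like, the sketch below guesses a type for each column of string-valued records and marks one column as the response. The inference rules, function names, and sample records are assumptions made for this example; the type names only loosely echo the kind of hierarchy the library describes.

```python
from typing import Dict, List, Optional, Tuple


def infer_type(values: List[Optional[str]]) -> str:
    """Guess a feature type from raw strings, ignoring missing values."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "Text"
    if all(v.lower() in ("true", "false") for v in present):
        return "Binary"
    try:
        for v in present:
            int(v)
        return "Integral"
    except ValueError:
        pass
    try:
        for v in present:
            float(v)
        return "Real"
    except ValueError:
        return "Text"


def infer_features(rows: List[Dict[str, str]],
                   response: str) -> Dict[str, Tuple[str, str]]:
    """Map each column to (inferred type, role), failing early on bad input."""
    if not rows:
        raise ValueError("no records to infer features from")
    columns = list(rows[0].keys())
    if response not in columns:
        raise KeyError(f"response column {response!r} not found")
    schema = {}
    for col in columns:
        kind = infer_type([row.get(col) for row in rows])
        role = "response" if col == response else "predictor"
        schema[col] = (kind, role)
    return schema


# Made-up records for illustration.
rows = [
    {"age": "34", "spend": "120.5", "plan": "pro", "churned": "false"},
    {"age": "51", "spend": "", "plan": "basic", "churned": "true"},
]
schema = infer_features(rows, response="churned")
for column, (kind, role) in schema.items():
    print(f"{column}: {kind} ({role})")
```

With a schema like this in hand, later stages can refuse a mismatched input (say, a text column passed to a numeric transformation) before any training starts.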

As a second step, the library automatically transforms typed features into numeric vectors. The project supports a hierarchy of feature types, so more specific sub-types can be handled differently where that helps. Afterwards it removes features with little predictive power to reduce the data’s dimensionality. On the data that is left, TransmogrifAI tries out different machine learning algorithms and picks the one that produces the best, most reliable model. Every component is coupled to a hyperparameter optimisation layer, which automatically tunes the parameters available in the steps mentioned (sampling rates, number of binary variables, etc.).
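The vectorise, prune, and select sequence can be sketched in a few dozen lines of plain Python. This is a toy under stated assumptions, not the library's method: one-hot encoding stands in for vectorisation, an absolute Pearson correlation threshold stands in for feature validation, and a three-fold comparison of two trivial classifiers stands in for model selection. All names and data are invented, and hyperparameter tuning is left out.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence

Row = List[float]
Model = Callable[[Row], float]


def one_hot(values: Sequence[str]) -> Dict[str, List[float]]:
    """Expand a categorical column into one 0/1 column per category."""
    return {f"is_{c}": [1.0 if v == c else 0.0 for v in values]
            for c in sorted(set(values))}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def prune(columns, label, min_abs_corr=0.3):
    """Keep only columns whose correlation with the label clears the bar."""
    return {name: col for name, col in columns.items()
            if abs(pearson(col, label)) >= min_abs_corr}


def fit_majority(X: List[Row], y: List[float]) -> Model:
    """Baseline: always predict the most common training label."""
    majority = 1.0 if sum(y) * 2 >= len(y) else 0.0
    return lambda row: majority


def fit_centroid(X: List[Row], y: List[float]) -> Model:
    """Nearest-centroid classifier for labels 0.0 and 1.0."""
    def centroid(cls):
        rows = [r for r, t in zip(X, y) if t == cls]
        return [sum(c) / len(rows) for c in zip(*rows)] if rows else None
    c0, c1 = centroid(0.0), centroid(1.0)

    def predict(row: Row) -> float:
        if c0 is None:
            return 1.0
        if c1 is None:
            return 0.0
        d0 = sum((a - b) ** 2 for a, b in zip(row, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(row, c1))
        return 1.0 if d1 < d0 else 0.0
    return predict


def cv_accuracy(fit, X: List[Row], y: List[float], folds: int = 3) -> float:
    """Accuracy over interleaved folds; every row is tested exactly once."""
    hits = 0
    for k in range(folds):
        train = [i for i in range(len(X)) if i % folds != k]
        test = [i for i in range(len(X)) if i % folds == k]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits += sum(model(X[i]) == y[i] for i in test)
    return hits / len(X)


# Made-up data: six customers, one categorical and two numeric columns.
plan = ["pro", "basic", "pro", "basic", "pro", "basic"]
spend = [120.0, 20.0, 150.0, 35.0, 110.0, 25.0]
constant = [1.0] * 6                      # carries no signal at all
churned = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # response

columns = {"spend": spend, "constant": constant, **one_hot(plan)}
kept = prune(columns, churned)            # drops "constant"
X = [list(row) for row in zip(*kept.values())]

candidates = {"majority": fit_majority, "centroid": fit_centroid}
scores = {name: cv_accuracy(fit, X, churned) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(sorted(kept), scores, best)
```

On this toy data the constant column is discarded and the nearest-centroid candidate beats the majority baseline, so it is the one selected.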

The project’s first official version is 0.3.4, which seems a far cry from a finished release. But since it’s supposed to be part of Salesforce’s Einstein platform services, which incorporate deep learning into the company’s CRM offering, it should be somewhat stable. Examples to get started or just give it a try can be found in the TransmogrifAI wiki. The included estimators are parameterised and can therefore be set and tuned if needed. Options to specify custom transformers and estimators are in place as well.

TransmogrifAI is licensed under the BSD 3-Clause licence.