INSIGHTS

Key Steps to build a Machine Learning Platform - Data Science Product Manager Perspective

Anshuk Pal Chaudhuri
September 2, 2019

Introduction

In the current industry, where data is the new oil, every organisation is trying to make sense of its data and use it to catalyse its business functions. Organisations are also trying to create the right methodology for processing data, so that it can be utilised in the right way as part of the overall cognitive journey.

The customer support industry follows technological advancement with keen interest. The customer functions for giant financial institutions have been early adopters of disruptive technologies like Artificial Intelligence.

This also implies hiring the right team of data scientists and engineers, and creating models and algorithms that are mostly specific to the type and structure of the data, considering the complexities of the domain-specific functions.

If you are already profitable with a human-driven process, you likely have the opportunity to build an incredible technology platform on top of that process. In this case, Machine Learning is an operating leverage force multiplier for both your core business as well as a new business as a Technology Platform.

This article aims to explain the key steps and takeaways that a product manager needs to consider in order to build a generic machine learning platform, i.e. one that automates the machine learning process. Such a platform would accelerate the complete machine learning lifecycle in a usable manner, helping the business team understand the nuances of the data science process in a form it can consume more easily.

Business Problem and Product Approach

As a Customer Success Manager, one would need to (to mention a few):

Manage customer intent (e.g. churn) to improve customer satisfaction

Identify issues wherever they arise, which is the first step to resolving them

Help customers reach the right representative in a shorter space of time, i.e. reduce call durations, repeat calls and call abandonment rates among unhappy callers

Determine the right time and number to call to reach potential prospects

As a product manager, one would try to create a platform/product to answer (to mention a few):

Customer 360 Degree View

Analyze and Understand Customer Pains

Predict the optimal approach to answer calls for a higher customer satisfaction rate

Understand which factor has the highest impact on increased call duration

Key Steps for a Product Manager and Design Thought Process

Understand the Data Analytics problem

The business problem, based on the need and understanding, needs to be converted into an analytics problem, which can primarily be classified into three types:

Descriptive analytics

Predictive analytics

Prescriptive analytics

In this article, we will restrict our discussion to descriptive and predictive analytics.

Descriptive analytics consists of features like:

Univariate analysis – A single variable is taken into consideration for analysis, which includes both summary statistics and visualization.

Bivariate analysis – Two variables are taken into consideration for analysis, which includes both summary statistics and visualization.

Multivariate analysis – Multiple variables are taken for correlation analysis, which includes both summary statistics and visualization.
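
The three kinds of descriptive analysis above can be sketched with the Python standard library alone. This is a minimal illustration and not the platform's implementation: the function names, the use of `None` for missing values and the choice of Pearson correlation are assumptions made for the example.

```python
import statistics


def univariate_summary(values):
    """Summary statistics for one numeric column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "mean": statistics.fmean(present),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
    }


def pearson(xs, ys):
    """Bivariate analysis: Pearson correlation of two equally long columns."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(columns):
    """Multivariate analysis: pairwise correlations, i.e. the data behind a heatmap."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

For instance, `univariate_summary([120, 0, 300, None, 180])` reports one missing value and one zero, and `correlation_matrix` can be given a dictionary that maps column names to lists of values.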

The predictive analytics problem here covers both classification and regression; which one applies depends on the problem statement and the target response variable. For example, if the response variable is binary, like ‘Call-Success’ with two categories, ‘Yes’ and ‘No’, then it is a binomial classification problem. If the response variable is continuous and takes numeric values, such as ‘Call Duration’ or ‘No. of Agents’, then it is a regression problem.
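
The rule described above can be expressed as a small helper that inspects the response column. This is only a sketch under simple assumptions: the function name is hypothetical, and treating any numeric column with more than two distinct values as continuous is a simplification that a real platform would probably let the user override.

```python
def infer_problem_type(response_values):
    """Guess the predictive task from the values of the response variable."""
    present = [v for v in response_values if v is not None]
    distinct = set(present)
    if len(distinct) < 2:
        raise ValueError("the response variable needs at least two distinct values")
    # bool is a subclass of int in Python, so exclude it from the numeric check.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if all_numeric and len(distinct) > 2:
        return "regression"  # e.g. 'Call Duration'
    if len(distinct) == 2:
        return "binomial classification"  # e.g. 'Call-Success' with 'Yes'/'No'
    return "multinomial classification"
```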

Approach and Features:

User can upload dataset: A dataset (CSV, with any number of rows and columns) can be uploaded from a local server into the ML platform, which eventually saves it to a centralized object storage platform. Gradually, the supported data sources can be extended to RDBMS, JSON, etc.
Importing dataset: A saved dataset can be imported into the platform cache, such as Redis, via the ML-API.
Setup: The imported dataset is now ready for parse setup, which shows all the columns and data types of the dataset. This simply describes the basic structure of the data.
Parse: Next, the dataset is parsed. Here, the user can change the names and data types of all the columns and save the updated data frame into cloud storage in CSV format. The parse page can also offer a double check of numeric variables: if the platform detects the data type of a column as numeric, but the values inside the column are not purely numeric (e.g. a text like 90.90%, where the % sign is not numeric), then a message pops up listing the columns that need to be rechecked. If everything is fine, parsing is carried out and the platform shows summary statistics such as min, max, median, number of missing values, number of zeros, cardinality, etc.
Split frame: The dataset can be split into two or three subsets, such as train, validate and test data, for predictive analytics.
Descriptive analysis: For univariate, bivariate and multivariate analysis, data visualization (heatmap, vertical bar chart, scatter plot, frequency plot, histogram, etc.) can be implemented along with some basic summary statistics. Variable importance within descriptive analytics is used to find the percentage impact of the predictors on the target variable. In the multivariate analysis, there is an option to choose at most 20 continuous variables and view their correlation as a heatmap. Additional features such as filter and drop can be provided to explore the correlation values.
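
The numeric double check described in the Parse step could look roughly like the sketch below. It is an illustration only: the function names, the `"numeric"` type label and the representation of a dataset as a dictionary of raw text columns are assumptions, not the platform's actual interface.

```python
def is_number(text):
    """True when the raw cell text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def columns_to_recheck(raw_columns, detected_types):
    """List the columns detected as numeric whose raw values are not all numeric.

    raw_columns maps a column name to its cell values as text;
    detected_types maps a column name to the type the platform detected.
    """
    flagged = []
    for name, cells in raw_columns.items():
        if detected_types.get(name) != "numeric":
            continue
        # Empty cells are missing values, not type mismatches, so skip them.
        non_empty = [c.strip() for c in cells if c is not None and c.strip() != ""]
        if any(not is_number(c) for c in non_empty):
            flagged.append(name)
    return flagged
```

With a column such as `["90.90%", "85.10%"]` detected as numeric, the function returns that column's name so it can be shown in the recheck message.
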
Apart from these, features like trend, seasonality, frequency and detection of anomalies can also be designed.
Trend: A trend is the long-term direction of the data over time. It can be of different types, such as increasing or decreasing. For example, we may see that the call duration over the year for a certain issue category has either an increasing or a decreasing trend.
Seasonality: Seasonality tells us whether the data shows some specific behaviour over a specific time interval. For clarity, let’s take call duration as an example. It may show seasonality because customers may tend to have longer calls during a certain campaign period.
Frequency: The term frequency distribution is generally used for categorical variables. Let’s take an example of a categorical variable called ‘grade’, which can be A, B, C, D, etc. The frequency distribution tells us how many occurrences there are for each level of the category ‘grade’. Although a frequency distribution counts occurrences per category, we can calculate the frequency of a numeric variable too. To do so, we bin or bucket the whole variable into a number of intervals (for example, deciles) and count the frequency within each bucket.
Anomaly: An anomaly or outlier is a data point that behaves differently from the other data points. Whether a point really is an outlier has to be decided based on business knowledge. If the user thinks that certain data points are outliers, then he/she can drop them from the dataset to obtain a better model and understanding of the data. The ML platform can suggest data points that are possibly outliers so that they can be reviewed.
The algorithm used to detect possible outliers is KNN (K-Nearest Neighbors), and the outlier fraction can be set to 0.05, i.e. 5%.
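
One way to turn KNN into an outlier detector, sketched below, is to score each point by the distance to its k-th nearest neighbour and flag the top 5% of scores for review. The article does not specify the scoring rule, so this choice, the function name and the default `k` are all assumptions made for illustration.

```python
import math


def knn_outlier_candidates(points, k=5, outlier_fraction=0.05):
    """Return the indices of the points that look most isolated.

    Each point is scored by the distance to its k-th nearest neighbour;
    the outlier_fraction of points with the largest scores is flagged.
    """
    n = len(points)
    if n <= k:
        raise ValueError("need more points than neighbours")
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    n_flagged = max(1, math.ceil(outlier_fraction * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flagged])
```

The function only suggests candidates. In line with the text above, the decision to drop them stays with the user. The brute-force distance computation is quadratic in the number of rows, so a larger dataset would need an indexed neighbour search.
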

Predictive Analysis

The predictive analysis is subdivided into two parts: model building and prediction.

Model building: The dataset is first imported from the cloud, and then the appropriate algorithm is chosen based on the type of the response variable (whether it is continuous or categorical).

Algorithms used: The algorithms used to build the models are Random forest, Generalized linear modelling (GLM), XGBoost and Gradient boost. One may need to choose the distribution family for the chosen algorithm: Binomial, Multinomial or Gaussian. The first two are applicable to GLM only, while Gaussian can be chosen for Random forest, Gradient boost, XGBoost and GLM alike.

Validation metrics and charts: When the response variable is a continuous numeric variable, R squared can be used as the model metric and a fit chart can be implemented for visualization. When the response is a binary categorical variable, AUC (area under the curve) can be taken as the validation metric, along with the corresponding confusion matrix (heatmap). Similarly, for a multinomial response, the confusion matrix can be shown along with the true positive rate as the accuracy measure. After the model is built, the platform provides an option to save the model to the cloud so that it can be reused for prediction.
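
The metrics named above can be computed directly from actual and predicted values. The sketch below is a plain-Python illustration, not the platform's code. The function names are hypothetical, and AUC is computed through its pairwise interpretation (the share of positive/negative pairs that the model ranks correctly), which is equivalent to the area under the ROC curve.

```python
def r_squared(actual, predicted):
    """Coefficient of determination for a continuous response."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def confusion_matrix(actual, predicted, labels):
    """Counts indexed as matrix[actual_label][predicted_label]."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def true_positive_rate(matrix, label):
    """Share of rows with this actual label that were predicted correctly."""
    row = matrix[label]
    return row[label] / sum(row.values())


def auc(actual, scores, positive="Yes"):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```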

Prediction: The datasets and the corresponding models, which were already built during model building, are first imported from the cloud. One can then apply a saved model to a completely new dataset, which shows how well the model generalizes. The platform also provides an option to save the prediction result. All the saved data, the model names, the names of the graphs, etc. can be found inside the ‘Workspace’ as separate transactions. This is a very useful feature for keeping track of the operations that a particular user has performed inside the platform. Trend, Seasonality, Frequency and Anomaly features can also be implemented in predictive analytics.

Implementation methodology & challenges

Core Python can be used to implement the gateway of the microservices / APIs, which eventually invokes the REST API.
List of microservices:
- Import dataset
- Parse setup
- Parse
- Split frame
- Summary statistics
- Model build (GLM)
- Model build (GBM, XGBoost, Random forest)
- Job check (every functionality corresponds to a particular job id; only if the job status is done can it be inferred that the step has executed successfully)
- Prediction
- Predicted column values
- Export and import model from cloud storage
- Export and import frame from cloud storage
- Delete frames and models from memory
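
The job check in the list above can be sketched as a polling loop in the Python gateway. Everything specific here is an assumption for illustration: the status strings (`"DONE"`, `"FAILED"`, `"CANCELLED"`), the function name and the `fetch_status` callable, which stands in for whatever REST call returns the status of a job id.

```python
import time


def wait_for_job(job_id, fetch_status, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll a job until it is done; a step counts as successful only then.

    fetch_status is a caller-supplied function that takes a job id and
    returns its current status as a string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(job_id)
        if status == "DONE":
            return True
        if status in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"job {job_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

Passing the status lookup in as a function keeps the loop independent of the HTTP client, so each microservice (parse, split frame, model build, and so on) can reuse it with its own endpoint.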

List of custom APIs:
- Univariate analysis
- Bivariate analysis
- Correlation analysis and its filter and drop feature
- Trend (For both descriptive and predictive)
- Seasonality (For both descriptive and predictive)
- Frequency (For both descriptive and predictive)
- Anomaly (For both descriptive and predictive)
- Retrieve the shape of the data set
- Delete frame and model from cloud storage
- Check column type if numeric or not

Business benefits

Apply machine learning solutions to real-world problems
Identify whether the data you have is sufficient for ML
Carry a project through various ML phases including training, evaluation, and deployment
Practise AI responsibly and avoid reinforcing existing bias
Discover ML use cases

Conclusion

Such an ML platform is a fully managed service that covers the entire machine learning workflow: labelling and preparing your data, choosing an algorithm, training the model, tuning and optimizing it for deployment, making predictions, and taking action. Your models get to production faster, with much less effort and at lower cost. This accelerates the ML process and assists developers and data scientists, as well as core business users, in making sense of data and taking the necessary actions. With a microservice and API-based architecture, the output can further be fed to a richer visualization platform.
