Supervised Machine Learning Cheat Sheet

This cheat sheet is a guide to the most widely used supervised machine learning algorithms, their advantages and disadvantages, and their use cases.

Dec 2022 · 5 min read

Supervised Learning

Supervised learning models map inputs to outputs and attempt to apply patterns learned from past data to unseen data. They can be either regression models, where we try to predict a continuous variable such as a stock price, or classification models, where we try to predict a binary or multi-class variable, such as whether a customer will churn. The sections below cover three groups of supervised learning models: regression-only models, models for both regression and classification, and classification-only models.
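To make the distinction concrete, here is a minimal sketch using only the Python standard library. The data, the variable names, and the two toy models (a least-squares line for regression and a nearest-class-mean rule for classification) are assumptions chosen for brevity, not methods prescribed by this cheat sheet.

```python
from statistics import mean


def fit_line(xs, ys):
    """Regression: least-squares line through (x, y) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_centroids(xs, labels):
    """Classification: store the mean input value of each class."""
    return {
        label: mean([x for x, l in zip(xs, labels) if l == label])
        for label in set(labels)
    }


def predict_label(centroids, x):
    """Assign the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Regression: past data maps a month index to a price (invented values).
months = [1, 2, 3, 4]
prices = [10, 20, 30, 40]
slope, intercept = fit_line(months, prices)
predicted_price = slope * 5 + intercept  # unseen input: month 5

# Classification: past data maps support tickets to churn (invented values).
tickets = [0, 1, 1, 6, 7, 8]
outcomes = ["stays", "stays", "stays", "churns", "churns", "churns"]
centroids = fit_centroids(tickets, outcomes)
predicted_outcome = predict_label(centroids, 5)  # unseen input: 5 tickets

print(predicted_price, predicted_outcome)
```

In both cases the model is fitted on labeled past data and then applied to an input it has not seen. The only difference is the type of output: a number for regression, a class label for classification.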

Regression Only Models

Algorithm	Description and Application	Advantages	Disadvantages
Linear Regression	Linear Regression models a linear relationship between input variables and a continuous numerical output variable. The default loss function is the mean squared error (MSE).	Fast training because there are few parameters. Results are interpretable and explainable through the output coefficients.	Assumes a linear relationship between input and output variables. Sensitive to outliers. Typically generalizes worse than ridge or lasso regression.
Polynomial Regression	Polynomial Regression models a nonlinear relationship between the dependent and independent variables as an n-th degree polynomial.	Provides a good approximation of the relationship between the dependent and independent variables. Capable of fitting a wide range of curvature.	Poor interpretability of the coefficients, since the polynomial terms can be highly correlated. Although the fitted curve is nonlinear, the model is still linear in its coefficients. Prone to overfitting.
Support Vector Regression	Support Vector Regression (SVR) uses the same principle as SVMs but optimizes the cost function to fit the flattest possible line (or plane) through the data points. With the kernel trick, it can efficiently perform non-linear regression by implicitly mapping its inputs into high-dimensional feature spaces.	Robust against outliers. Effective learning and strong generalization performance. Different kernel functions can be specified for the decision function.	Does not perform well with large datasets. Tends to underfit in cases where the number of variables is much smaller than the number of observations.
Gaussian Process Regression	Gaussian Process Regression (GPR) uses a Bayesian approach that infers a probability distribution over the possible functions that fit the data. The Gaussian process is a prior that is specified as a multivariate Gaussian distribution.	Provides uncertainty measures on the predictions. It is a flexible non-linear model that fits many datasets well. Performs well on small datasets, as the GP kernel allows a prior to be specified on the function space.	A poor choice of kernel can make convergence slow. Specifying custom kernels requires deep mathematical understanding.
Robust Regression	Robust Regression is an alternative to least squares regression when data is contaminated with outliers. The term "robust" refers to the statistical capability to provide useful information even in the face of outliers.	Designed to overcome some limitations of traditional parametric and non-parametric methods. Provides better regression coefficient estimates than classical regression methods when outliers are present.	More computationally intensive than classical regression methods. It is not a cure-all for every data problem, such as imbalanced or poor-quality data. If no outliers are present in the data, it may not provide better results than classical regression methods.
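The outlier sensitivity noted for linear regression, and the benefit of a robust alternative, can be illustrated with a short sketch. The table above does not name a specific robust method, so the Theil-Sen estimator (the median of all pairwise slopes) is used here purely as an assumption for illustration. The data is invented: points on the line y = 2x + 1 with one value replaced by an outlier.

```python
from statistics import mean, median


def ols_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def theil_sen_line(xs, ys):
    """Median of all pairwise slopes, then the median offset as intercept."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 100  # hypothetical outlier: the clean value would be 19

ols_slope, ols_intercept = ols_line(xs, ys)
ts_slope, ts_intercept = theil_sen_line(xs, ys)
print(f"OLS:       slope={ols_slope:.2f}, intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={ts_slope:.2f}, intercept={ts_intercept:.2f}")
```

On this data the single outlier pulls the least-squares slope well above 2, while the median-based estimate recovers the slope 2 and intercept 1 of the clean points. Computing every pairwise slope also shows why robust methods tend to cost more than classical least squares.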

Both Regression and Classification Models

Algorithm	Description and Application	Advantages	Disadvantages
Decision Trees	Decision Tree models learn decision rules on the input variables and arrange them in a flowchart-like tree structure that splits the data. They can be used for both regression and classification.	Explainable and interpretable. Can handle missing values.	Prone to overfitting. Can be unstable: small changes in the data can produce a very different tree. Sensitive to outliers.
Random Forest	Random Forest models learn using an ensemble of decision trees. For classification, the output is a majority vote of the individual trees; for regression, it is the average of their predictions.	Effective learning and better generalization performance. Can handle moderately large datasets. Less prone to overfitting than decision trees.	A large number of trees can slow down performance. Predictions are sensitive to outliers. Hyperparameter tuning can be complex.
Gradient Boosting	An ensemble learning method where weak predictive learners are combined to improve accuracy. Popular implementations include XGBoost, LightGBM and more.	Handles multicollinearity. Handles non-linear relationships. Effective learning and strong generalization performance. XGBoost is fast and is often used as a benchmark algorithm.	Sensitive to outliers, which can lead to overfitting. High complexity due to hyperparameter tuning. Computationally expensive.
Ridge Regression	Ridge Regression penalizes large coefficients (an L2 penalty), shrinking the coefficients of variables with low predictive power toward, but not exactly to, zero. It can be used for classification and regression.	Less prone to overfitting. Best suited when data suffers from multicollinearity. Explainable & Interpretable.	All the predictors are kept in the final model. Doesn't perform feature selection.
Lasso Regression	Lasso Regression penalizes features that have low predictive power by shrinking their coefficients to exactly zero (an L1 penalty). It can be used for classification and regression.	Good generalization performance. Good at handling datasets where the number of variables is much larger than the number of observations. Performs feature selection implicitly, so a separate selection step is often unnecessary.	Poor interpretability/explainability, as it can keep an arbitrary single variable from a set of highly correlated variables.
AdaBoost	Adaptive Boosting uses an ensemble of weak learners combined into a weighted sum that represents the final output of the boosted model.	Explainable & Interpretable. Less need for tweaking parameters. Can outperform Random Forest on some problems. Less prone to overfitting as the input variables are not jointly optimized.	Sensitive to noisy data and outliers.
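The practical difference between the ridge and lasso rows above (shrinking coefficients versus setting some of them to zero) can be seen in a small sketch. It assumes a plain coordinate-descent solver written for illustration, an invented dataset with two uncorrelated features where only the first one strongly drives the target, and a penalty strength `lam` picked by hand. None of this comes from a specific library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the dead zone."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=100):
    """Minimize 0.5 * sum((y - Xw)^2) plus a penalty, one coefficient at a time.

    penalty="l1" adds lam * sum(|w_j|)         (lasso)
    penalty="l2" adds 0.5 * lam * sum(w_j^2)   (ridge)
    """
    n_rows, n_cols = len(X), len(X[0])
    w = [0.0] * n_cols
    for _ in range(sweeps):
        for j in range(n_cols):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n_rows):
                pred_without_j = sum(
                    X[i][k] * w[k] for k in range(n_cols) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
            z = sum(X[i][j] ** 2 for i in range(n_rows))
            if penalty == "l1":
                w[j] = soft_threshold(rho, lam) / z
            else:
                w[j] = rho / (z + lam)
    return w


# Invented, centered data: y = 3 * x0 + 0.1 * x1, with x0 and x1 uncorrelated.
X = [[-2, 1], [-1, -2], [0, 0], [1, 2], [2, -1]]
y = [-5.9, -3.2, 0.0, 3.2, 5.9]

ridge_w = coordinate_descent(X, y, lam=2.0, penalty="l2")
lasso_w = coordinate_descent(X, y, lam=2.0, penalty="l1")
print("ridge:", ridge_w)
print("lasso:", lasso_w)
```

With this data, ridge keeps a small non-zero weight on the weak second feature, while lasso drops it to exactly zero. That is why lasso acts as a form of feature selection and ridge keeps all predictors in the model.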

Classification Only Models

Algorithm	Description and Application	Advantages	Disadvantages
SVM	In its simplest form, a support vector machine is a linear classifier. With the kernel trick, it can efficiently perform non-linear classification by implicitly mapping its inputs into high-dimensional feature spaces. This flexibility makes SVM a strong general-purpose classifier.	Effective in cases with a high number of variables. Number of variables can be larger than the number of samples. Different kernel functions can be specified for the decision function.	Sensitive to overfitting, so regularization is crucial. Choosing a "good" kernel function can be difficult. Computationally expensive for big data due to high training complexity. Performs poorly if the data is noisy (target classes overlap).
Nearest Neighbors	Nearest Neighbors predicts the label based on a predefined number of samples closest in distance to the new point.	Successful in situations where the decision boundary is irregular. Non-parametric approach, as it does not make any assumption about the underlying data.	Sensitive to noisy and missing data. Computationally expensive because every prediction requires the entire set of n training points.
Logistic Regression (and its extensions)	Logistic regression models a linear relationship between the input variables and the log-odds of the response variable. It predicts the probability of a binary outcome (0 or 1) rather than a numeric value.	Explainable & Interpretable. Less prone to overfitting when regularization is used. Applicable for multi-class predictions.	Assumes a linear relationship between the input variables and the log-odds of the response. Multicollinearity can cause the model to easily overfit without regularization.
Linear Discriminant Analysis	Linear Discriminant Analysis finds a linear combination of features that maximizes the separability between the classes, which yields a linear decision boundary.	Explainable & Interpretable. Applicable for multi-class predictions.	Multicollinearity can cause the model to overfit. Assumes that all classes share the same covariance matrix. Sensitive to outliers. Doesn't work well with small class sizes.
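As one worked example from the table above, the nearest neighbors rule is simple enough to sketch in a few lines. The function name `knn_predict`, the Euclidean distance, the choice of k = 3, and the sample points are all assumptions made for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k closest training points."""
    # Every prediction measures the distance to every stored training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented training data: two small clusters with labels "a" and "b".
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1, 1)))
print(knn_predict(train_points, train_labels, (4, 4)))
```

There is no training step: the model is the stored data. This explains both the flexibility with irregular decision boundaries and the prediction cost listed under disadvantages.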
