Unsupervised Machine Learning Cheat Sheet

In this cheat sheet, you'll find a guide to the top unsupervised machine learning algorithms, their advantages and disadvantages, and their use cases.

Dec 2022 · 9 min read

Unsupervised Learning

Unsupervised learning is about discovering general patterns in data without labeled examples. The most popular example is clustering, or segmenting, customers and users. This type of segmentation is generalizable and can be applied broadly, such as to documents, companies, and genes. This cheat sheet covers clustering models, which learn to group similar data points together; association rule algorithms, which find items that frequently occur together, subject to pre-defined thresholds such as support and confidence; and dimensionality reduction methods, which compress many features into fewer.

Clustering Models

Algorithm	Description and Application	Advantages	Disadvantages
K-Means	The most common clustering approach, which assumes that the closer data points are to each other, the more similar they are. It determines K clusters based on Euclidean distances.	Scales to large datasets. Interpretable & explainable results. Can generate tight clusters.	Requires defining the expected number of clusters in advance. Not suitable for identifying clusters with non-convex shapes.
DBSCAN	Density-Based Spatial Clustering of Applications with Noise (DBSCAN) can handle non-linear cluster structures purely based on density. It separates high-density regions from low-density regions, thereby creating clusters.	No assumption on the expected number of clusters. Can handle noisy data and outliers. No assumptions on the shapes and sizes of the clusters.	Requires optimization of two parameters (the neighborhood radius and the minimum number of points). A single global radius can make it struggle when cluster densities vary widely. Can struggle with very high-dimensional data.
HDBSCAN	A density-based algorithm that works in roughly two steps: finding the core distance of each point, then expanding clusters from them. It extends DBSCAN by converting it into a hierarchical clustering algorithm.	No assumption on the expected number of clusters. Can handle noisy data and outliers. No assumptions on the shapes and sizes of the clusters. Can identify clusters with different densities.	Mapping of unseen objects in HDBSCAN is not straightforward. Can be computationally expensive.
Agglomerative Hierarchical Clustering	A bottom-up hierarchical approach: distances between samples are computed with the chosen metric, and the closest pairs of clusters are merged step by step according to the linkage type.	There is no need to specify the number of clusters. With the right linkage, it can be used for the detection of outliers. Interpretable results using dendrograms.	Specifying the metric and linkage types requires a good understanding of the statistical properties of the data. Not straightforward to optimize. Can be computationally expensive for large datasets.
OPTICS	A density-based algorithm that finds core samples of high density and expands clusters from them. It orders points using a core distance and a reachability distance, with an optional maximum radius (ε).	No assumption on the expected number of clusters. Can handle noisy data and outliers. No assumptions on the shapes and sizes of the clusters. Can identify clusters with different densities. Not required to define a fixed radius as in DBSCAN.	Produces only a cluster ordering, from which clusters must still be extracted. Does not work well with very high-dimensional data. Slower than DBSCAN.
Gaussian Mixture Models	Gaussian Mixture Models (GMMs) use probabilistic models to detect clusters as a mixture of normal (Gaussian) distributions.	Provides uncertainty measures for each observation. Can identify overlapping clusters.	Requires defining the expected number of clusters or mixture components in advance. The covariance type needs to be defined for the mixture of components.
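
As a rough illustration of the K-Means row above, the sketch below implements the two alternating steps of the standard Lloyd's algorithm in plain Python: assign each point to its nearest centroid by Euclidean distance, then recompute each centroid as the mean of its members. The function name `kmeans`, the toy points, and the seeded random initialization are assumptions made for this example and do not come from the cheat sheet; real projects would normally use a library implementation with smarter initialization and multiple restarts.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm sketch; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that the number of clusters `k` has to be chosen up front, which is the first disadvantage listed in the table. Which group receives label 0 or 1 depends on the initialization, so only the grouping itself is meaningful.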

Association Rules

Algorithm	Description and Application	Advantages	Disadvantages
Apriori algorithm	The Apriori algorithm uses the join and prune steps iteratively to identify the frequent itemsets in the given dataset. Prior knowledge (a priori) of frequent itemset properties is used in the process.	Explainable & interpretable results. Exhaustive approach based on confidence and support.	Computationally expensive, since it requires multiple scans of the dataset. Can generate a very large number of candidate itemsets.
FP-growth algorithm	Frequent Pattern growth (FP-growth) is an improvement on the Apriori algorithm for finding frequent itemsets. It generates a conditional FP-tree for every item in the data.	Explainable & interpretable results. Smaller memory footprint than the Apriori algorithm.	More complex algorithm to build than Apriori. Can result in many (incremental) overlapping or trivial itemsets.
FP-Max algorithm	A variant of Frequent Pattern growth that is focused on finding maximal itemsets.	Explainable & interpretable results. Smaller memory footprint than the Apriori and FP-growth algorithms.	More complex algorithm to build than Apriori.
Eclat	Equivalence Class Clustering and Bottom-Up Lattice Traversal (Eclat) uses a depth-first search procedure. It is a more efficient and scalable version of the Apriori algorithm.	Explainable & interpretable results. Computationally faster than the Apriori algorithm.	Can provide only a subset of results in contrast to the Apriori algorithm and its variants.
Hypergeometric Networks	HNet learns associations from datasets with mixed data types (discrete and continuous variables) and with unknown functions. Associations are statistically tested using the hypergeometric distribution to find frequent itemsets.	Explainable & interpretable results. More robust against spurious associations, as it uses statistical inference. Can associate discrete variables (itemsets) in combination with continuous measurements. Can handle missing values.	Computationally intensive for very large datasets.
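
To make the join and prune steps in the Apriori row above more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function name `apriori`, the `min_support` threshold of 0.5, and the shopping-basket transactions are all hypothetical, and the naive support counting rescans every transaction for each candidate, which is exactly the cost the table warns about.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=0.5)

# Confidence of the rule {diapers} -> {beer}:
# support({diapers, beer}) / support({diapers})
confidence = (frequent[frozenset({"diapers", "beer"})]
              / frequent[frozenset({"diapers"})])

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
print("confidence(diapers -> beer):", round(confidence, 2))
```

With this toy data, the candidates {bread, diapers, beer} and {milk, diapers, beer} are discarded in the prune step without being counted, because each contains a pair (bread and beer, or milk and beer) that is not itself frequent.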

Dimensionality Reduction

Algorithm	Description and Application	Advantages	Disadvantages
PCA	Principal Component Analysis (PCA) is a feature extraction approach that uses a linear function to reduce dimensionality in datasets while minimizing information loss.	Explainable & interpretable results. New unseen data points can be mapped into the existing PCA space. Can be used as a dimensionality reduction technique as a preliminary step to other machine learning tasks. Helps reduce overfitting. Helps remove correlated features.	Sensitive to outliers. Requires data standardization.
t-SNE	t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction method that converts similarities between data points to joint probabilities, using the Student t-distribution in the low-dimensional space.	Helps preserve the relationships seen in high dimensionality. Easy to visualize the structure of high-dimensional data in 2 or 3 dimensions. Very effective for visualizing clusters or groups of data points and their relative proximities.	The cost function is not convex; different initializations can give different results. Computationally intensive for large datasets. Default parameters do not always achieve the best results.
UMAP	Uniform Manifold Approximation and Projection (UMAP) constructs a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible.	Can be used as a general-purpose dimensionality reduction technique as a preliminary step to other machine learning tasks. Can be very effective for visualizing clusters or groups of data points and their relative proximities. Able to handle high-dimensional sparse datasets.	Default parameters do not always achieve the best results.
ICA	Independent Component Analysis (ICA) is a linear dimensionality reduction method that aims to separate a multivariate signal into additive subcomponents under the assumption that the independent components are non-Gaussian. Where PCA "compresses" the data, ICA "separates" the information.	Can separate multivariate signals into their subcomponents. Clear aim of the method; only applicable if there are multiple independent generators of information to uncover. Can extract hidden factors in the data by transforming a set of variables to a new set that is maximally independent.	Without any prior knowledge, determining the number of independent components or sources can be difficult. PCA is often required as a pre-processing step.
PaCMAP	Pairwise Controlled Manifold Approximation (PaCMAP) is a dimensionality reduction method that optimizes low-dimensional embeddings using three kinds of point pairs: neighbor pairs, mid-near pairs, and further pairs.	Can preserve both the local and global structure of the data in the original space. Performance is relatively robust within reasonable parameter choices.	Parameters have been tuned on smaller datasets, and it is not yet known how it behaves on very high-dimensional datasets.
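
The PCA row above can be sketched in a few lines of plain Python. The example below is only an illustration: it centers the data, builds the sample covariance matrix, and uses power iteration (a stand-in chosen for brevity, not the solver a library would typically use) to find the first principal component. The function name `first_principal_component` and the toy measurements are assumptions for this example. The sketch only centers the features; as the table notes, features on different scales would normally be standardized first.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, explained_variance_ratio, scores) for PC1."""
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    # Caveat: a start vector orthogonal to PC1 would not converge to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v, relative to the total variance.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    # Project the centered data onto the component (1-D representation).
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, explained, scores


# Hypothetical 2-D data lying close to the line y = 2x.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained, scores = first_principal_component(rows)
print("direction:", [round(x, 3) for x in direction])
print("explained variance ratio:", round(explained, 4))
print("1-D scores:", [round(s, 2) for s in scores])
```

Because the two features are almost perfectly correlated, a single component retains nearly all of the variance, which is the "helps remove correlated features" advantage in practice. A new data point can be mapped into the same space by subtracting the stored means and taking the dot product with `direction`.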
