Chapter 1: History
Unsupervised machine learning, a subfield of artificial intelligence, aims to discover patterns and structure in data without using labeled responses. (Labeled responses are data points with known outcomes or categories; in a dataset of animal images, for example, each image might be labeled "cat" or "dog." Unsupervised learning does not use these labels.) Its origins trace back to early work in statistics and pattern recognition — the classification of data based on statistical regularities, trends, or structures within datasets — in the mid-20th century.
- 1950s–1960s: Early clustering methods (e.g., k-means, developed by Stuart Lloyd in 1957) and principal component analysis (PCA), introduced by Karl Pearson (1901) and developed by Harold Hotelling (1933), laid the groundwork. Clustering groups data points so that points in the same cluster are more similar to each other than to points in other clusters; PCA reduces the number of variables in a dataset by transforming them into a new set of uncorrelated variables (principal components) that retain most of the variance.
- 1970s–1980s: Emergence of algorithms like hierarchical clustering, which builds a hierarchy of clusters by merging smaller clusters into larger ones (agglomerative) or splitting larger ones into smaller ones (divisive), and self-organizing maps (SOMs), a type of artificial neural network that clusters and visualizes high-dimensional data on a 2D grid while preserving topological relationships (Teuvo Kohonen, 1982).
- 1990s–2000s: Advances in density-based clustering (DBSCAN, 1996), Gaussian mixture models (GMMs), and manifold learning. Density-based clustering groups points that are closely packed (dense regions) and marks points in sparse regions as outliers; DBSCAN is a popular example. GMMs are probabilistic models that assume data points are generated from a mixture of several Gaussian distributions with unknown parameters, often used for clustering and density estimation. Example: record the heights of everyone at a theme park, children and adults alike; the overall distribution forms two overlapping bell curves, and a GMM can automatically discover the two groups by modeling the data as a combination of two Gaussians. (Illustration: the sum of two Gaussian curves, one for children and one for adults, models the overall distribution.) Manifold learning is a set of dimensionality-reduction techniques based on the idea that high-dimensional data often lies on a lower-dimensional manifold embedded in the higher-dimensional space. Example: a long, curly ribbon on a table looks like a twisty, two-dimensional shape from above; manifold learning "unrolls" the ribbon into a straight line so its structure can be studied more simply. (Illustration: a curly "S" shape in high dimensions is mapped to a straight line in low dimensions.)
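The theme-park example can be reproduced with a short expectation-maximization (EM) loop. The following is a minimal NumPy sketch, not a production GMM: the height data, the initial parameter guesses, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical theme-park heights in cm: children near 120, adults near 170.
rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(120, 8, 200), rng.normal(170, 9, 300)])

# Initial guesses for the two-component mixture.
mu = np.array([100.0, 180.0])      # component means
sigma = np.array([10.0, 10.0])     # component standard deviations
pi = np.array([0.5, 0.5])          # mixing weights

def gauss(x, m, s):
    """Gaussian density, broadcast over components."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each component for each point.
    r = pi * gauss(heights[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    n = r.sum(axis=0)
    mu = (r * heights[:, None]).sum(axis=0) / n
    sigma = np.sqrt((r * (heights[:, None] - mu) ** 2).sum(axis=0) / n)
    pi = n / len(heights)

print(sorted(mu))  # the two recovered means, near the child and adult averages
```

With reasonable starting values the loop settles on one mean near the children's average and one near the adults', exactly the "two overlapping bell curves" intuition above.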
- 2010s–Present: Unsupervised deep learning (autoencoders, GANs) and applications in big data, image, and text analysis. Deep learning uses neural networks with many layers to learn hierarchical representations of data; in unsupervised settings it includes models like autoencoders and GANs. Big data refers to extremely large, complex datasets that require advanced methods to store, process, and analyze; unsupervised learning helps find patterns in them. Example: a deep model can learn to represent handwritten digits in layers: early layers detect simple features like lines and curves, while deeper layers combine these into digit shapes, building complex structure by stacking many simple pieces. (Illustration: input features are transformed through multiple hidden layers, creating increasingly complex representations.)
Chapter 2: Major Contributors
- Karl Pearson — Introduced Principal Component Analysis (PCA) in 1901.
- Stuart Lloyd — Developed Lloyd's algorithm (k-means) in 1957. (Illustration: k-means finds group centers, shown as stars, and assigns points, shown as dots, to the nearest group.)
- Teuvo Kohonen — Creator of Self-Organizing Maps (SOMs) in 1982.
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu — Developed DBSCAN in 1996. (Illustration: DBSCAN groups dense regions and marks outliers, shown as crosses, as noise.)
- Geoffrey Hinton — Advanced unsupervised deep learning via autoencoders and deep belief nets (DBNs). (Illustration: an autoencoder compresses data through a bottleneck and reconstructs it; a deep belief net stacks multiple restricted Boltzmann machines for hierarchical feature learning, receiving raw input such as pixel values at its input layer.)
Chapter 3: Major Algorithms
- k-means Clustering: Partitions data into k clusters by minimizing within-cluster variance.
- Hierarchical Clustering: Builds a hierarchy of clusters using agglomerative or divisive approaches.
- Principal Component Analysis (PCA): Reduces data dimensionality by finding new axes (principal components) that retain most of the data variance.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups data points that are closely packed together, marking outliers as noise.
- Gaussian Mixture Models (GMM): Models data as a mixture of multiple Gaussian distributions.
- Autoencoders: Neural networks that learn to encode data efficiently, mainly used for dimensionality reduction or feature learning.
- Self-Organizing Maps (SOM): Map similar data points to nearby cells on a 2D grid, as in the color example below.

Visualizing Colors with Self-Organizing Maps (SOM)

How Self-Organizing Maps (SOM) Work
- Initialize a 2D grid of "neurons" (each with a weight vector, e.g. RGB color).
- For each data point (e.g., a color), find the "Best Matching Unit" (BMU) — the neuron whose weight is closest to the data point.
- Update the BMU and its neighbors to move closer to the data point (using a learning rate and neighborhood function).
- Repeat for many iterations, gradually reducing the learning rate and neighborhood size.
- Result: Similar colors are mapped to nearby neurons on the grid, forming smooth color regions.
Let's see a simple example.

Step 1: Fake Color Dataset
We'll use 25 random colors (as RGB triples) to mimic a palette: [ [237, 50, 55], [ 40, 230, 90], ... , [ 55, 65, 210] ]

Step 2: Organize on a 2D Grid
We'll lay out a 5×5 grid (25 neurons) and use a simple SOM algorithm so that each neuron learns to represent one of the input colors. Neighboring neurons on the grid will end up with similar colors.

Step 3: Result
The grid shows how the SOM has organized the original colors into a smooth map: each cell shows a learned color, and neighboring cells have similar colors, revealing natural groupings.
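The SOM steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a tuned implementation: the palette is 25 random colors rather than the exact values listed, and the learning-rate and neighborhood schedules are simple hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)
colors = rng.integers(0, 256, size=(25, 3)).astype(float)  # fake palette

# 5x5 grid of neurons, each holding an RGB weight vector.
grid = rng.integers(0, 256, size=(5, 5, 3)).astype(float)
ys, xs = np.mgrid[0:5, 0:5]                  # neuron grid coordinates

n_iters, lr0, rad0 = 2000, 0.5, 2.5
for t in range(n_iters):
    c = colors[rng.integers(len(colors))]    # pick a random input color
    # Best Matching Unit: the neuron whose weight is closest to c.
    d = ((grid - c) ** 2).sum(axis=2)
    by, bx = np.unravel_index(d.argmin(), d.shape)
    # Learning rate and neighborhood radius shrink over time.
    frac = 1 - t / n_iters
    lr, rad = lr0 * frac, max(rad0 * frac, 0.5)
    # Gaussian neighborhood around the BMU on the grid.
    dist2 = (ys - by) ** 2 + (xs - bx) ** 2
    h = np.exp(-dist2 / (2 * rad ** 2))
    grid += lr * h[:, :, None] * (c - grid)  # pull BMU and neighbors toward c
```

After training, adjacent cells of `grid` hold similar RGB values, which is exactly the "smooth color map" effect described in Step 3.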
Chapter 4: Easy-to-Understand Examples
Suppose you have a list of customers and you want to group them based on their spending patterns. The k-means algorithm is a simple, popular way to do this.
1. Choose k: Decide how many clusters (k) you want to find.
2. Initialize Centers: Place k cluster centers randomly in your data space.
3. Assign Points: Assign each customer to the nearest cluster center.
4. Update Centers: Move each center to the mean of all customers assigned to it.
5. Repeat: Repeat steps 3–4 until the clusters stop changing.
K-means finds structure in your data, even if you don't know the groups ahead of time.
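The five steps translate almost line-for-line into code. A minimal NumPy sketch, using hypothetical customer data with two obvious spending groups (the features and group locations are invented for illustration):

```python
import numpy as np

# Hypothetical customers: [annual spend, visits per month], two clear groups.
rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal([20, 2], 1.5, size=(30, 2)),   # low spenders
    rng.normal([80, 10], 1.5, size=(30, 2)),  # high spenders
])

k = 2                                                       # step 1: choose k
centers = customers[rng.choice(len(customers), k, replace=False)]  # step 2
for _ in range(100):
    # Step 3: assign each customer to the nearest center.
    dists = ((customers[:, None] - centers) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each center to the mean of its assigned customers
    # (keep the old center if a cluster happens to be empty).
    new_centers = np.array([
        customers[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    if np.allclose(new_centers, centers):                   # step 5: converged
        break
    centers = new_centers
```

With well-separated groups like these, the loop converges in a handful of iterations and `centers` ends up near the true group means.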
Another common task is reducing the dimensionality of such data with PCA. The main steps:
- Standardize the data: Subtract the mean and divide by the standard deviation for each feature.
- Compute the covariance matrix: Measure how variables vary together.
- Calculate eigenvectors and eigenvalues: Find the directions (principal components) that capture the most variance.
- Select top components: Keep the components with the largest eigenvalues (e.g., the top 2).
- Project the data: Transform the original data onto the selected principal components (lower dimensions).
Imagine you have 12 samples (customers), each with 100 features (spending in 100 product categories).
For demonstration, here are the first 3 samples (real PCA would use all 100!):
Sample 2: [2.0, -1.5, 0.8, ... , 1.0]
Sample 3: [1.9, -1.2, 0.3, ... , 0.9]
...
For each feature (column), subtract the mean and divide by the standard deviation.
The covariance matrix shows how features vary together.
Find the directions (principal components) that capture the most variance.
Suppose the top 2 eigenvectors are:
PC2: [0.05, 0.14, ..., 0.07]
Multiply your standardized data by the principal component vectors:
PC1 and PC2 are new axes that capture the most variance.
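The same five PCA steps in NumPy. For brevity this sketch uses 12 samples with 5 hypothetical features instead of the 100 categories described above; the data is random and purely illustrative.

```python
import numpy as np

# Hypothetical data: 12 samples x 5 features (stand-in for 100 categories).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))

# Step 1: standardize each feature (zero mean, unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features.
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)   # eigh is for symmetric matrices

# Step 4: keep the two components with the largest eigenvalues.
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Step 5: project the data onto the top two principal components.
projected = Z @ top2
print(projected.shape)  # (12, 2)
```

The first column of `projected` (PC1) captures at least as much variance as the second (PC2), by construction.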
DBSCAN can also be used to detect anomalies. The steps:
- Choose parameters: Set eps (the radius of a point's neighborhood) and minPts (the minimum number of neighbors needed to form a cluster).
- Classify points: For each point, count how many others lie within eps distance.
  - If at least minPts neighbors: core point (part of a cluster)
  - If not a core, but reachable from a core: border point
  - If not reachable: noise (anomaly)
- Expand clusters: Connect core points and their neighbors to form clusters.
- Result: Points not assigned to any cluster are considered anomalies.
Each point is a network session with 2 features: duration (x) and bytes transferred (y).
[7.0, 2.2], [7.3, 2.1], ... (another pattern)
[5.5, 13.5] (anomaly)
eps = 1.2, minPts = 3
Count how many other points are within eps distance.
Example: Point [2.2, 7.9] has 3 close neighbors ⇒ core point.
[5.5, 13.5] has no close neighbors ⇒ noise (anomaly).
Connect core and border points together. Any point not connected is an anomaly.
Points not assigned to any cluster are flagged as anomalies.
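A minimal hand-rolled DBSCAN over hypothetical session data in the spirit of this example. The exact point values here are invented to match the described pattern (two dense groups plus the [5.5, 13.5] outlier); they are not the original dataset.

```python
import numpy as np

# Hypothetical sessions [duration, bytes]; [5.5, 13.5] is the anomaly.
pts = np.array([
    [2.0, 8.0], [2.2, 7.9], [2.1, 8.2], [2.4, 8.1],  # pattern 1
    [7.0, 2.2], [7.3, 2.1], [7.1, 2.4], [6.9, 2.0],  # pattern 2
    [5.5, 13.5],                                      # anomaly
])
eps, min_pts = 1.2, 3

# Neighborhoods: indices of points within eps of each point (self included).
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
neighbors = [np.flatnonzero(row <= eps) for row in d]
core = [len(n) >= min_pts for n in neighbors]

labels = [-1] * len(pts)            # -1 = noise (anomaly)
cluster = 0
for i in range(len(pts)):
    if not core[i] or labels[i] != -1:
        continue
    labels[i] = cluster             # grow a new cluster from this core point
    frontier = list(neighbors[i])
    while frontier:
        j = frontier.pop()
        if labels[j] == -1:
            labels[j] = cluster
            if core[j]:             # only core points extend the cluster
                frontier.extend(neighbors[j])
    cluster += 1

print(labels)  # the lone [5.5, 13.5] point keeps the noise label -1
```

The two dense groups each become a cluster, while the isolated point has too few neighbors to be a core point or a border point and is flagged as an anomaly.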
A final example: compressing an image with an autoencoder. The steps:
- Input Layer: Each pixel of the image is an input node.
- Encoder: Compress the input into a small set of numbers (the bottleneck).
- Decoder: Expand the bottleneck back into a full-size image.
- Training: Adjust the network so the output image is as close as possible to the input image.
- Compression: After training, the bottleneck representation is a compressed version of the original image.
Suppose you have a 6×6 grayscale image (each value is a pixel brightness from 0 to 1):
[[0, 0, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 0],
[1, 1, 0.8, 0.8, 1, 1],
[1, 1, 0.8, 0.8, 1, 1],
[0, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 0, 0]]
How does the compression work?
The encoder learns to summarize the important structure of the image using only a few numbers.
In this example, we split the image into 4 quadrants (each 3×3 block):
- Quadrant 1 (top-left): upper-left 3×3 pixels
- Quadrant 2 (top-right): upper-right 3×3 pixels
- Quadrant 3 (bottom-left): lower-left 3×3 pixels
- Quadrant 4 (bottom-right): lower-right 3×3 pixels
bottleneck[0] = average(Quadrant 1) = 0.64
bottleneck[1] = average(Quadrant 2) = 0.64
bottleneck[2] = average(Quadrant 3) = 0.64
bottleneck[3] = average(Quadrant 4) = 0.64
In a real autoencoder, these would be learned values, not just averages—but the idea is the same: compress the information down to just the essentials.
How does the reconstruction work?
The decoder takes the 4 bottleneck values and "spreads" each value back over its corresponding 3×3 quadrant. So, every pixel in each quadrant gets set to the same value as its bottleneck:
- All pixels in Quadrant 1 become 0.64
- All pixels in Quadrant 2 become 0.64
- All pixels in Quadrant 3 become 0.64
- All pixels in Quadrant 4 become 0.64
(Compression: 36 → 4 values!)
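The quadrant-averaging scheme above can be written out directly. Note that this sketch implements the fixed averaging rule described in the example, not a trained neural network; a real autoencoder would learn its encoder and decoder weights.

```python
import numpy as np

# The 6x6 grayscale image from the example.
img = np.array([
    [0, 0, 1,   1,   0, 0],
    [0, 1, 1,   1,   1, 0],
    [1, 1, 0.8, 0.8, 1, 1],
    [1, 1, 0.8, 0.8, 1, 1],
    [0, 1, 1,   1,   1, 0],
    [0, 0, 1,   1,   0, 0],
])

def encode(x):
    # Bottleneck: the mean of each 3x3 quadrant (36 values -> 4 values).
    return np.array([x[r:r + 3, c:c + 3].mean() for r in (0, 3) for c in (0, 3)])

def decode(b):
    # Spread each bottleneck value back over its 3x3 quadrant.
    out = np.zeros((6, 6))
    for v, (r, c) in zip(b, [(0, 0), (0, 3), (3, 0), (3, 3)]):
        out[r:r + 3, c:c + 3] = v
    return out

bottleneck = encode(img)
print(bottleneck.round(2))   # [0.64 0.64 0.64 0.64], matching the text
recon = decode(bottleneck)   # blocky 6x6 reconstruction
```

The reconstruction is blocky because each quadrant collapses to a single number, which is the price paid for the 36 → 4 compression.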
Imagine you have a palette of hundreds of colors. You want to organize them so that similar colors are close together on a map. A Self-Organizing Map (SOM) can take all these colors and arrange them on a 2D grid, where neighboring cells represent similar colors. This helps you see color families and smooth gradients at a glance, making it easy to navigate and choose colors.
Why is this helpful? SOMs help reveal underlying patterns in high-dimensional data (like RGB color values) by organizing and visualizing them in an intuitive way.