Chapter 1: History
Unsupervised machine learning, a subfield of artificial intelligence, aims to discover patterns and structure in data without using labeled responses. (Labeled responses are data points with known outcomes or categories; in a dataset of animal images, for example, each image might be labeled "cat" or "dog." Unsupervised learning does not use these labels.) Its origins trace back to early work in statistics and pattern recognition — the classification of data based on statistical regularities, trends, or structures within datasets — in the mid-20th century.
- 1950s–1960s: Early clustering methods (e.g., k-means, developed by Stuart Lloyd in 1957) and principal component analysis (PCA), introduced by Karl Pearson (1901) and developed by Harold Hotelling (1933), laid the groundwork. Clustering groups data points so that points in the same cluster are more similar to each other than to points in other clusters; PCA reduces the number of variables in a dataset by transforming them into a new set of uncorrelated variables (principal components) that retain most of the variance.
- 1970s–1980s: Emergence of algorithms like hierarchical clustering, which builds a hierarchy of clusters by merging smaller clusters into larger ones (agglomerative) or splitting larger ones into smaller ones (divisive), and self-organizing maps (SOMs), a type of artificial neural network that clusters and visualizes high-dimensional data on a 2D grid while preserving topological relationships (Teuvo Kohonen, 1982).
- 1990s–2000s: Advances in density-based clustering (DBSCAN, 1996), Gaussian mixture models (GMMs), and manifold learning. Density-based clustering groups points that are closely packed (dense regions) and marks points in sparse regions as outliers; DBSCAN is a popular example. GMMs are probabilistic models that assume data points are generated from a mixture of several Gaussian distributions with unknown parameters, often used for clustering and density estimation. Example: record the heights of everyone at a theme park, children and adults alike; the overall distribution forms two overlapping bell curves, and a GMM can automatically discover the two groups by modeling the data as a combination of two Gaussians. (Illustration: the sum of two Gaussian curves, one for children and one for adults, models the overall distribution.) Manifold learning is a set of dimensionality-reduction techniques based on the idea that high-dimensional data often lies on a lower-dimensional manifold embedded in the higher-dimensional space. Example: a long, curly ribbon on a table looks like a twisty, two-dimensional shape from above; manifold learning "unrolls" the ribbon into a straight line so its structure can be studied more simply. (Illustration: a curly "S" shape in high dimensions is mapped to a straight line in low dimensions.)
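The theme-park example can be reproduced with a short expectation-maximization (EM) loop. The following is a minimal NumPy sketch, not a production GMM: the height data, the initial parameter guesses, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical theme-park heights in cm: children near 120, adults near 170.
rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(120, 8, 200), rng.normal(170, 9, 300)])

# Initial guesses for the two-component mixture.
mu = np.array([100.0, 180.0])      # component means
sigma = np.array([10.0, 10.0])     # component standard deviations
pi = np.array([0.5, 0.5])          # mixing weights

def gauss(x, m, s):
    """Gaussian density, broadcast over components."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each component for each point.
    r = pi * gauss(heights[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    n = r.sum(axis=0)
    mu = (r * heights[:, None]).sum(axis=0) / n
    sigma = np.sqrt((r * (heights[:, None] - mu) ** 2).sum(axis=0) / n)
    pi = n / len(heights)

print(sorted(mu))  # the two recovered means, near the child and adult averages
```

With reasonable starting values the loop settles on one mean near the children's average and one near the adults', exactly the "two overlapping bell curves" intuition above.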
- 2010s–Present: Unsupervised deep learning (autoencoders, GANs) and applications in big data, image, and text analysis. Deep learning uses neural networks with many layers to learn hierarchical representations of data; in unsupervised settings it includes models like autoencoders and GANs. Big data refers to extremely large, complex datasets that require advanced methods to store, process, and analyze; unsupervised learning helps find patterns in them. Example: a deep model can learn to represent handwritten digits in layers: early layers detect simple features like lines and curves, while deeper layers combine these into digit shapes, building complex structure by stacking many simple pieces. (Illustration: input features are transformed through multiple hidden layers, creating increasingly complex representations.)
Chapter 2: Major Contributors
- Karl Pearson — Introduced Principal Component Analysis (PCA) in 1901.
- Stuart Lloyd — Developed Lloyd's algorithm (k-means) in 1957. (Illustration: k-means finds group centers, shown as stars, and assigns points, shown as dots, to the nearest group.)
- Teuvo Kohonen — Creator of Self-Organizing Maps (SOMs) in 1982.
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu — Developed DBSCAN in 1996. (Illustration: DBSCAN groups dense regions and marks outliers, shown as crosses, as noise.)
- Geoffrey Hinton — Advanced unsupervised deep learning via autoencoders and deep belief nets (DBNs). (Illustration: an autoencoder compresses data through a bottleneck and reconstructs it; a deep belief net stacks multiple restricted Boltzmann machines for hierarchical feature learning, receiving raw input such as pixel values at its input layer.)
Chapter 3: Major Algorithms
- k-means Clustering: Partitions data into k clusters by minimizing within-cluster variance.
- Hierarchical Clustering: Builds a hierarchy of clusters using agglomerative or divisive approaches.
- Principal Component Analysis (PCA): Reduces data dimensionality by finding new axes (principal components) that retain most of the data variance.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups data points that are closely packed together, marking outliers as noise.
- Gaussian Mixture Models (GMM): Models data as a mixture of multiple Gaussian distributions.
- Autoencoders: Neural networks that learn to encode data efficiently, mainly used for dimensionality reduction or feature learning.
- Self-Organizing Maps (SOM): Map similar data points to nearby cells on a 2D grid, as in the color example below.

Visualizing Colors with Self-Organizing Maps (SOM)

How Self-Organizing Maps (SOM) Work
- Initialize a 2D grid of "neurons" (each with a weight vector, e.g. RGB color).
- For each data point (e.g., a color), find the "Best Matching Unit" (BMU) — the neuron whose weight is closest to the data point.
- Update the BMU and its neighbors to move closer to the data point (using a learning rate and neighborhood function).
- Repeat for many iterations, gradually reducing the learning rate and neighborhood size.
- Result: Similar colors are mapped to nearby neurons on the grid, forming smooth color regions.
Let's see a simple example.

Step 1: Fake Color Dataset
We'll use 25 random colors (as RGB triples) to mimic a palette: [ [237, 50, 55], [ 40, 230, 90], ... , [ 55, 65, 210] ]

Step 2: Organize on a 2D Grid
We'll lay out a 5×5 grid (25 neurons) and use a simple SOM algorithm so that each neuron learns to represent one of the input colors. Neighboring neurons on the grid will end up with similar colors.

Step 3: Result
The grid shows how the SOM has organized the original colors into a smooth map: each cell shows a learned color, and neighboring cells have similar colors, revealing natural groupings.
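The SOM steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a tuned implementation: the palette is 25 random colors rather than the exact values listed, and the learning-rate and neighborhood schedules are simple hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)
colors = rng.integers(0, 256, size=(25, 3)).astype(float)  # fake palette

# 5x5 grid of neurons, each holding an RGB weight vector.
grid = rng.integers(0, 256, size=(5, 5, 3)).astype(float)
ys, xs = np.mgrid[0:5, 0:5]                  # neuron grid coordinates

n_iters, lr0, rad0 = 2000, 0.5, 2.5
for t in range(n_iters):
    c = colors[rng.integers(len(colors))]    # pick a random input color
    # Best Matching Unit: the neuron whose weight is closest to c.
    d = ((grid - c) ** 2).sum(axis=2)
    by, bx = np.unravel_index(d.argmin(), d.shape)
    # Learning rate and neighborhood radius shrink over time.
    frac = 1 - t / n_iters
    lr, rad = lr0 * frac, max(rad0 * frac, 0.5)
    # Gaussian neighborhood around the BMU on the grid.
    dist2 = (ys - by) ** 2 + (xs - bx) ** 2
    h = np.exp(-dist2 / (2 * rad ** 2))
    grid += lr * h[:, :, None] * (c - grid)  # pull BMU and neighbors toward c
```

After training, adjacent cells of `grid` hold similar RGB values, which is exactly the "smooth color map" effect described in Step 3.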
Chapter 4: Easy-to-Understand Examples
Suppose you have a list of customers and you want to group them based on their spending patterns. The k-means algorithm is a simple, popular way to do this.
1. Choose k: Decide how many clusters (k) you want to find.
2. Initialize Centers: Place k cluster centers randomly in your data space.
3. Assign Points: Assign each customer to the nearest cluster center.
4. Update Centers: Move each center to the mean of all customers assigned to it.
5. Repeat: Repeat steps 3–4 until the clusters stop changing.
K-means finds structure in your data, even if you don't know the groups ahead of time.
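The five steps translate almost line-for-line into code. A minimal NumPy sketch, using hypothetical customer data with two obvious spending groups (the features and group locations are invented for illustration):

```python
import numpy as np

# Hypothetical customers: [annual spend, visits per month], two clear groups.
rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal([20, 2], 1.5, size=(30, 2)),   # low spenders
    rng.normal([80, 10], 1.5, size=(30, 2)),  # high spenders
])

k = 2                                                       # step 1: choose k
centers = customers[rng.choice(len(customers), k, replace=False)]  # step 2
for _ in range(100):
    # Step 3: assign each customer to the nearest center.
    dists = ((customers[:, None] - centers) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each center to the mean of its assigned customers
    # (keep the old center if a cluster happens to be empty).
    new_centers = np.array([
        customers[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    if np.allclose(new_centers, centers):                   # step 5: converged
        break
    centers = new_centers
```

With well-separated groups like these, the loop converges in a handful of iterations and `centers` ends up near the true group means.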
Another common task is reducing the dimensionality of such data with PCA. The main steps:
- Standardize the data: Subtract the mean and divide by the standard deviation for each feature.
- Compute the covariance matrix: Measure how variables vary together.
- Calculate eigenvectors and eigenvalues: Find the directions (principal components) that capture the most variance.
- Select top components: Keep the components with the largest eigenvalues (e.g., the top 2).
- Project the data: Transform the original data onto the selected principal components (lower dimensions).
Imagine you have 12 samples (customers), each with 100 features (spending in 100 product categories).
For demonstration, here are the first 3 samples (real PCA would use all 100!):
Sample 2: [2.0, -1.5, 0.8, ... , 1.0]
Sample 3: [1.9, -1.2, 0.3, ... , 0.9]
...
For each feature (column), subtract the mean and divide by the standard deviation.
The covariance matrix shows how features vary together.
Find the directions (principal components) that capture the most variance.
Suppose the top 2 eigenvectors are:
PC2: [0.05, 0.14, ..., 0.07]
Multiply your standardized data by the principal component vectors:
PC1 and PC2 are new axes that capture the most variance.
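The same five PCA steps in NumPy. For brevity this sketch uses 12 samples with 5 hypothetical features instead of the 100 categories described above; the data is random and purely illustrative.

```python
import numpy as np

# Hypothetical data: 12 samples x 5 features (stand-in for 100 categories).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))

# Step 1: standardize each feature (zero mean, unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features.
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)   # eigh is for symmetric matrices

# Step 4: keep the two components with the largest eigenvalues.
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Step 5: project the data onto the top two principal components.
projected = Z @ top2
print(projected.shape)  # (12, 2)
```

The first column of `projected` (PC1) captures at least as much variance as the second (PC2), by construction.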
DBSCAN can also be used to detect anomalies. The steps:
- Choose parameters: Set eps (the radius of a point's neighborhood) and minPts (the minimum number of neighbors needed to form a cluster).
- Classify points: For each point, count how many others lie within eps distance.
  - If at least minPts neighbors: core point (part of a cluster)
  - If not a core, but reachable from a core: border point
  - If not reachable: noise (anomaly)
- Expand clusters: Connect core points and their neighbors to form clusters.
- Result: Points not assigned to any cluster are considered anomalies.
Each point is a network session with 2 features: duration (x) and bytes transferred (y).
[7.0, 2.2], [7.3, 2.1], ... (another pattern)
[5.5, 13.5] (anomaly)
eps = 1.2, minPts = 3
Count how many other points are within eps distance.
Example: Point [2.2, 7.9] has 3 close neighbors ⇒ core point.
[5.5, 13.5] has no close neighbors ⇒ noise (anomaly).
Connect core and border points together. Any point not connected is an anomaly.
Points not assigned to any cluster are flagged as anomalies.
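A minimal hand-rolled DBSCAN over hypothetical session data in the spirit of this example. The exact point values here are invented to match the described pattern (two dense groups plus the [5.5, 13.5] outlier); they are not the original dataset.

```python
import numpy as np

# Hypothetical sessions [duration, bytes]; [5.5, 13.5] is the anomaly.
pts = np.array([
    [2.0, 8.0], [2.2, 7.9], [2.1, 8.2], [2.4, 8.1],  # pattern 1
    [7.0, 2.2], [7.3, 2.1], [7.1, 2.4], [6.9, 2.0],  # pattern 2
    [5.5, 13.5],                                      # anomaly
])
eps, min_pts = 1.2, 3

# Neighborhoods: indices of points within eps of each point (self included).
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
neighbors = [np.flatnonzero(row <= eps) for row in d]
core = [len(n) >= min_pts for n in neighbors]

labels = [-1] * len(pts)            # -1 = noise (anomaly)
cluster = 0
for i in range(len(pts)):
    if not core[i] or labels[i] != -1:
        continue
    labels[i] = cluster             # grow a new cluster from this core point
    frontier = list(neighbors[i])
    while frontier:
        j = frontier.pop()
        if labels[j] == -1:
            labels[j] = cluster
            if core[j]:             # only core points extend the cluster
                frontier.extend(neighbors[j])
    cluster += 1

print(labels)  # the lone [5.5, 13.5] point keeps the noise label -1
```

The two dense groups each become a cluster, while the isolated point has too few neighbors to be a core point or a border point and is flagged as an anomaly.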
A final example: compressing an image with an autoencoder. The steps:
- Input Layer: Each pixel of the image is an input node.
- Encoder: Compress the input into a small set of numbers (the bottleneck).
- Decoder: Expand the bottleneck back into a full-size image.
- Training: Adjust the network so the output image is as close as possible to the input image.
- Compression: After training, the bottleneck representation is a compressed version of the original image.
Suppose you have a 6×6 grayscale image (each value is a pixel brightness from 0 to 1):
[[0, 0, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 0],
[1, 1, 0.8, 0.8, 1, 1],
[1, 1, 0.8, 0.8, 1, 1],
[0, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 0, 0]]
How does the compression work?
The encoder learns to summarize the important structure of the image using only a few numbers.
In this example, we split the image into 4 quadrants (each 3×3 block):
- Quadrant 1 (top-left): upper-left 3×3 pixels
- Quadrant 2 (top-right): upper-right 3×3 pixels
- Quadrant 3 (bottom-left): lower-left 3×3 pixels
- Quadrant 4 (bottom-right): lower-right 3×3 pixels
bottleneck[0] = average(Quadrant 1) = 0.64
bottleneck[1] = average(Quadrant 2) = 0.64
bottleneck[2] = average(Quadrant 3) = 0.64
bottleneck[3] = average(Quadrant 4) = 0.64
In a real autoencoder, these would be learned values, not just averages—but the idea is the same: compress the information down to just the essentials.
How does the reconstruction work?
The decoder takes the 4 bottleneck values and "spreads" each value back over its corresponding 3×3 quadrant. So, every pixel in each quadrant gets set to the same value as its bottleneck:
- All pixels in Quadrant 1 become 0.64
- All pixels in Quadrant 2 become 0.64
- All pixels in Quadrant 3 become 0.64
- All pixels in Quadrant 4 become 0.64
(Compression: 36 → 4 values!)
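The quadrant-averaging scheme above can be written out directly. Note that this sketch implements the fixed averaging rule described in the example, not a trained neural network; a real autoencoder would learn its encoder and decoder weights.

```python
import numpy as np

# The 6x6 grayscale image from the example.
img = np.array([
    [0, 0, 1,   1,   0, 0],
    [0, 1, 1,   1,   1, 0],
    [1, 1, 0.8, 0.8, 1, 1],
    [1, 1, 0.8, 0.8, 1, 1],
    [0, 1, 1,   1,   1, 0],
    [0, 0, 1,   1,   0, 0],
])

def encode(x):
    # Bottleneck: the mean of each 3x3 quadrant (36 values -> 4 values).
    return np.array([x[r:r + 3, c:c + 3].mean() for r in (0, 3) for c in (0, 3)])

def decode(b):
    # Spread each bottleneck value back over its 3x3 quadrant.
    out = np.zeros((6, 6))
    for v, (r, c) in zip(b, [(0, 0), (0, 3), (3, 0), (3, 3)]):
        out[r:r + 3, c:c + 3] = v
    return out

bottleneck = encode(img)
print(bottleneck.round(2))   # [0.64 0.64 0.64 0.64], matching the text
recon = decode(bottleneck)   # blocky 6x6 reconstruction
```

The reconstruction is blocky because each quadrant collapses to a single number, which is the price paid for the 36 → 4 compression.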
Imagine you have a palette of hundreds of colors. You want to organize them so that similar colors are close together on a map. A Self-Organizing Map (SOM) can take all these colors and arrange them on a 2D grid, where neighboring cells represent similar colors. This helps you see color families and smooth gradients at a glance, making it easy to navigate and choose colors.
Why is this helpful? SOMs help reveal underlying patterns in high-dimensional data (like RGB color values) by organizing and visualizing them in an intuitive way.