SVD Explorer — Learn Singular Value Decomposition

The problem: a movie recommendation system

You're building the engine behind "movies you'll love" — a recommendation system. Millions of people, thousands of movies, one giant table of ratings — except almost every cell is blank (nobody watches everything). It all reduces to one task: rating prediction — guess the blanks. Predict how Dana would rate a film she's never seen, then recommend the ones she'd score highest.

Sounds hopeless — too many blanks, too much noise. But real taste isn't random: it runs along a few hidden patterns (action-lover? hopeless romantic? both?). Find those patterns and every person + movie collapses to a handful of numbers — and the blanks fill themselves in.

SVD is the machine that finds those patterns. Below is a tiny 6-person, 5-movie version so you can watch the whole pipeline work — data → patterns → predictions → recommendation.

The data: who rated what

ratings table (brighter gold = higher rating)

Eyeball it: top 3 people love the action movies, bottom 3 love the romance movies. There's a structure hiding in here. SVD finds it without being told.

SVD ranks the patterns by strength (σ)

Run SVD on the table. The singular values come out like this — two big, the rest tiny:

Two tall bars = two real patterns carry almost all the information. The short ones are just noise — barely any height at all.

So what IS a "pattern"?

A pattern is one taste axis — a single left-to-right line. SVD drops both every movie and every person onto that one line. The strongest informative one here is romance ↔ action, found with zero labels. Here's where each lands:

one shared axis — ◆ movies above the line, ● people below · gold = action end, blue = romance end

The whole point in one picture: action movies and action fans pile on the same side (right); romance movies and romance fans share the other (left). That lining-up is the pattern — and it's why your spot predicts your taste: you'll like the movies sitting near you. (Pattern 1, the strongest of all, is just "how much you rate things overall" — less interesting, so we show pattern 2.)

How you actually USE it Every person and every movie is now just 2 numbers — their scores on pattern 1 & 2.
predicted rating(person, movie) = combine those 2 numbers.
→ score a movie someone hasn't seen → that's a recommendation engine.
→ the sign of the score sorts people into camps and movies into genres, automatically.

Proof: rebuild from only k patterns

Keep the top k patterns, drop the rest, rebuild the whole table. At k=2 it snaps right back:

keep top 2 patterns

rebuilt (k patterns)

original

match: —

That's the entire point. 30 numbers → 2 patterns, almost no loss.

🎬 The payoff: fill the blanks → recommend

Back to the goal. Each person is now just 2 numbers (their pattern scores); each movie too. Multiply them → a predicted rating for every pair, including films they never watched. Pick a person and see what SVD recommends:

SVD's recommendation

Real systems run this on millions × millions with >99% of cells blank; a few hundred patterns fill them all. That's the core of Netflix, Spotify, Amazon. The next tabs show how SVD finds the patterns that make it work.

📍 The same recommendation, as a picture

Those "2 numbers" are coordinates. Drop every person and every movie as a point on one map — same space, same axes. Now a predicted rating has a shape: it's how far a movie reaches along your arrow. The movie reaching furthest your way is the pick; movies pointing the opposite way are the ones you'd hate.

● people◆ movies

Pick a different person above and watch the arrow swing. Close together = similar taste. The giants use this exact map — just millions of points and a few hundred axes instead of 2.

Where this shows up

Two everyday things, one piece of math. A vector is just a short list of numbers describing something real — a movie scored (action, romance), a screen pixel at (x, y), a person's (height, weight). A matrix is a rule that transforms them — literally what happens when you rotate, scale, or skew an image in Photoshop: every pixel coordinate gets multiplied by a 2×2 matrix. Nail "matrix × vector" here and the rest of the app is downhill.

Start here: a matrix is a machine that moves vectors

Everything in this app builds on one idea. A vector is an arrow. A matrix A takes an input arrow x and hands back an output arrow A·x — usually pointing somewhere else, usually a different length. That input→output move is the one thing SVD is about.

Dead-simple example A = 2003 (double x, triple y), x = 11
A·x = 2·1 + 0·10·1 + 3·1 = 23
→ the arrow got wider and taller. That's it.

Each output number = one row of A "dotted" with x (multiply matching entries, add). Drag the arrow below and watch it happen.

Play: drag x, watch A·x

Here A = [ 2 1 ; 1 2 ]. Drag the green arrow. The orange arrow is A·x. Press play to see each output coordinate get built from a row.

green = x (drag me) · orange = A·x

That's the only operation in this whole subject. Eigenvectors and SVD are just questions about this motion.

▶ The multiplication itself, highlighted

Forget arrows for a second — here's the raw arithmetic of A·x = b, with the exact cells lighting up as each number is used. This is all "matrix times vector" ever means.

Press play to walk through it.

Where this shows up: Google's PageRank

Google's original algorithm ranked the entire web by finding one special vector. Start with any guess of page importance, repeatedly apply "who links to whom," and it settles onto a single stable vector that stops changing — that's an eigenvector of the web's link matrix, and it became the search ranking. The same idea ranks sports teams, models populations, and finds a bridge's natural vibration modes. The thread running through all of them: a direction a transformation leaves pointing the same way.

Eigenvectors: the directions A doesn't rotate

From Tab 1: A usually rotates AND resizes an arrow — it swings it to a new angle and changes its length. But a few special directions get only resized — A keeps them on the same line (it may stretch, shrink, or flip them 180°, but never swings them off their line). Those are eigenvectors. The resize factor is the eigenvalue λ.

A v = λ v ("A on v just scales v — same line, no rotation")

Dead-simple example A = 2003
v = (1, 0): A·v = (2, 0) = 2·(1,0) ✓ same line! eigenvector, λ=2
v = (0, 1): A·v = (0, 3) = 3·(0,1) ✓ same line! eigenvector, λ=3
v = (1, 1): A·v = (2, 3) ✗ NOT a multiple of (1,1) → not an eigenvector

Play: hunt for them by dragging

Drag the green arrow v in a circle. Orange is A·v. Most directions: they disagree. Line them up on the same line → eigenvector found, it locks & glows.

green = v (drag me) · orange = A·v

Try rotation: A·v always turns 90°, so it's never parallel to v → no real eigenvectors. Some matrices just don't have them. That gap is exactly why SVD exists.

You already saw what SVD does

On Tab 0, SVD pulled 2 taste patterns out of a messy ratings table and rebuilt it almost perfectly. A = UΣVᵀ is just the general recipe behind that — it runs on any matrix and hands you its patterns, ranked strongest-first. This tab unpacks what the three letters actually are.

The one equation

A = U Σ V^T

Eigenvectors were fragile — not every matrix has them. SVD is the bulletproof version that works on any matrix (even non-square). Every matrix turns an input vector into an output vector, and SVD says that trip is always the same three moves:

V^T — a rotation of the input, possibly with a flip (pick the right axes to look along).
Σ — a pure stretch. Diagonal, ≥ 0. The singular values σ₁ ≥ σ₂ ≥ … say how much each axis scales.
U — a rotation of the output, possibly with a flip (where the stretched axes land).

Every input→output map = rotate the input → stretch → rotate into the output. SVD finds those axes. See it move in Tab 4.

⚠ Don't mix up two different "inputs" Easy trap: thinking you feed a vector into SVD. You don't.

• The matrix A is the thing that eats a vector → spits out a vector (lengths differ if A isn't square).
• SVD eats the whole matrix A → spits out the three pieces U, Σ, Vᵀ. No vector goes in.
• The "fewer numbers" payoff (Tab 0, Tab 6) is a third thing — keep only the top patterns and re-describe each row with a handful of scores. That's a choice you make after SVD, not SVD itself.

Simplest possible SVD

First, the star of the show: σ ("sigma") is a singular value = a stretch factor along one axis. Read it like a dial: σ > 1 stretches that direction, σ = 1 leaves it unchanged, σ < 1 shrinks it, σ = 0 flattens it to nothing. The matrix Σ is just those dials on its diagonal. That's all Σ ever is.

A diagonal matrix is already done A = 3002 → U = I, Σ = 3002, Vᵀ = I
→ singular values are just 3 and 2. No rotation needed.

A pure rotation has all σ = 1 A = 0-110 (turn 90°)
→ σ₁ = σ₂ = 1. It rotates but never stretches, so nothing shrinks.

The interesting matrices are in between — they rotate and stretch by different amounts. SVD untangles the two.

How SVD always works — even when eigenvectors don't exist

Tab 2 showed the catch: some matrices have no eigenvectors at all (the pure rotation had none). So we can't always lean on them directly.

SVD's trick: don't study A by itself — pair it with its mirror image first. That paired-up matrix is always well-behaved: it always has a full set of clean, perpendicular directions, no matter how nasty A is.

SVD borrows those directions and calls them V (the input axes) and U (the output axes), and reads the stretch amounts σ off the same step. You never do this by hand — the solver does it for you.

The one takeaway: SVD always exists, because it stands on a fixed-up matrix that simply can't misbehave.

the exact statement (only if you want it)

V = eigenvectors of AᵀA · U = eigenvectors of AAᵀ · σ = √(those eigenvalues).
("AᵀA" is the mirror-image pairing; it's symmetric, which is what guarantees clean perpendicular eigenvectors.)

Why SVD works: the ellipse fingerprint

Here's the picture under everything. Take the unit circle — every input direction of length 1. Any matrix maps that circle to an ellipse, and the ellipse is the matrix's fingerprint: its longest axis is the strongest pattern (the biggest σ, pointing in its direction), the next axis is the next pattern. "Find the patterns" from Tab 0 is literally "find the axes of this ellipse." Drag the matrix below and watch the fingerprint change.

Watch a matrix bend space

Edit 2×2 matrix A. Left is the input: the unit circle + grid. Right is the output: where A sends every input direction — always an ellipse. The colored arrows are its axes (the legend names them).

Matrix A

V (input axes)Uσ (output axes)

Decomposition

input space

output space (A·x)

Try singular: σ₂ → 0, circle collapses to a line (rank 1). Try rotate: σ₁=σ₂=1, circle stays a circle.

▶ Watch A = U Σ Vᵀ happen, step by step

Same A as above. Dots colored by starting angle so you can see rotations. Play: rotate by Vᵀ → stretch by Σ → rotate by U → lands on A·x. The dashed outline is the real A·(circle) — the dots land exactly on it at step 3, proving the 3 steps = A.

Press play.

scrub manually

The decoder ring for the notation: U & V only rotate — possibly with a flip — never resize; Σ is the only stretch. Three letters, three moves.

the matrix being applied lights up as the animation runs:

The question: how many patterns is a matrix really made of?

On Tab 0, two patterns rebuilt a 30-number table almost perfectly. That trick raises the question this tab answers: how many patterns does a matrix genuinely contain, and how much do you lose by dropping the rest? Two words make the answer precise.

Rank = the number of genuinely independent patterns in a matrix — i.e. how many singular values aren't ≈ 0. A table stuffed with numbers that's really "2 tastes + noise" has rank 2. Low rank = lots of redundancy = squeezable. How to find it: run SVD and count the σ's that aren't tiny — look for the cliff in the bars below.

Energy — "how much of the matrix" your kept patterns explain. Keep the patterns holding 99% of the energy and you toss only 1% (the noise) for a big size cut. That one trade-off is rank, energy, compression, and denoising — all the same move.

A is a sum of rank-1 layers

A = σ₁ u₁v₁^T + σ₂ u₂v₂^T + … (each piece = one pattern × its strength σ)

SVD restacks A as a pile of simple patterns, strongest first. Keep the first k and drop the rest → the best possible rebuild you can make with k patterns (provably — nothing with k patterns beats it). The bars below are the singular values of a random 6×6 — pick how many to keep.

Keep top 6 singular values

Original A

Rank-k reconstruction

Energy captured: 100% · Difference from original: 0.00

▶ See the layers stack up (as colors)

Numbers are hard to read. Same matrices as colors (gold = positive, blue = negative). Each σₖ uₖvₖᵀ is one outer product — the simplest matrix there is. Slider above drives this too.

Press "add next layer" to stack them one at a time.

layer added:
σₖ·uₖvₖᵀ

+…=

running sum
(rank-k)

→

target
A

🎵 If you know Fourier series — it's the same idea

A Fourier series builds a signal from a sum of simple waves, biggest first. SVD builds a matrix from a sum of simple rank-1 layers, biggest σ first. Same move:

	Fourier series	SVD
the thing	a signal	a matrix
building block	a sine/cosine wave	a rank-1 layer u_kv_kᵀ
its weight	amplitude	singular value σ_k
keep k terms	smooth approximation	best rank-k matrix
"energy"	Σ amplitude²	Σ σ²

Each layer bumps the rank by 1 — exactly like adding one more harmonic to a Fourier partial sum.

The twist: Fourier uses fixed, pre-chosen waves — the same set for every signal. SVD learns its own building blocks from the data, so its truncation is the best one possible for that exact matrix. In a line: SVD is a Fourier series that discovers its own waves.

▶ Fourier in motion — adding harmonics = adding layers

Each spinning circle is one wave (one "layer"). Their stacked tip (pink) traces the curve on the right (green). Add harmonics → the curve snaps toward a sharp square wave — the exact same way adding rank-1 layers snaps toward A.

harmonics / layers = 2

1 harmonic = a lone sine (rank-1). Crank it: each smaller circle sharpens the corners a little more — and just like SVD's σ's, the early ones do nearly all the work.

Compress a real image with SVD

This is the exact same move as the Netflix table (Tab 0) — because a grayscale image is a matrix, where each number is a pixel's brightness. SVD it, keep the top-k patterns, reconstruct. A 1000×1000 photo is a million numbers, yet often only ~30 patterns make it look right. Quality vs. storage:

Rank k = 20

original (rank 64)

rank-k

Stored: — vs original — · Compression: — · Energy kept: —

Each pattern you keep costs a little storage; each one you drop saves it (watch the numbers above). Even a handful keeps the image recognizable — most of what your eye needs lives in the first few patterns.

📖 Every term on this page, in plain words

Skim it once, then come back whenever a symbol stops making sense. Nothing here assumes the others.

The basics

Scalar: A single number. Like 3.
Vector: A list of numbers = an arrow in space. (2, 3) is an arrow 2 right, 3 up.
Matrix: A grid of numbers. Also: a machine that moves vectors (Tab 1). That dual meaning is the whole game.
Dimension (m×n): Size of a matrix: m rows, n columns. A 6×5 matrix has 6 rows, 5 columns.
A·x (matrix × vector): Apply the machine A to arrow x → a new arrow. Each output = one row of A "dotted" with x.
Dot product: Multiply two lists entry-by-entry, add up. (2,1)·(3,1) = 2·3+1·1 = 7.
Transpose (ᵀ): Flip a matrix over its diagonal — rows become columns. Vᵀ = "V flipped."
Identity (I): The "do-nothing" matrix. I·x = x. 1's on the diagonal, 0 elsewhere.
Diagonal matrix: Only the diagonal is nonzero. Acts as a pure stretch — no rotation.

Eigen- & singular things

Eigenvector: A direction that A only stretches, never turns (Tab 2). Most matrices have a few; some have none.
Eigenvalue (λ): How much that eigenvector gets stretched. A·v = λ·v.
Singular value (σ): A stretch factor along one of SVD's axes. Always ≥ 0. The bulletproof version of an eigenvalue — every matrix has them. σ>1 stretches, σ=1 no change, σ=0 flattens.
Orthonormal / rotation: A set of directions that are all perpendicular and length 1. A matrix built from them just rotates space, possibly with a flip (never resizes). U and V are these.
U, Σ, V: The three pieces of SVD: Vᵀ rotates the input, Σ stretches by the σ's, U rotates the output. A = UΣVᵀ.

The big ideas

Rank ★: The number of genuinely independent patterns in a matrix = how many singular values are nonzero. A 6×5 ratings table that's really "2 tastes + noise" has rank 2, even though it's full of numbers. Low rank = lots of redundancy = squeezable. How to find it: run SVD and count the singular values that aren't ≈0 — on Tab 0 you see 2 big and 3 tiny, so rank 2. (This is what Tab 5 is named after.)
Outer product / rank-1: One column times one row → a full matrix that's the simplest kind (rank 1). σ·u·vᵀ is one of these — a single pattern.
Low-rank approximation: Keep only the top-k singular values → the best possible rebuild of A using k patterns (Tabs 5 & 6). The core use of SVD.
Energy: Σσᵢ². "How much of the matrix" your kept patterns explain. 99% energy in k=2 means 2 patterns is plenty.
Condition number: σ₁ / σ_last. Big = the matrix nearly flattens some direction → unstable to invert.
PCA: "Principal Component Analysis" — SVD applied to centered data to find its main axes of variation. Same machinery.

Where this landed

This page opened with blanks in a ratings table, and every tab since has been the same move in a different costume: hand a big messy matrix to SVD, keep the strongest patterns, throw away the rest. In 1990, a group at Bellcore led by Scott Deerwester ran exactly that move on a word-by-document count matrix and noticed something remarkable — the compressed coordinates captured meaning. Words used in similar contexts landed near each other, even when they never appeared in the same document. They called it Latent Semantic Analysis, and its coordinates were, in everything but name, the first word embeddings.

That is the thread that runs forward to today: represent a thing — a movie, a person, a word — as its coordinates along a few learned pattern axes, and let geometry do the reasoning. word2vec, GloVe, and the embedding layers inside modern LLMs all stand on that idea. The full story is told in How LLMs Came to Be — this page is the math underneath its LSA chapter.