Cosine Similarity
In the previous lesson, we learned that the dot product measures alignment between vectors. But there is a problem: the dot product is affected by the magnitude of the vectors, not just their direction. Two very long vectors pointing in the same direction will have a huge dot product, while two short vectors pointing in the same direction will have a small one. In AI, we often care about direction only. Cosine similarity solves this problem, and it is the standard way AI systems compare embeddings, documents, and data points.
The Problem with Raw Dot Products
Consider two document vectors representing word frequencies:
Short article: a = [1, 2, 1]
Long article: b = [10, 20, 10]
These vectors point in exactly the same direction (b is just 10 times longer), meaning both documents use words in the same proportions. But their dot product is:
a . b = (1*10) + (2*20) + (1*10) = 60
Compare that to two short documents with different content:
a = [1, 2, 1]
c = [2, 1, 2]
a . c = (1*2) + (2*1) + (1*2) = 6
The dot product with b is 10 times larger, but only because b is longer, not because it is more similar in meaning. We need a metric that ignores magnitude.
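To see the problem in runnable form, here is a minimal Python sketch (standard library only, using the vectors above) that computes both dot products:

```python
# Dot product: sum of element-wise products.
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

a = [1, 2, 1]     # short article
b = [10, 20, 10]  # long article, same direction as a
c = [2, 1, 2]     # short article, different content

print(dot(a, b))  # 60 -- big, but only because b is 10 times longer
print(dot(a, c))  # 6  -- small, even though the real difference here is content
```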
Cosine Similarity Defined
Cosine similarity normalizes the dot product by dividing by the magnitudes of both vectors:
cosine_similarity(a, b) = (a . b) / (|a| * |b|)
Where |a| = sqrt(a1^2 + a2^2 + ... + an^2) is the magnitude (length) of vector a.
This is equivalent to computing the cosine of the angle between the two vectors. The result tells you purely about direction, completely removing the effect of magnitude.
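Translated directly into code, the formula is only a few lines. This is a minimal sketch using just the standard library:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# The short and long articles from above now score as pointing the same way.
print(cosine_similarity([1, 2, 1], [10, 20, 10]))  # ~1.0
```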
The Range of Cosine Similarity
Because we are computing a cosine, the result always falls in a predictable range:
| Value | Meaning | Angle Between Vectors |
|---|---|---|
| 1 | Identical direction | 0 degrees |
| 0 | Completely unrelated | 90 degrees |
| -1 | Opposite direction | 180 degrees |
For vectors with only non-negative values (such as word counts or TF-IDF weights), cosine similarity falls between 0 and 1. Learned embeddings can contain negative components, so negative similarities are possible for them as well.
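A quick check (assuming NumPy is installed) confirms the three boundary cases:

```python
import numpy as np

def cos_sim(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos_sim([1, 0], [3, 0]))   #  1.0 -- same direction (0 degrees)
print(cos_sim([1, 0], [0, 2]))   #  0.0 -- perpendicular (90 degrees)
print(cos_sim([1, 0], [-5, 0]))  # -1.0 -- opposite direction (180 degrees)
```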
Worked Example: Comparing Document Vectors
Suppose we represent three documents by how often they mention key topics: [math, science, sports].
Doc A (math textbook): [8, 6, 0]
Doc B (science journal): [3, 9, 1]
Doc C (sports article): [0, 1, 7]
Similarity between Doc A and Doc B:
a . b = (8*3) + (6*9) + (0*1) = 24 + 54 + 0 = 78
|a| = sqrt(64 + 36 + 0) = sqrt(100) = 10
|b| = sqrt(9 + 81 + 1) = sqrt(91) = 9.54
cosine_similarity = 78 / (10 * 9.54) = 78 / 95.4 = 0.818
Similarity between Doc A and Doc C:
a . c = (8*0) + (6*1) + (0*7) = 0 + 6 + 0 = 6
|a| = 10
|c| = sqrt(0 + 1 + 49) = sqrt(50) = 7.07
cosine_similarity = 6 / (10 * 7.07) = 6 / 70.7 = 0.085
Doc A is very similar to Doc B (0.818) but nearly unrelated to Doc C (0.085). This matches our intuition: a math textbook is related to a science journal but not to a sports article.
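The same numbers fall out of a short NumPy check (assuming NumPy is installed):

```python
import numpy as np

doc_a = np.array([8, 6, 0])  # math textbook: [math, science, sports]
doc_b = np.array([3, 9, 1])  # science journal
doc_c = np.array([0, 1, 7])  # sports article

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cos_sim(doc_a, doc_b), 3))  # 0.818
print(round(cos_sim(doc_a, doc_c), 3))  # 0.085
```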
Why Normalization Matters
Dividing by the magnitudes is a form of normalization. It puts all vectors on equal footing regardless of their scale. This is critical in AI because:
- Document vectors vary in length based on document size, not content
- Embedding vectors can have different magnitudes depending on the model
- Feature vectors may have wildly different scales across dimensions
By normalizing, cosine similarity answers the question: "Do these vectors point in the same direction?" rather than "Are these vectors the same size?"
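One practical consequence: if every vector is scaled to unit length up front, cosine similarity reduces to a plain dot product, and many vector search systems rely on this by pre-normalizing embeddings. A small sketch (assuming NumPy):

```python
import numpy as np

def unit(v):
    """Scale v to length 1 so magnitude no longer matters."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = [1, 2, 1]
b = [10, 20, 10]

# For unit vectors, the dot product IS the cosine similarity.
print(unit(a) @ unit(b))  # ~1.0, no matter how long b was
```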
Cosine Distance
While cosine similarity measures how alike two vectors are, cosine distance measures how different they are:
cosine_distance = 1 - cosine_similarity
| Cosine Similarity | Cosine Distance | Meaning |
|---|---|---|
| 1.0 | 0.0 | Identical |
| 0.8 | 0.2 | Very similar |
| 0.5 | 0.5 | Somewhat related |
| 0.0 | 1.0 | Unrelated |
Many vector databases and search systems use cosine distance because they are optimized to find the nearest (smallest distance) neighbors.
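If SciPy is available, it ships this quantity directly as scipy.spatial.distance.cosine, which returns 1 minus the cosine similarity:

```python
from scipy.spatial.distance import cosine  # cosine *distance*, not similarity

doc_a = [8, 6, 0]
doc_b = [3, 9, 1]

distance = cosine(doc_a, doc_b)
print(distance)      # ~0.182 -- cosine distance
print(1 - distance)  # ~0.818 -- back to cosine similarity
```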
Cosine Similarity vs Euclidean Distance
Two common ways to compare vectors are cosine similarity and Euclidean distance. They answer different questions:
| Property | Cosine Similarity | Euclidean Distance |
|---|---|---|
| Measures | Direction (angle) | Absolute position (straight-line gap) |
| Affected by magnitude | No | Yes |
| Range | -1 to 1 | 0 to infinity |
| Best when | Comparing meaning or content | Comparing absolute quantities |
| Used for | Embeddings, NLP, recommendations | Clustering, image pixels, physical data |
Rule of thumb: Use cosine similarity when you care about what kind of thing something is. Use Euclidean distance when you care about how much of something there is.
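The contrast shows up clearly in code: scaling a vector leaves its cosine similarity untouched but moves it far away in Euclidean terms (a sketch assuming NumPy):

```python
import numpy as np

a = np.array([1.0, 2.0, 1.0])
b = a * 10  # same direction, 10x the magnitude

cos_sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cos_sim)    # ~1.0  -- direction is identical
print(euclidean)  # ~22.0 -- straight-line distance says they are far apart
```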
AI Context: The Standard for Comparing Embeddings
Cosine similarity is the default metric for comparing embeddings across nearly all of modern AI:
- Word embeddings: Word2Vec and GloVe use cosine similarity to measure word relatedness
- Sentence embeddings: Models such as Sentence-BERT produce vectors whose cosine similarity tracks how close two sentences are in meaning
- Search engines: Queries and documents are embedded as vectors, then ranked by cosine similarity
- Recommendation systems: User preferences and item features are compared via cosine similarity
- LLM retrieval: In retrieval-augmented generation, the passages handed to an LLM as context are chosen by cosine similarity between their embeddings and the query's embedding
When someone says "these two texts are 92% similar," they almost always mean the cosine similarity of their embedding vectors is 0.92.
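In practice, libraries compute these scores in bulk. For example, scikit-learn's cosine_similarity scores a query against every document in one call; the 4-dimensional vectors below are made-up stand-ins for real embeddings, which typically have hundreds of dimensions:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical embeddings -- real models would produce these from text.
query = np.array([[0.9, 0.1, 0.4, 0.2]])
documents = np.array([
    [0.8, 0.2, 0.5, 0.1],  # doc 0
    [0.1, 0.9, 0.2, 0.7],  # doc 1
    [0.7, 0.0, 0.6, 0.3],  # doc 2
])

scores = cosine_similarity(query, documents)[0]
ranking = np.argsort(scores)[::-1]  # most similar first

print(scores)   # one similarity score per document
print(ranking)  # [0 2 1] -- doc 0 is the best match, doc 1 the worst
```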
Summary
- The raw dot product is affected by vector magnitude, making it unreliable for comparing direction alone
- Cosine similarity = (a . b) / (|a| * |b|) removes the effect of magnitude
- The result ranges from -1 (opposite) to 0 (unrelated) to 1 (identical direction)
- Cosine distance = 1 - cosine similarity, used when you need a distance metric
- Choose cosine similarity for comparing meaning and Euclidean distance for comparing absolute quantities
- Cosine similarity is the standard metric for comparing embeddings in NLP, search, and recommendations
With cosine similarity in our toolkit, we can now measure how similar any two pieces of data are once they have been converted to vectors. In the next lesson, we will see how this simple operation powers real AI applications like semantic search, recommendation systems, and retrieval-augmented generation.

