Foundations of AI: From Vectors to Neural Networks

Disclaimer: I am still learning some of the concepts covered in this article, and it was written with the help of AI. As a result, it may contain inaccuracies or mistakes. If you spot any errors, feel free to reach out so I can correct them.

Modern artificial intelligence is built on a mathematical foundation that might seem intimidating at first, but whose core concepts are surprisingly accessible. This comprehensive guide will take you through each building block, showing how they connect and culminate in the sophisticated architectures powering today’s AI systems.

The Journey in Brief

Imagine starting with a simple language for representing information—vectors, which are just ordered lists of numbers, and matrices, which transform these lists from one form to another. These mathematical objects became our toolkit for representing everything: images as pixel vectors, words as meaning vectors, and entire sequences as matrix transformations. Each dot product between vectors tells us how similar two things are—like comparing “cat” and “dog”—and each matrix multiplication reshapes information, letting us build increasingly sophisticated transformations step by step.

With this foundation, we discovered how to make these mathematical systems learn by themselves. Neural networks emerged as chains of these transformations, where each layer learns increasingly abstract patterns: early layers might detect edges, while later layers recognize entire objects. The magic of learning happens through gradients—mathematical compasses that point in the direction of improvement. By following these gradients downward (gradient descent), the network continually adjusts its millions of parameters to reduce its errors, gradually discovering patterns no human could explicitly program.

But there was still a missing piece: how to represent discrete symbols like words in a way that captured their meaning. This led to embeddings—learning to map words into dense vectors where similar meanings cluster together in space. Suddenly, arithmetic on words revealed relationships: “king” minus “man” plus “woman” equals approximately “queen.” This breakthrough unlocked attention mechanisms, which let models dynamically focus on different parts of their input when producing each piece of output—like paying attention to “cat” when translating “gato,” rather than trying to remember everything in one fixed summary.

Finally, attention evolved into transformers, which revolutionized AI by processing entire sequences in parallel through self-attention. Each position queries the others (“what information do I need?”), receives responses weighted by relevance, and combines them into rich, context-aware representations. This elegant architecture—built entirely from the simple mathematics of vectors, matrices, gradients, and attention—powers today’s most capable AI systems, from language models that write poetry to image generators that create art. The journey from basic linear algebra to attention-based transformers reveals a beautiful story: profound capabilities emerging from carefully composed mathematical foundations.

We’ll progress through four phases:

Mathematical Foundations: Linear algebra, probability, and calculus
Optimization: Loss functions and gradient descent
Neural Networks: Perceptrons, activations, and backpropagation
Advanced Architectures: Attention, transformers, and embeddings

By the end, you’ll have a complete understanding of how these pieces fit together to create systems that can learn from data and perform remarkable tasks.

Phase 1: Introduction

Modern AI and deep learning are built on a foundation of mathematics that might seem intimidating at first, but the core concepts are surprisingly accessible. This series will guide you through the mathematical foundations, showing how each piece connects to the next, until we arrive at the sophisticated architectures powering today’s AI systems.

Phase 1: The Mathematical Language

Notation and Vocabulary

Before diving into the math, let’s establish a common vocabulary. In machine learning and neural networks, you’ll frequently encounter:

Scalars: Single numbers (e.g., $x = 5$ )
Vectors: Ordered lists of numbers, denoted as $\mathbf{v} = [v_1, v_2, \dots, v_n]$
Matrices: 2D arrays of numbers, denoted as $\mathbf{A}$
Functions: Mappings from inputs to outputs, $f: \mathbb{R}^n \to \mathbb{R}^m$
Parameters: Values we learn during training (weights, biases)

This notation becomes our shared language for describing computations in neural networks.

Linear Algebra: Vectors and Matrices

Linear algebra is the workhorse of machine learning. Every computation in a neural network can be expressed as operations on vectors and matrices.

Vectors represent data points. For example, an image with 784 pixels can be represented as a vector $\mathbf{x} \in \mathbb{R}^{784}$ .

Matrices represent transformations. A weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ transforms an $n$ -dimensional input into an $m$ -dimensional output.

Key operations:

Vector addition: $\mathbf{u} + \mathbf{v} = [u_1+v_1, u_2+v_2, \dots]$
Scalar multiplication: $c\mathbf{v} = [cv_1, cv_2, \dots]$
Dot product: $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i$
Matrix multiplication: $(\mathbf{A}\mathbf{B})_{ij} = \sum_k A_{ik} B_{kj}$

The dot product is particularly important—it measures similarity between vectors and is the fundamental operation in neural networks.
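The operations above map directly onto NumPy. The following is a minimal sketch; the example vectors and matrices are arbitrary values chosen for illustration, and the cosine similarity at the end is one common way to turn a dot product into a length-independent similarity score.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

vec_sum = u + v      # vector addition: [5, 7, 9]
scaled = 2.0 * v     # scalar multiplication: [8, 10, 12]
dot = u @ v          # dot product: 1*4 + 2*5 + 3*6 = 32

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
AB = A @ B           # matrix multiplication; this B swaps the columns of A

# Cosine similarity: the dot product divided by both vector lengths
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))

print(vec_sum, scaled, dot)
print(AB)
print(round(cosine, 4))
```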

Why it matters: Neural networks transform data through sequences of linear transformations (matrix multiplications). Understanding these operations helps us understand what the network is learning.

Basic Probability: Random Variables and Distributions

Neural networks operate in a world of uncertainty. Probability theory gives us tools to model and reason about this uncertainty.

Random Variables: Variables whose values depend on random outcomes. For example, $X$ could represent the pixel values in an image, which vary randomly based on the image content.

Distributions: Functions that describe the likelihood of different outcomes:

Bernoulli distribution: $P(X=1) = p, P(X=0) = 1-p$ (binary outcomes)
Gaussian (Normal) distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ (bell curve)

Key concepts:

Expected value: $\mathbb{E}[X] = \sum_x x \cdot P(X=x)$ (the average value)
Variance: $\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$ (spread of the distribution)
Independence: Two events are independent if $P(A \cap B) = P(A)P(B)$
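To make the expected value and variance formulas concrete, here is a small sketch for a Bernoulli variable. The value p = 0.3 and the sample size are arbitrary choices for illustration; the sampled mean should land close to the exact expectation.

```python
import numpy as np

p = 0.3
outcomes = np.array([0.0, 1.0])
probs = np.array([1.0 - p, p])

# E[X] = sum over x of x * P(X = x); for a Bernoulli variable this equals p
expected = np.sum(outcomes * probs)

# Var(X) = E[(X - E[X])^2]; for a Bernoulli variable this equals p * (1 - p)
variance = np.sum((outcomes - expected) ** 2 * probs)

# Empirical check: the mean of many samples approaches the expected value
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=p, size=100_000)
sample_mean = samples.mean()

print(expected, variance, sample_mean)
```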

Why it matters: Many neural network outputs are probabilistic (e.g., classification probabilities). We also use probability to measure uncertainty, regularize models, and generate diverse outputs.

Derivatives and Gradients

To train neural networks, we need to optimize functions. Derivatives tell us how functions change.

Derivative: The rate of change of a function at a point: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

For a function $f(x) = x^2$ , the derivative is $f'(x) = 2x$ . This tells us that at $x=3$ , the function is increasing at a rate of 6.

Gradient: The generalization of derivatives to multivariable functions. For $f: \mathbb{R}^n \to \mathbb{R}$ , the gradient is:

$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right]$

The gradient points in the direction of steepest ascent—this is crucial for optimization.

Example: For $f(x, y) = x^2 + y^2$ , $\nabla f = [2x, 2y]$ . At point $(1, 2)$ , the gradient is $[2, 4]$ , pointing in the direction of fastest increase.
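A quick way to build trust in a gradient formula is to compare it with a finite-difference estimate. The sketch below does this for the example function; the helper names and the step size h are assumptions made for illustration.

```python
import numpy as np

def f(p):
    """f(x, y) = x^2 + y^2, with p = [x, y]."""
    x, y = p
    return x ** 2 + y ** 2

def analytic_grad(p):
    """Gradient derived by hand: [2x, 2y]."""
    return 2.0 * p

def numerical_grad(func, p, h=1e-5):
    """Central-difference estimate of each partial derivative."""
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2.0 * h)
    return grad

point = np.array([1.0, 2.0])
g_analytic = analytic_grad(point)
g_numeric = numerical_grad(f, point)

print(g_analytic)  # [2. 4.]
print(g_numeric)   # approximately the same values
```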

Chain Rule: When functions are composed, we can compute derivatives using:

$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$

This rule is the mathematical backbone of backpropagation, which we’ll cover in Phase 3.

Why it matters: Neural networks learn by adjusting parameters to minimize a loss function. The gradient tells us which direction to adjust each parameter to reduce the loss.

Connecting the Dots

These three pillars—linear algebra, probability, and calculus—work together in neural networks:

Linear algebra provides the structure for representing and transforming data
Probability models uncertainty and guides learning objectives
Calculus (gradients) enables us to optimize the network

In the next phase, we’ll see how these concepts combine in gradient descent and loss functions, setting the stage for understanding how neural networks actually learn.

Phase 2: Optimization

In Phase 1, we built the mathematical toolkit: linear algebra for data transformation, probability for modeling uncertainty, and calculus for understanding change. Now we’ll see how these tools combine to create the learning mechanism at the heart of all neural networks.

Loss Functions: Measuring Error

To learn, a neural network needs to know how well it’s doing. A loss function (or cost function) quantifies the error between predictions and true values.

Common Loss Functions

Mean Squared Error (MSE): For regression problems

$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

Where $y_i$ is the true value and $\hat{y}_i$ is the predicted value. The difference is squared to penalize large errors more heavily.
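As a small illustration of the formula, the sketch below computes MSE for four made-up targets and predictions (the numbers are hypothetical, chosen so the arithmetic is easy to follow).

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so the mean is 0.375
mse = np.mean((y_true - y_pred) ** 2)

print(mse)  # 0.375
```

Note how the single error of size 1 contributes four times as much as each error of size 0.5.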

Cross-Entropy Loss: For classification problems

$\mathcal{L}_{\text{CE}} = -\sum_{i=1}^C y_i \log(\hat{y}_i)$

Where $C$ is the number of classes, $y_i$ is the true label (one-hot encoded), and $\hat{y}_i$ is the predicted probability. This loss heavily penalizes confident wrong predictions.

Example: If we’re classifying images as “cat”, “dog”, or “bird”, and the true label is “cat” ( $y = [1, 0, 0]$ ), a prediction of $\hat{y} = [0.9, 0.05, 0.05]$ would have low loss, while $\hat{y} = [0.1, 0.8, 0.1]$ would have high loss.
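The two predictions in this example can be scored directly. This is a minimal sketch; the clipping constant eps is an assumption added to avoid taking log(0), not part of the formula itself.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1.0, 0.0, 0.0])            # true class: "cat"
confident_right = np.array([0.9, 0.05, 0.05])
confident_wrong = np.array([0.1, 0.8, 0.1])

loss_good = cross_entropy(y_true, confident_right)  # -log(0.9), about 0.105
loss_bad = cross_entropy(y_true, confident_wrong)   # -log(0.1), about 2.303

print(loss_good, loss_bad)
```

Because the label is one-hot, only the probability assigned to the true class contributes to the loss.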

Why it matters: The loss function is our objective. Everything in training is about minimizing this function. A good loss function accurately reflects what we care about in the real world.

Gradient Descent: The Learning Algorithm

Once we have a loss function, how do we minimize it? The answer is gradient descent.

Intuition

Imagine you’re on a mountain in thick fog. You want to reach the lowest point (the valley), but you can’t see the terrain. What do you do?

Feel the ground around you to determine which direction slopes downward
Take a step in that direction
Repeat until you reach the bottom

This is gradient descent: the loss landscape is our mountain, the negative gradient points downhill (the gradient itself points uphill), and each step is a parameter update.

The Algorithm

For a parameter vector $\boldsymbol{\theta}$ (all weights and biases in the network):

Compute gradient:

$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$

Update parameters:

$\boldsymbol{\theta}_{\text{new}} = \boldsymbol{\theta} - \eta \cdot \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$

Repeat until convergence

The learning rate $\eta$ (eta) controls step size:

Too small: Learning is very slow
Too large: We might overshoot or diverge

Example: One-Dimensional Case

Let $L(\theta) = (\theta - 3)^2$ . We want to find $\theta$ that minimizes this.

The gradient is: $\frac{dL}{d\theta} = 2(\theta - 3)$

Starting at $\theta = 0$ with learning rate $\eta = 0.1$ :

Step 1: Gradient $= 2(0 - 3) = -6$ , update: $\theta_{\text{new}} = 0 - 0.1(-6) = 0.6$
Step 2: Gradient $= 2(0.6 - 3) = -4.8$ , update: $\theta_{\text{new}} = 0.6 - 0.1(-4.8) = 1.08$
Step 3: Gradient $= 2(1.08 - 3) = -3.84$ , update: $\theta_{\text{new}} = 1.08 - 0.1(-3.84) = 1.464$

We’re moving toward $\theta = 3$ , the true minimum!
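The same steps can be run in a short loop. This sketch uses the loss, starting point, and learning rate from the example; the number of iterations (50) is an arbitrary choice.

```python
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
eta = 0.1
history = [theta]

for _ in range(50):
    # Step against the gradient, scaled by the learning rate
    theta = theta - eta * grad(theta)
    history.append(theta)

print(history[:3])         # starts 0.0, then roughly 0.6 and 1.08
print(theta, loss(theta))  # theta ends close to 3
```

Each update shrinks the distance to the minimum by a factor of 0.8, which is why the steps get smaller as theta approaches 3.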

Stochastic Gradient Descent (SGD)

In practice, we use stochastic gradient descent: instead of computing the gradient over all training data, we use a small batch (mini-batch).

$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) \approx \frac{1}{|B|} \sum_{i \in B} \nabla_{\boldsymbol{\theta}} \mathcal{L}_i(\boldsymbol{\theta})$

Where $B$ is a random batch of training examples.

Why SGD?

Computationally efficient (don’t need all data for each update)
Noisy updates can help the optimizer escape poor local minima and saddle points
Works better with large datasets
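Here is a minimal sketch of mini-batch SGD fitting a line with an MSE loss. The synthetic data (y = 2x + 1 plus noise), batch size, learning rate, and epoch count are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 2x + 1 plus a little Gaussian noise
X = rng.uniform(-1.0, 1.0, size=1000)
y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0
eta = 0.1
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(X))  # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        # Gradient of the batch MSE with respect to w and b
        err = (w * xb + b) - yb
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)

        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)  # should end near 2 and 1
```

Each update sees only 32 examples, so individual steps are noisy, but on average they point in the direction of the full-data gradient.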

Advanced Optimizers

Modern training rarely uses vanilla SGD. Popular variants include:

Momentum: Accumulates past gradients to accelerate through flat regions
Adam: Combines momentum with adaptive learning rates
RMSprop: Adapts learning rates per parameter

These build on the same gradient descent principle but use the gradient more intelligently.
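To show how little changes in the update rule, here is a sketch of gradient descent with momentum on the one-dimensional loss $L(\theta) = (\theta - 3)^2$ from earlier. The momentum coefficient beta = 0.9 is a commonly used value but an assumption here, and this is one of several equivalent ways of writing the update.

```python
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
velocity = 0.0
eta, beta = 0.1, 0.9

for _ in range(200):
    # The velocity is a decaying sum of past gradients
    velocity = beta * velocity + grad(theta)
    theta = theta - eta * velocity

print(theta)  # close to 3
```

On this simple bowl-shaped loss momentum brings no speed-up (it oscillates around the minimum before settling); its benefits tend to show on long, shallow, or ill-conditioned loss surfaces.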

The Importance of Gradients in Network Training

Gradients are the lifeblood of neural network training. Let’s understand why.

Gradients Guide Learning

Each gradient component $\frac{\partial \mathcal{L}}{\partial \theta_i}$ tells us:

How much the loss would change if we increased $\theta_i$ by a tiny amount
Which direction to move $\theta_i$ to reduce loss

A large positive gradient means “decrease this parameter to reduce loss,” while a large negative gradient means “increase this parameter to reduce loss.”

Vanishing and Exploding Gradients

Two common problems in deep networks:

Vanishing gradients: In very deep networks, gradients can become extremely small, making early layers learn very slowly. This was a major challenge in early neural networks.

Exploding gradients: Gradients can become extremely large, causing parameters to change drastically and training to become unstable.

Solutions include:

Careful initialization (e.g., Xavier, He initialization)
Activation functions with well-behaved gradients (we’ll see these in Phase 3)
Gradient clipping
Residual connections (skip connections)
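Of these remedies, gradient clipping is the simplest to show in code. The sketch below rescales a gradient whose norm exceeds a threshold; the function name and the threshold value are assumptions for illustration.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploding = np.array([30.0, 40.0])        # norm 50
clipped = clip_by_norm(exploding, 5.0)    # rescaled to norm 5: [3, 4]

small = np.array([0.3, 0.4])              # norm 0.5, below the threshold
unchanged = clip_by_norm(small, 5.0)

print(clipped, unchanged)
```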

Gradient Flow Through the Network

In a deep network, gradients flow backward from the output to the input:

Input → [Layer 1] → [Layer 2] → ... → [Output]
              ↑           ↑           ↑
           Gradient   Gradient   Gradient (starts here)

If gradients vanish early, the early layers (which often learn fundamental features) don’t get useful training signals. If gradients explode, training becomes unstable.
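A toy calculation makes the vanishing effect visible. The sketch below chains scalar sigmoid layers (the sigmoid activation is introduced in Phase 3) and multiplies the local derivatives together, as the chain rule prescribes. The weight of 1.0 and the input of 0.5 are arbitrary assumptions; real networks have weight matrices, but the shrinking product behaves similarly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(depth, x=0.5, w=1.0):
    """d(output)/d(input) for a chain of `depth` scalar layers a = sigmoid(w * a_prev)."""
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        # Chain rule: multiply by this layer's local derivative, sigmoid'(z) * w
        grad *= a * (1.0 - a) * w
    return grad

grads = {depth: input_gradient(depth) for depth in (1, 5, 10, 20)}
for depth, g in grads.items():
    print(depth, g)
```

Because the sigmoid derivative never exceeds 0.25, each extra layer shrinks the gradient by at least a factor of four in this setup.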

Why it matters: Understanding gradient flow helps us design better architectures. Modern breakthroughs like residual networks were motivated specifically to improve gradient flow.

Putting It All Together

We now have a complete picture of the learning loop:

Forward pass: Compute predictions using current parameters
Compute loss: Measure error using a loss function
Backward pass: Compute gradients of the loss with respect to all parameters
Update parameters: Move parameters in the opposite direction of gradients
Repeat

This loop, applied thousands or millions of times, is how neural networks learn from data. In Phase 3, we’ll dive into the neural network architecture itself—how neurons, layers, and activation functions work together to create powerful function approximators.

Phase 3: Neural Networks

In Phases 1 and 2, we covered the mathematical machinery: linear algebra for transformations, probability for modeling, and gradient descent for optimization. Now we’ll see how these pieces assemble into neural networks—the function approximators that can learn complex patterns from data.

What is a Neural Network?

At its core, a neural network is a mathematical function composed of simpler functions. It takes inputs, applies a series of transformations, and produces outputs. The “magic” is that the transformations are parameterized, and we learn those parameters from data.

Mathematically: A neural network computes $f(\mathbf{x}; \boldsymbol{\theta})$ where $\mathbf{x}$ is the input and $\boldsymbol{\theta}$ represents all learnable parameters.

Conceptually: It’s a computational graph where nodes perform operations and edges represent data flow.

The Perceptron: The Building Block

The perceptron is the simplest neural network unit—a single artificial neuron.

Structure

A perceptron takes $n$ inputs and produces one output:

Inputs: x₁, x₂, ..., xₙ
       ↓   ↓       ↓
Weights: w₁, w₂, ..., wₙ
       ↓   ↓       ↓
       Sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
                              ↓
                         Activation: a = σ(z)
                              ↓
                           Output: y = a

Mathematically:

$z = \sum_{i=1}^n w_i x_i + b = \mathbf{w} \cdot \mathbf{x} + b$

$y = \sigma(z)$

Where:

$\mathbf{w}$ is the weight vector
$b$ is the bias (allows shifting the activation function)
$\sigma$ is an activation function
$z$ is called the “pre-activation”

Geometric Intuition

The equation $\mathbf{w} \cdot \mathbf{x} + b = 0$ defines a hyperplane in $\mathbb{R}^n$ . The sign of $z = \mathbf{w} \cdot \mathbf{x} + b$ tells us which side of this hyperplane the input falls on.

For 2D inputs $(x_1, x_2)$ :

The decision boundary is the line $w_1 x_1 + w_2 x_2 + b = 0$
Points on the side that $\mathbf{w}$ points toward have $z > 0$ , points on the other side have $z < 0$
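As a concrete case, the sketch below hand-picks weights so that a perceptron computes logical AND on two binary inputs. The weights, bias, and the use of a hard step activation are assumptions for illustration; they are chosen by hand rather than learned.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the input lies on the positive side of the hyperplane, else 0."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Decision boundary: x1 + x2 - 1.5 = 0, which separates (1, 1) from the other corners
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x, dtype=float), w, b) for x in inputs]

print(outputs)  # [0, 0, 0, 1]
```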

Limitations of Single Perceptrons

A single perceptron can only learn linearly separable functions. It cannot solve the XOR problem, which requires a non-linear decision boundary.

Solution: Stack multiple perceptrons to create a multi-layer perceptron (MLP).

Feed-Forward Networks

A feed-forward network (or multi-layer perceptron) consists of layers of perceptrons connected sequentially.

Architecture

Input Layer    Hidden Layer 1    Hidden Layer 2    Output Layer
  [x₁] ──────►    [h₁₁] ──────►    [h₂₁] ──────►    [y₁]
  [x₂] ──────►    [h₁₂] ──────►    [h₂₂] ──────►    [y₂]
  [x₃] ──────►    [h₁₃] ──────►    [h₂₃] ──────►    [y₃]
                 ...                 ...

Each layer transforms the representation: $\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$

Where $\mathbf{h}^{(0)} = \mathbf{x}$ (input) and $\mathbf{h}^{(L)} = \mathbf{y}$ (output).

Why multiple layers?: Each layer learns increasingly abstract features. In image recognition, early layers might detect edges, middle layers detect shapes, and later layers detect objects.

Universal Approximation Theorem

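A remarkable theoretical result: A feed-forward network with a single hidden layer and a suitable non-linear activation can approximate any continuous function on a closed, bounded input domain to arbitrary accuracy (given enough neurons).

This means neural networks are incredibly flexible function approximators: with sufficient capacity, a network representing virtually any continuous mapping from inputs to outputs exists. The theorem does not say how many neurons are needed, or that training will actually find the right weights.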

Activation Functions: Adding Non-Linearity

Without activation functions, a multi-layer network would just be a series of linear transformations, which could be collapsed into a single linear transformation. Activation functions introduce non-linearity, enabling the network to learn complex, non-linear patterns.
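The collapse of stacked linear layers is easy to verify numerically. In the sketch below the matrices and the input are small hand-picked values (assumptions for illustration), and biases are omitted to keep the arithmetic visible.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([2.0, 1.0])

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)

# ...equal one linear layer whose weight matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x

# Putting a ReLU between the layers breaks this equivalence
with_relu = W2 @ np.maximum(0.0, W1 @ x)

print(two_layers, one_layer, with_relu)  # [0.] [0.] [1.]
```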

Sigmoid

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Properties:

Output in $(0, 1)$ (interpretable as probability)
Smooth gradient: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
Historically popular but suffers from vanishing gradients

Use case: Binary classification output

ReLU (Rectified Linear Unit)

$\text{ReLU}(z) = \max(0, z)$

Properties:

Simple: $f(z) = 0$ if $z < 0$ , else $f(z) = z$
Sparse activation (many neurons output zero)
Mitigates vanishing gradients (the gradient is exactly 1 for positive inputs)
Computationally efficient

Use case: Hidden layers in most modern networks

Why ReLU Won

ReLU’s simplicity and resistance to vanishing gradients made it the default choice for hidden layers. Variants like Leaky ReLU and ELU address the “dying ReLU” problem (neurons whose input stays negative, so they output zero and stop receiving gradient).
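The activation functions above, together with their derivatives, fit in a few lines of NumPy. This is a sketch; the Leaky ReLU slope alpha = 0.01 is a common default but an assumption here, and the ReLU derivative at exactly z = 0 is set to 0 by convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # A small slope for negative inputs keeps some gradient flowing
    return np.where(z > 0, z, alpha * z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print(sigmoid_grad(z))  # tiny at both ends: the sigmoid saturates
print(relu(z))          # [ 0.  0.  0.  1. 10.]
print(relu_grad(z))     # [0. 0. 0. 1. 1.]
print(leaky_relu(z))
```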

The Forward Pass: Making Predictions

The forward pass computes the network’s output given inputs.

Step-by-step:

Input $\mathbf{x}$ enters the first layer
For each layer $l = 1$ to $L$ :
a. Compute pre-activation: $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$
b. Apply activation: $\mathbf{h}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$
Output $\mathbf{h}^{(L)}$ is the prediction

Example: For a network with one hidden layer:

# Forward pass
z_hidden = W_hidden @ x + b_hidden
h_hidden = relu(z_hidden)
z_output = W_output @ h_hidden + b_output
y_pred = sigmoid(z_output)  # For binary classification
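The snippet above leaves the weights and helper functions undefined. A self-contained version might look like the sketch below, where the layer sizes, the random initialization, and the input values are all assumptions made so that the code runs.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

# Small random weights and zero biases (an illustrative initialization)
W_hidden = rng.normal(scale=0.5, size=(n_hidden, n_in))
b_hidden = np.zeros((n_hidden, 1))
W_output = rng.normal(scale=0.5, size=(1, n_hidden))
b_output = np.zeros((1, 1))

x = np.array([[0.5], [-1.0], [2.0]])  # one example as a column vector

# Forward pass
z_hidden = W_hidden @ x + b_hidden          # (4, 1)
h_hidden = relu(z_hidden)                   # (4, 1), non-negative
z_output = W_output @ h_hidden + b_output   # (1, 1)
y_pred = sigmoid(z_output)                  # probability in (0, 1)

print(y_pred)
```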

Backpropagation: Learning from Mistakes

Backpropagation is the algorithm that computes gradients efficiently for neural networks. It applies the chain rule recursively from the output to the input.

The Chain Rule in Action

For a simple network with one hidden layer:

$\mathcal{L} = \mathcal{L}(y, \hat{y}) = \mathcal{L}(y, \sigma(\mathbf{W}^{(2)} \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}))$

To find $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}}$ , we chain through the computation:

$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \hat{y}}}_{\text{Output layer}} \cdot \underbrace{\frac{\partial \hat{y}}{\partial \mathbf{z}^{(2)}}}_{\text{Output activation}} \cdot \underbrace{\frac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{h}^{(1)}}}_{\text{Hidden to output}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{z}^{(1)}}}_{\text{Hidden activation}} \cdot \underbrace{\frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{W}^{(1)}}}_{\text{Input to hidden}}$

Simplified Backprop Algorithm

Forward pass: Store all intermediate values ( $\mathbf{z}^{(l)}$ , $\mathbf{h}^{(l)}$ )
Output gradient: Compute $\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$
Backward pass: For each layer $l = L-1$ to $1$ : $\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l+1)}} \cdot \frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{z}^{(l)}}$
Parameter gradients: $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^T$ and $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$

Computational Efficiency

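The key insight of backpropagation is that it reuses intermediate results, so the gradients for all $N$ parameters cost about as much as one extra forward pass: $O(N)$ time, rather than the $O(N^2)$ of naive numerical differentiation, which needs a separate forward pass for every parameter. This makes training large networks feasible.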

Example: Computing a Gradient

For a single neuron with input $x$ , weight $w$ , bias $b$ , sigmoid activation, and MSE loss:

Forward: $z = wx + b$ , $a = \sigma(z)$ , $\mathcal{L} = \frac{1}{2}(a - y)^2$
Backward:
$\frac{\partial \mathcal{L}}{\partial a} = a - y$
$\frac{\partial a}{\partial z} = a(1 - a)$
$\frac{\partial z}{\partial w} = x$
$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = (a - y) \cdot a(1 - a) \cdot x$

This tells us exactly how to adjust $w$ to reduce the loss.
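One way to sanity-check a hand-derived gradient like this is to compare it with a finite-difference estimate. In the sketch below the values of w, b, x, and y are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, y):
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def grad_w(w, b, x, y):
    """Chain rule: dL/dw = (a - y) * a * (1 - a) * x."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x

w, b, x, y = 0.5, -0.2, 1.5, 1.0

analytic = grad_w(w, b, x, y)

# Central difference: nudge w slightly in both directions
h = 1e-6
numeric = (loss_fn(w + h, b, x, y) - loss_fn(w - h, b, x, y)) / (2.0 * h)

print(analytic, numeric)  # the two values should agree closely
```

Here the gradient comes out negative, which means increasing w would reduce the loss for this example.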

Putting It Together: Training Loop

We now have all the pieces:

# relu, sigmoid, binary_cross_entropy, initialize_weights and data_loader
# are assumed helpers. Each batch holds inputs x of shape (n_in, batch_size)
# and labels y of shape (1, batch_size); the loss is averaged over the batch.

# Initialize parameters
W_hidden, b_hidden = initialize_weights()
W_output, b_output = initialize_weights()

for epoch in range(num_epochs):
    for x, y in data_loader:
        batch_size = x.shape[1]

        # Forward pass
        z_hidden = W_hidden @ x + b_hidden
        h_hidden = relu(z_hidden)
        z_output = W_output @ h_hidden + b_output
        y_pred = sigmoid(z_output)

        # Compute loss
        loss = binary_cross_entropy(y, y_pred)

        # Backward pass (backprop)
        # With a sigmoid output and binary cross-entropy, dL/dz_output
        # simplifies to (y_pred - y); no extra sigmoid-derivative factor.
        grad_output = (y_pred - y) / batch_size
        grad_W_output = grad_output @ h_hidden.T
        grad_b_output = grad_output.sum(axis=1, keepdims=True)

        # ReLU passes gradient only where its input was positive
        grad_hidden = (W_output.T @ grad_output) * (z_hidden > 0)
        grad_W_hidden = grad_hidden @ x.T
        grad_b_hidden = grad_hidden.sum(axis=1, keepdims=True)

        # Update parameters (gradient descent)
        W_output -= learning_rate * grad_W_output
        b_output -= learning_rate * grad_b_output
        W_hidden -= learning_rate * grad_W_hidden
        b_hidden -= learning_rate * grad_b_hidden

In Phase 4, we’ll explore how these basic networks evolved into the sophisticated architectures powering modern AI, including attention mechanisms and transformers.

Phase 4: Advanced Architectures

We’ve covered the foundations: linear algebra, probability, calculus, gradient descent, and basic neural networks. Now we arrive at the architectures that have revolutionized AI—attention mechanisms and transformers. These concepts, combined with embeddings, power models like GPT, BERT, and the current generation of AI systems.

Embeddings: Representing Meaning as Vectors

Before understanding attention, we need to understand how we represent discrete inputs (like words) as continuous vectors that neural networks can process.

From One-Hot to Dense Vectors

One-hot encoding: Represent each word as a sparse vector where only one position is 1.

"cat"    = [1, 0, 0, 0, ...]
"dog"    = [0, 1, 0, 0, ...]
"bird"   = [0, 0, 1, 0, ...]

Problems: High dimensionality, no notion of similarity, no relationship between words.

Word embeddings: Learn dense, low-dimensional vectors where semantic similarity corresponds to geometric similarity.

"cat"    ≈ [0.2, -0.5, 0.8, 0.1, ...]
"dog"    ≈ [0.3, -0.4, 0.7, 0.2, ...]
"bird"   ≈ [0.9, 0.3, -0.2, 0.5, ...]

Now “cat” and “dog” have similar vectors (they’re semantically related), while “bird” is different.
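Geometric similarity is usually measured with cosine similarity. The sketch below applies it to the first four components of the illustrative vectors above; the remaining components are elided in the example, so treating these as complete 4-dimensional vectors is an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# First four components of the illustrative embeddings
cat = np.array([0.2, -0.5, 0.8, 0.1])
dog = np.array([0.3, -0.4, 0.7, 0.2])
bird = np.array([0.9, 0.3, -0.2, 0.5])

sim_cat_dog = cosine_similarity(cat, dog)    # about 0.98
sim_cat_bird = cosine_similarity(cat, bird)  # about -0.08

print(sim_cat_dog, sim_cat_bird)
```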

Why Embeddings Work

Words that appear in similar contexts should have similar meanings. By learning to predict words from context (or vice versa), we automatically capture semantic relationships.

Famous property: Vector arithmetic captures relationships:

king - man + woman ≈ queen

The direction from “man” to “woman” is similar to the direction from “king” to “queen”!
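The arithmetic can be illustrated with a toy vocabulary. The 2-dimensional vectors below are hand-made (one axis loosely meaning "royalty", the other "gender") and are purely hypothetical; learned embeddings have hundreds of dimensions and the analogy only holds approximately.

```python
import numpy as np

# Hand-made toy embeddings: [royalty, gender]
embeddings = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest_word(vec):
    """Return the vocabulary word whose embedding is closest to vec."""
    return min(embeddings, key=lambda word: np.linalg.norm(embeddings[word] - vec))

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
result = nearest_word(target)

print(target, result)  # [ 1. -1.] queen
```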

Training Embeddings

Word2Vec: Two approaches:

Skip-gram: Predict context words from target word
CBOW: Predict target word from context

GloVe: Factorize word co-occurrence matrix

Modern approach: Learn embeddings as part of the model training rather than pre-training separately.

Beyond Words

Embeddings extend beyond words:

Character embeddings: For subword information
Positional embeddings: For sequence order
Sentence embeddings: For entire documents
Multimodal embeddings: For images, audio, etc.

Attention: What It Is and Why It Matters

Attention mechanisms allow models to focus on relevant parts of the input when producing each part of the output. This was a breakthrough that solved limitations of sequence-to-sequence models.

The Problem with Fixed Representations

Before attention, sequence models (like RNNs) compressed an entire input sequence into a single fixed-size vector:

Input: "The cat sat on the mat"
↓ (encode entire sentence)
Fixed vector: [0.1, -0.3, 0.7, ...]
↓ (decode)
Output: "El gato se sentó en la alfombra"

Problem: All information must be compressed into one vector, which becomes a bottleneck for long sequences.

The Attention Solution

Instead of using a fixed vector, let each output step look at different parts of the input, weighted by relevance.

When generating "El"    → Focus on "The"
When generating "gato" → Focus on "cat"
When generating "se"    → Focus on "sat"

Mathematically, for output position $i$ , we compute a weighted sum of input representations:

$\mathbf{c}_i = \sum_{j} \alpha_{ij} \mathbf{h}_j$

Where $\alpha_{ij}$ is the attention weight from output position $i$ to input position $j$ .

Computing Attention Weights

How do we compute $\alpha_{ij}$ ? We use a compatibility function that measures how well input position $j$ relates to output position $i$ .

Dot-product attention:

$e_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j$

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$

Where:

$\mathbf{q}_i$ is the query vector (what we’re looking for)
$\mathbf{k}_j$ is the key vector (what the input offers)
$\mathbf{h}_j$ is the value vector (the actual content)

The softmax ensures weights sum to 1 (probability distribution).
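For a single query, the two formulas and the weighted sum amount to only a few lines. In this sketch the query, keys, and values are tiny hand-picked vectors (assumptions for illustration), with the first key chosen to match the query best.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

q = np.array([1.0, 0.0])                  # query for one output position
K = np.array([[1.0, 0.0],                 # key 0: aligned with the query
              [0.0, 1.0],                 # key 1: orthogonal to the query
              [-1.0, 0.0]])               # key 2: opposite to the query
H = np.array([[10.0, 0.0],                # value vectors h_j, one per input position
              [0.0, 10.0],
              [-10.0, 0.0]])

scores = K @ q             # e_j = q . k_j  ->  [1, 0, -1]
alpha = softmax(scores)    # attention weights, summing to 1
context = alpha @ H        # c = sum over j of alpha_j * h_j

print(alpha)    # largest weight on position 0
print(context)
```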

Query, Key, Value (Q, K, V) Explained

The Q, K, V framework is the core of modern attention mechanisms. Let’s break it down with an intuitive example.

Intuition: Database Analogy

Think of it like a database query:

Query (Q): What you’re searching for
Key (K): How items are indexed
Value (V): The actual content

Database:
  Key: "name", Value: "Alice"
  Key: "age", Value: 30
  Key: "city", Value: "NYC"

Query: "name"
→ Match: Key "name" matches Query "name"
→ Return: Value "Alice"

In Neural Networks

Each input position has its own K and V. Each output position has its own Q. We compute similarity between Q and all K’s, then weight V’s by these similarities.

Q, K, V in Code

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Args:
        Q: (batch_size, seq_len, d_k)
        K: (batch_size, seq_len, d_k)
        V: (batch_size, seq_len, d_v)
    Returns:
        output: (batch_size, seq_len, d_v)
        attention_weights: (batch_size, seq_len, seq_len)
    """
    # Compute similarity scores (swap the last two axes of K to transpose it)
    scores = Q @ np.swapaxes(K, -2, -1)  # (batch_size, seq_len, seq_len)

    # Scale so that large dot products do not saturate the softmax
    scores = scores / np.sqrt(K.shape[-1])

    # Apply softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)

    # Weight values by attention weights
    output = attention_weights @ V

    return output, attention_weights

Key steps:

Compute Q·K to get similarity scores
Divide by $\sqrt{d_k}$ (keeps the softmax from saturating, which stabilizes gradients)
Apply softmax to get a probability distribution
Take the weighted sum of the V’s

Multi-Head Attention

Instead of one set of Q, K, V, we use multiple “heads” that learn different relationships:

import numpy as np

# softmax is the same helper used by scaled_dot_product_attention above;
# initialize_weights(shape) is an assumed helper returning an array of that shape.

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Learnable projection matrices
        self.W_Q = initialize_weights((d_model, d_model))
        self.W_K = initialize_weights((d_model, d_model))
        self.W_V = initialize_weights((d_model, d_model))
        self.W_O = initialize_weights((d_model, d_model))

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V for all heads at once
        Q = x @ self.W_Q  # (batch, seq_len, d_model)
        K = x @ self.W_K
        V = x @ self.W_V

        # Split the last dimension into heads
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.d_k)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.d_k)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.d_k)

        # Reorder axes to (batch, num_heads, seq_len, d_k)
        Q = Q.transpose(0, 2, 1, 3)
        K = K.transpose(0, 2, 1, 3)
        V = V.transpose(0, 2, 1, 3)

        # Compute attention for each head
        scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(self.d_k)
        attention_weights = softmax(scores, axis=-1)
        attended = attention_weights @ V

        # Concatenate heads back to (batch, seq_len, d_model) and project
        attended = attended.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, self.d_model)
        output = attended @ self.W_O

        return output, attention_weights

Each head can focus on different types of relationships:

Head 1 might focus on syntactic structure
Head 2 might focus on semantic similarity
Head 3 might focus on positional relationships

Transformers: The Attention-Only Architecture

Transformers replace recurrence (RNNs) and convolution (CNNs) entirely with attention mechanisms.

Key Innovations

1. Self-attention: Instead of attention between encoder and decoder, each position attends to every position in the same sequence, including itself.

For the word "cat", compute attention weights with all words:
"cat" attends to: "The"(0.1), "cat"(0.6), "sat"(0.2), "on"(0.05), "the"(0.03), "mat"(0.02)

2. Positional encoding: Since attention doesn’t inherently capture order, we add position information:

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$

These sinusoidal encodings make relative positions easy to learn: for any fixed offset $k$, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$.
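The two formulas translate into a few lines of NumPy. This is a minimal sketch; the function name `positional_encoding` and the parameter `max_len` are assumptions, and it assumes `d_model` is even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Assumes d_model is even so sine and cosine columns pair up
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)

# The dot product between two encodings depends only on their offset (here 5)
print(np.isclose(pe[3] @ pe[8], pe[20] @ pe[25]))  # True
```

The final check illustrates the relative-position property: positions 3 and 8 relate to each other exactly as positions 20 and 25 do.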

3. Parallelization: Unlike RNNs, which process tokens one at a time, transformers process all positions simultaneously, enabling massive parallelization on GPUs.
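The contrast can be sketched in a few lines. This is an illustrative toy, assuming a plain tanh recurrence for the RNN and unprojected self-attention for the transformer side; the weight names `W_h` and `W_x` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 4
X = rng.normal(size=(seq_len, d))
W_h = rng.normal(size=(d, d)) * 0.1   # hypothetical recurrent weights
W_x = rng.normal(size=(d, d)) * 0.1   # hypothetical input weights

# RNN: step t needs the hidden state from step t-1, so the loop is sequential
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(h @ W_h + X[t] @ W_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention: all positions are handled by the same few matrix products
scores = X @ X.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ X

print(rnn_states.shape, attn_out.shape)  # (5, 4) (5, 4)
```

Both paths produce one vector per position, but only the attention path has no dependency between time steps, which is what lets a GPU compute every position at once.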

Transformer Architecture

Input Embedding + Positional Encoding
             ↓
    ┌─────────────────┐
    │  Encoder Stack  │
    │  (N times)      │
    │                 │
    │  ┌───────────┐  │
    │  │ Multi-Head│  │
    │  │ Attention │  │
    │  └───────────┘  │
    │        ↓        │
    │  ┌───────────┐  │
    │  │ Feed-     │  │
    │  │ Forward   │  │
    │  └───────────┘  │
    └─────────────────┘
             ↓
    ┌─────────────────┐
    │  Decoder Stack  │
    │  (N times)      │
    │                 │
    │  ┌───────────┐  │
    │  │ Masked    │  │
    │  │ Multi-Head│  │
    │  │ Attention │  │
    │  └───────────┘  │
    │        ↓        │
    │  ┌───────────┐  │
    │  │ Cross     │  │
    │  │ Attention │  │
    │  └───────────┘  │
    │        ↓        │
    │  ┌───────────┐  │
    │  │ Feed-     │  │
    │  │ Forward   │  │
    │  └───────────┘  │
    └─────────────────┘
             ↓
          Output

Encoder: Processes the input sequence and produces contextual representations

Decoder: Generates the output sequence, attending to both the encoder output and previously generated tokens

Masked attention: Prevents each decoder position from attending to later positions (“seeing the future”) during training, which keeps generation autoregressive
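The two decoder attention blocks differ only in where the queries, keys, and values come from and in whether a mask is applied. A minimal single-head sketch, assuming random arrays as stand-ins for the encoder output and decoder states (the names `enc_out` and `dec_in` are hypothetical, and learned projections are omitted):

```python
import numpy as np

def attention(Q, K, V, mask=None):
    # Scaled dot-product attention; mask is True where attention is allowed
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4
enc_out = rng.normal(size=(6, d))   # stand-in for encoder output (6 source tokens)
dec_in = rng.normal(size=(3, d))    # stand-in for 3 target tokens so far

# Masked self-attention: position i may only attend to positions <= i
causal = np.tril(np.ones((3, 3), dtype=bool))
_, self_w = attention(dec_in, dec_in, dec_in, mask=causal)

# Cross-attention: queries from the decoder, keys and values from the encoder
cross_out, cross_w = attention(dec_in, enc_out, enc_out)

print(np.round(self_w, 2))   # entries above the diagonal are 0
print(cross_w.shape)         # (3, 6)
```

Setting masked scores to negative infinity makes their softmax weight exactly zero, so the first target token can only attend to itself.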

Why Transformers Work So Well

Long-range dependencies: Attention connects any two positions directly, no matter how far apart
Parallelization: Process entire sequences at once, enabling training on massive datasets
Interpretability: Attention weights give a partial view of what the model focuses on
Scalability: Performance continues to improve with more data and compute

From Transformers to LLMs

Modern large language models are essentially:

Decoder-only transformers (for autoregressive text generation)
Trained on massive text datasets
Scaled up with more parameters, more data, and more compute

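GPT and LLaMA use essentially the decoder stack described above, without the cross-attention block and scaled up. BERT, by contrast, is encoder-only and is trained to fill in masked tokens rather than to generate text left to right.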
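Autoregressive generation means the model repeatedly predicts the next token and appends it to its own input. Here is a minimal sketch of that loop with greedy decoding, assuming a hypothetical `next_token_logits` function in place of a real decoder-only transformer. It is just a random lookup table keyed on the last token, so the generated text is meaningless.

```python
import numpy as np

vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
# Hypothetical stand-in for a trained model: a random table of logits
table = rng.normal(size=(len(vocab), len(vocab)))

def next_token_logits(token_ids):
    # A real decoder would attend over all of token_ids; this toy uses only the last
    return table[token_ids[-1]]

def generate(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(next_token_logits(ids)))  # greedy choice
        ids.append(next_id)          # feed the prediction back in as input
        if vocab[next_id] == "<eos>":
            break
    return ids

out = generate([vocab.index("the")])
print(" ".join(vocab[i] for i in out))
```

The loop structure is the part that carries over to real LLMs: one forward pass per new token, with each output becoming part of the next input.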

The Complete Picture

We’ve now traced the complete journey:

Linear algebra: Vectors and matrices represent and transform data
Probability: Models uncertainty and defines learning objectives
Calculus: Gradients guide parameter updates
Gradient descent: The optimization algorithm that learns parameters
Neural networks: Parameterized function approximators composed of layers
Perceptrons: Building blocks with weights, biases, and activations
Forward/backprop: Making predictions and computing gradients
Activations: ReLU and sigmoid introduce non-linearity
Embeddings: Discrete symbols become meaningful continuous vectors
Attention: Models dynamically focus on relevant information
Q/K/V: Framework for computing relevance and importance
Transformers: Attention-based architectures that process sequences
Modern AI: Scale these principles to achieve remarkable capabilities

Each concept builds on the previous ones, creating a mathematical foundation that enables machines to learn from data and perform increasingly sophisticated tasks. Understanding these foundations demystifies AI and provides the intuition to innovate and build the next generation of intelligent systems.

Conclusion

From basic mathematical operations to cutting-edge architectures, these concepts form a coherent framework for understanding how machines learn:

Linear algebra provides the language for representing and transforming data
Probability helps us model uncertainty and define what we want to learn
Calculus (gradients) shows us how to improve our models
Gradient descent is the algorithm that drives learning
Neural networks are the flexible function approximators that learn from data
Attention mechanisms allow models to focus on what matters
Transformers leverage attention to process sequences efficiently

This knowledge isn’t just academic—it’s the foundation for building, understanding, and improving AI systems. Whether you’re implementing a neural network, debugging a transformer model, or designing a new architecture, these fundamentals provide the intuition and tools you need.

The next time you use ChatGPT, generate images with DALL-E, or interact with any AI system, you’ll understand the mathematical machinery working beneath the surface. And perhaps more importantly, you’ll be equipped to contribute to the next generation of AI innovations.