Tensor Operations in Deep Learning
Knowing what tensors are is only the beginning. Deep learning is built on operations that reshape, combine, slice, and reduce tensors. These operations allow neural networks to transform raw data into predictions. Every forward pass, every attention computation, and every gradient update is a sequence of tensor operations.
Reshaping: Changing Shape Without Changing Data
Reshaping rearranges a tensor into a new shape while keeping all elements in the same order. The total number of elements must remain the same.
Original: shape (2, 6) -- 12 elements
Reshaped: shape (3, 4) -- still 12 elements
Reshaped: shape (12,) -- flat vector, still 12 elements
In PyTorch, you use tensor.view(3, 4) or tensor.reshape(3, 4); view returns a view of the same memory and fails if the layout is incompatible, while reshape copies the data when it has to. Flattening collapses all dimensions into a single vector -- a 28x28 image becomes a 784-element vector for a fully connected layer. Use -1 to let PyTorch infer a dimension automatically: for a batch of shape (32, 28, 28), images.view(32, -1) produces shape (32, 784).
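A minimal runnable sketch, using a random batch as a stand-in for real images:
import torch
images = torch.randn(32, 28, 28)    # placeholder batch of 32 grayscale 28x28 images
flat = images.view(32, -1)          # -1 is inferred as 784
also_flat = images.reshape(-1)      # collapse everything: 32 * 28 * 28 = 25088 elements
print(flat.shape, also_flat.shape)  # torch.Size([32, 784]) torch.Size([25088])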
Broadcasting: Operations on Different Shapes
Broadcasting lets you perform operations between tensors of different shapes by automatically expanding the smaller tensor to match the larger one without copying data.
Tensor A: shape (32, 10) -- batch of 32 vectors
Tensor B: shape (10,) -- single bias vector
A + B: shape (32, 10) -- B broadcast across all 32 rows
Shapes are compared from the rightmost dimension leftwards; two dimensions are compatible when they are equal, one of them is 1, or one is missing, and the size-1 or missing dimension is stretched to match. This is how bias addition works in every neural network layer.
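A short sketch of broadcast bias addition with placeholder values:
import torch
A = torch.randn(32, 10)                  # batch of 32 feature vectors
b = torch.randn(10)                      # one bias vector
out = A + b                              # b is stretched across all 32 rows
print(out.shape)                         # torch.Size([32, 10])
print(torch.allclose(out[0], A[0] + b))  # True: every row gets the same bias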
Slicing and Indexing
Slicing extracts sub-tensors by selecting ranges along specific axes.
data = torch.randn(32, 10, 768) # shape (32, 10, 768): (batch, tokens, features)
data[0] # First sample: shape (10, 768)
data[:, 0, :] # First token of every sample: shape (32, 768)
data[:, :5, :] # First 5 tokens: shape (32, 5, 768)
In transformers, extracting the [CLS] token for classification is simply output[:, 0, :].
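A small runnable sketch, using random values as a stand-in for a transformer's output and a hypothetical two-class head:
import torch
output = torch.randn(32, 10, 768)  # stand-in for a transformer's output
cls = output[:, 0, :]              # first token of every sequence: (32, 768)
head = torch.nn.Linear(768, 2)     # hypothetical two-class classification head
logits = head(cls)                 # (32, 2)
print(cls.shape, logits.shape)     # torch.Size([32, 768]) torch.Size([32, 2])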
Element-Wise Operations
Element-wise operations apply the same function to every element independently. The tensors must have the same shape (or be broadcastable).
C = A + B # element-wise addition
C = A * B # element-wise multiplication (Hadamard product)
C = torch.relu(A) # apply ReLU to every element: max(0, x)
Activation functions like ReLU, sigmoid, and GELU are all element-wise operations, introducing the non-linearity that gives neural networks their power.
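A tiny example with concrete values, so the element-by-element behaviour is visible:
import torch
A = torch.tensor([[-1.0, 2.0], [3.0, -4.0]])
B = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
print(A + B)          # [[9., 22.], [33., 36.]]
print(A * B)          # Hadamard product: [[-10., 40.], [90., -160.]]
print(torch.relu(A))  # [[0., 2.], [3., 0.]]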
Reduction Operations
Reductions collapse one or more axes by computing an aggregate like sum, mean, or max.
data = torch.randn(32, 10) # shape (32, 10)
data.sum() # Sum all elements: scalar
data.sum(dim=0) # Sum across batch: shape (10,)
data.mean(dim=1) # Mean across features: shape (32,)
data.max(dim=1) # Max across features: returns (values, indices), each of shape (32,)
The dim parameter specifies which axis to collapse. Reductions appear in loss functions, pooling layers, and normalization.
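A runnable sketch of the reductions above (note that max along a dimension returns both values and indices in PyTorch):
import torch
data = torch.randn(32, 10)
print(data.sum().shape)             # torch.Size([]) -- a scalar
print(data.sum(dim=0).shape)        # torch.Size([10])
print(data.mean(dim=1).shape)       # torch.Size([32])
values, indices = data.max(dim=1)   # per-row maxima and their argmax positions
print(values.shape, indices.shape)  # torch.Size([32]) torch.Size([32])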
Concatenation and Stacking
Concatenation joins tensors along an existing axis. Stacking joins them along a new axis.
# Concatenation along axis 1
a = torch.randn(32, 5, 768); b = torch.randn(32, 3, 768)
torch.cat([a, b], dim=1) # shape: (32, 8, 768)
# Stacking creates a new axis
x = torch.randn(10, 768); y = torch.randn(10, 768)
torch.stack([x, y], dim=0) # shape: (2, 10, 768)
Concatenation is used in skip connections that join features channel-wise (U-Net, DenseNet) and when combining features from different sources; ResNet's skip connections instead add features element-wise. Stacking assembles batches from individual samples.
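The same shapes, verified in a short runnable sketch:
import torch
a = torch.randn(32, 5, 768)
b = torch.randn(32, 3, 768)
print(torch.cat([a, b], dim=1).shape)    # torch.Size([32, 8, 768])
x = torch.randn(10, 768)
y = torch.randn(10, 768)
print(torch.stack([x, y], dim=0).shape)  # torch.Size([2, 10, 768])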
Einsum: Concise Tensor Notation
Einsum (Einstein summation) is a compact notation for expressing a wide range of tensor operations in a single expression.
C = torch.einsum('ij,jk->ik', A, B) # Matrix multiplication
C = torch.einsum('bij,bjk->bik', A, B) # Batch matrix multiplication
s = torch.einsum('i,i->', a, b) # Dot product
B = torch.einsum('ij->ji', A) # Transpose
Each letter labels an axis. Repeated letters are summed over. The arrow specifies the output axes. Einsum replaces separate calls to transpose, multiply, and sum, making complex tensor expressions shorter and clearer.
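A quick sketch checking each einsum expression against the equivalent standard operation, with arbitrary small sizes:
import torch
A = torch.randn(4, 5)
B = torch.randn(5, 6)
a = torch.randn(7)
b = torch.randn(7)
print(torch.allclose(torch.einsum('ij,jk->ik', A, B), A @ B))  # True: matrix multiplication
print(torch.allclose(torch.einsum('i,i->', a, b), a @ b))      # True: dot product
print(torch.allclose(torch.einsum('ij->ji', A), A.T))          # True: transpose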
Common Tensor Shapes in AI Models
Different domains have standard tensor shape conventions.
| Domain | Shape | Example |
|---|---|---|
| NLP (Transformers) | (batch, sequence_length, embedding_dim) | (32, 512, 768) |
| Computer Vision | (batch, channels, height, width) | (32, 3, 224, 224) |
| Tabular Data | (batch, features) | (64, 20) |
| Audio (spectrogram) | (batch, channels, frequency, time) | (16, 1, 128, 256) |
Knowing these conventions helps you read model code and debug shape mismatches, one of the most common errors in deep learning.
Practical Examples
Batch normalization combines reductions and broadcasting: compute the mean and variance across the batch dimension, then normalize using broadcasting (batch norm uses the biased variance, hence unbiased=False below).
mean = data.mean(dim=0) # shape: (features,)
normalized = (data - mean) / (data.var(dim=0, unbiased=False) + 1e-5).sqrt() # broadcasting
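A runnable sketch of this, checked against torch.nn.BatchNorm1d with assumed sizes (64 samples, 20 features):
import torch
data = torch.randn(64, 20)                          # placeholder (batch, features) data
mean = data.mean(dim=0)                             # (20,)
var = data.var(dim=0, unbiased=False)               # biased variance, as batch norm uses
manual = (data - mean) / (var + 1e-5).sqrt()        # broadcasting over the batch dimension
bn = torch.nn.BatchNorm1d(20, eps=1e-5)             # fresh layer: weight=1, bias=0, training mode
print(torch.allclose(bn(data), manual, atol=1e-5))  # True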
Attention masks use element-wise operations and broadcasting: a mask holding 0 at positions to keep and a large negative value at padding positions, commonly shaped (batch, 1, 1, seq_len) or (batch, 1, seq_len, seq_len), is added to the attention scores and broadcast across the head dimension, so the softmax gives masked positions near-zero weight.
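A minimal sketch of an additive padding mask, with assumed batch, head, and sequence sizes:
import torch
scores = torch.randn(2, 8, 5, 5)                # (batch, heads, seq, seq) attention scores
pad = torch.tensor([[1, 1, 1, 0, 0],            # 1 = real token, 0 = padding
                    [1, 1, 1, 1, 0]])
mask = (1 - pad)[:, None, None, :] * -1e9       # (2, 1, 1, 5): 0 to keep, -1e9 to mask
weights = torch.softmax(scores + mask, dim=-1)  # mask broadcasts to (2, 8, 5, 5)
print(weights[0, 0, 0])                         # the last two weights are ~0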
Summary
- Reshaping changes tensor shape without altering data; flattening is the most common case
- Broadcasting automatically expands smaller tensors to match larger ones for element-wise operations
- Slicing extracts sub-tensors along any axis
- Element-wise operations apply functions independently to every element (addition, multiplication, activation functions)
- Reductions collapse axes with sum, mean, or max, used in loss functions and normalization
- Concatenation joins tensors along existing axes; stacking creates new axes
- Einsum provides concise notation for complex tensor operations
- Standard shapes exist for each AI domain: NLP uses (batch, seq, embed), vision uses (batch, channels, height, width)
With tensor structure and operations covered, the final lesson ties everything together by showing how every linear algebra concept from this course appears in the end-to-end AI pipeline.

