Scaling Deep Learning: Hardware Acceleration, Memory Management, and Distributed (DDP) Architectures
1. System-Level Topology
Modern deep learning training is dominated by large tensor operations such as matrix multiplication, convolution, and attention. GPUs are designed for high-throughput parallel computation, which makes them well suited for neural network training.
At a high level:
CPU: good at general-purpose control logic and sequential work
GPU: good at doing many similar numerical operations in parallel
Deep learning models repeatedly apply the same mathematical operations to large batches of data. This maps naturally to GPU hardware.
flowchart LR
A[Training Data] --> B[Model Forward Pass]
B --> C[Loss Computation]
C --> D[Backward Pass / Gradients]
D --> E[Optimizer Step]
E --> B
2. Why GPUs Are Useful for AI
GPUs are useful for AI because neural networks are mostly built from dense linear algebra.
Common GPU-accelerated operations include:
- matrix multiplication
- convolution
- attention
- normalization
- activation functions
- tensor reshaping and reductions
The most important operation in many models is matrix multiplication:
where:
Xis input dataWis model weightsYis output activations
These operations can be split into many small parallel computations, which GPUs handle efficiently.
3. CPU vs GPU Mental Model
A useful simplified view:
CPUs are strong for:
- operating system tasks
- job orchestration
- data loading
- preprocessing
- control flow
GPUs are strong for:
- tensor math
- training neural networks
- large-batch inference
- matrix multiplication
In deep learning training, the CPU often feeds data to the GPU, while the GPU performs the heavy numerical computation.
flowchart LR
A[Disk / Dataset] --> B[CPU DataLoader]
B --> C[Preprocessing]
C --> D[GPU Tensor Computation]
D --> E[Loss / Gradients]
4. Single-GPU Training
Single-GPU training is the simplest case.
A typical PyTorch workflow looks like:
Conceptually:
flowchart TD
A[Batch of Data] --> B[Move to GPU]
B --> C[Model Forward]
C --> D[Loss]
D --> E[Backward / Gradients]
E --> F[Optimizer Step]
In PyTorch, this usually means:
model = model.to("cuda")
inputs = inputs.to("cuda")
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
Single-GPU training is easy to reason about, but it is limited by:
- GPU memory
- GPU compute capacity
- data loading speed
- training time
5. When One GPU Is Not Enough
Multiple GPUs may be needed when:
- The model is too large for one GPU.
- The batch size is too large for one GPU.
- Training time on one GPU is too slow.
- The dataset is very large.
- The model is a large Transformer or other foundation model.
There are different ways to use multiple GPUs depending on the bottleneck.
6. Types of Multi-GPU Parallelism
The major categories are:
- Data parallelism
- Model parallelism
- Tensor parallelism
- Pipeline parallelism
- Hybrid parallelism
flowchart TD
A[Multi-GPU Training] --> B[Data Parallelism]
A --> C[Model / Tensor Parallelism]
A --> D[Pipeline Parallelism]
A --> E[Hybrid Parallelism]
For many users and many research workflows, data parallelism with PyTorch DistributedDataParallel (DDP) is the most common starting point.
7. Data Parallelism
Data parallelism means:
Example with four GPUs:
GPU0: model copy + batch A
GPU1: model copy + batch B
GPU2: model copy + batch C
GPU3: model copy + batch D
flowchart LR
A[Dataset Batch] --> B[Split Batch]
B --> C[GPU0: batch A]
B --> D[GPU1: batch B]
B --> E[GPU2: batch C]
B --> F[GPU3: batch D]
C --> G[Gradients]
D --> G
E --> G
F --> G
G --> H[Average Gradients]
H --> I[Update All Model Copies]
Each GPU computes gradients based on its own mini-batch. Then gradients are averaged so that all model copies remain synchronized.
8. Effective Batch Size
In data parallel training, the effective batch size increases with the number of GPUs.
Example:
This matters because changing effective batch size can affect optimization behavior. Sometimes the learning rate or training schedule needs adjustment when scaling to many GPUs.
9. PyTorch DataParallel vs DistributedDataParallel
PyTorch has historically offered two common approaches:
torch.nn.DataParalleltorch.nn.parallel.DistributedDataParallel
DataParallel is simpler but generally not recommended for modern multi-GPU training. It uses one Python process and splits work across GPUs, often creating a bottleneck on the primary GPU.
DistributedDataParallel (DDP) is the recommended approach. It uses one process per GPU and performs efficient gradient synchronization.
DataParallel:
1 process → multiple GPUs
DistributedDataParallel:
multiple processes → one GPU per process
flowchart TD
A[Multi-GPU PyTorch] --> B[DataParallel]
A --> C[DistributedDataParallel]
B --> B1[Single process; easier but slower]
C --> C1[One process per GPU; recommended]
10. DDP Core Mental Model
The most useful mental model for DDP is:
Each GPU has one process.
Each process has one model copy.
Each process receives different data.
Gradients are synchronized every training step.
With four GPUs:
Process 0 → GPU0 → model copy → data shard A
Process 1 → GPU1 → model copy → data shard B
Process 2 → GPU2 → model copy → data shard C
Process 3 → GPU3 → model copy → data shard D
flowchart LR
P0[Process 0] --> G0[GPU 0]
P1[Process 1] --> G1[GPU 1]
P2[Process 2] --> G2[GPU 2]
P3[Process 3] --> G3[GPU 3]
G0 --> S[Gradient Synchronization]
G1 --> S
G2 --> S
G3 --> S
11. What torchrun Does
For DDP, you usually do not launch the script with plain python train.py when using multiple GPUs.
Instead, you use:
This launches four Python processes on the node.
torchrun sets environment variables such as:
Meaning:
LOCAL_RANK: process index on the local node, often maps to local GPU IDRANK: global process index across all processesWORLD_SIZE: total number of processes
For one node with four GPUs:
12. Same Node Multi-GPU Training
When all GPUs are on the same physical node, communication happens through local GPU interconnects such as PCIe or NVLink.
Example:
A common Slurm job request might look like:
Inside the job script:
Same-node multi-GPU is generally easier and faster than multi-node training because communication stays within one machine.
13. Multi-Node Multi-GPU Training
Multi-node training extends the same DDP idea across multiple machines.
Example:
The conceptual model remains:
one process per GPU
one model copy per process
different data per process
gradient synchronization across all processes
flowchart TD
subgraph Node1
A1[GPU0/P0]
A2[GPU1/P1]
A3[GPU2/P2]
A4[GPU3/P3]
end
subgraph Node2
B1[GPU0/P4]
B2[GPU1/P5]
B3[GPU2/P6]
B4[GPU3/P7]
end
A1 --> S[Global Gradient Sync]
A2 --> S
A3 --> S
A4 --> S
B1 --> S
B2 --> S
B3 --> S
B4 --> S
The main difference is that synchronization now crosses the network. This makes communication performance much more important.
14. Gradient Synchronization
In DDP, synchronization happens during the backward pass.
Each GPU computes gradients from its own data. Then DDP averages gradients across all processes.
Conceptually:
Then each process applies the same optimizer update.
This keeps all model copies identical.
Important point:
sequenceDiagram
participant G0 as GPU0
participant G1 as GPU1
participant G2 as GPU2
participant G3 as GPU3
G0->>G0: Forward + Backward
G1->>G1: Forward + Backward
G2->>G2: Forward + Backward
G3->>G3: Forward + Backward
G0->>G1: AllReduce gradients
G1->>G2: AllReduce gradients
G2->>G3: AllReduce gradients
G3->>G0: AllReduce gradients
G0->>G0: Optimizer step
G1->>G1: Optimizer step
G2->>G2: Optimizer step
G3->>G3: Optimizer step
15. NCCL
NCCL stands for NVIDIA Collective Communications Library.
It is a communication backend optimized for GPU-to-GPU communication.
PyTorch DDP commonly uses NCCL for GPU training:
NCCL handles collective operations such as:
- AllReduce
- Broadcast
- ReduceScatter
- AllGather
For DDP, the most important concept is AllReduce, which is used to combine gradients across GPUs.
16. AllReduce Intuition
AllReduce is a collective operation where all processes contribute values and all receive the combined result.
For gradient averaging:
This avoids sending everything to one central GPU and helps communication scale better.
flowchart LR
A[GPU0 Gradient] --> E[AllReduce]
B[GPU1 Gradient] --> E
C[GPU2 Gradient] --> E
D[GPU3 Gradient] --> E
E --> F[Average Gradient on GPU0]
E --> G[Average Gradient on GPU1]
E --> H[Average Gradient on GPU2]
E --> I[Average Gradient on GPU3]
17. DistributedSampler
In DDP, each process should see a different slice of the dataset.
Without this, every GPU might train on the same examples, wasting computation.
PyTorch provides DistributedSampler:
from torch.utils.data.distributed import DistributedSampler
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler)
At the start of each epoch, it is common to call:
This ensures proper shuffling across epochs.
18. Minimal DDP Code Pattern
A simplified DDP structure:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def setup_distributed():
distributed = "RANK" in os.environ
if distributed:
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
else:
local_rank = 0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
return distributed, local_rank, device
distributed, local_rank, device = setup_distributed()
model = MyModel().to(device)
if distributed:
model = DDP(model, device_ids=[local_rank])
This pattern allows the same script to run in single-GPU or multi-GPU mode.
19. Single GPU as a Special Case
A robust training script can treat single-GPU training as a special case of distributed training.
Typical execution modes:
for single GPU, or:
for four GPUs.
The code can detect whether distributed variables exist and adapt automatically.
This is useful because the same core training script can scale from a local workstation to an HPC node.
20. Slurm and GPU Allocation
On HPC systems, Slurm allocates GPUs as scheduled resources.
A job may request GPUs like this:
or:
Slurm controls which GPUs are visible to the job using environment variables such as:
If a node has four GPUs and a job requests two, Slurm may expose only two GPUs to that job.
Inside the job, PyTorch sees those assigned GPUs as logical device IDs starting from 0.
21. Node Sharing and GPU Allocation
If a node has multiple GPUs, Slurm may allow multiple users to share the node as long as resources do not conflict.
Example:
Both jobs may run on the same node if enough CPU, memory, and GPU resources remain.
If a node has only one GPU and two users each request one GPU, both GPU jobs should not run at the same time on that node. One job should wait until the GPU is free.
A GPU job and a CPU-only job can often share the same node if the cluster policy allows it, because only one job needs the GPU.
22. Example Slurm Script: Single GPU
#!/bin/bash
#SBATCH --job-name=single_gpu_train
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=single_gpu_%j.out
module load cuda
conda activate myenv
python train.py
This runs one Python process using one GPU.
23. Example Slurm Script: Single Node, Multi-GPU DDP
#!/bin/bash
#SBATCH --job-name=ddp_single_node
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=ddp_%j.out
module load cuda
conda activate myenv
torchrun --nproc_per_node=4 train.py
This launches four processes, one per GPU.
24. Example Slurm Script: Multi-Node DDP Concept
A multi-node DDP script requires coordination between nodes. The exact script depends on cluster configuration, but conceptually it needs:
- number of nodes
- processes per node
- master address
- master port
- node rank
Example structure:
#!/bin/bash
#SBATCH --job-name=ddp_multi_node
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --output=ddp_multinode_%j.out
module load cuda
conda activate myenv
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=29500
NNODES=$SLURM_JOB_NUM_NODES
NPROC_PER_NODE=4
NODE_RANK=$SLURM_NODEID
torchrun \
--nnodes=$NNODES \
--nproc_per_node=$NPROC_PER_NODE \
--node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train.py
Some clusters prefer launching with srun; others use torchrun directly inside the allocation. The best approach depends on site policy.
25. Failure Behavior in DDP
Standard DDP is synchronous and not fault tolerant by default.
This means:
DDP does not automatically remove a bad GPU or redistribute its data to other GPUs.
The usual mitigation is checkpointing.
26. Checkpointing
Checkpointing periodically saves training state so that a failed job can restart from the last saved point.
A useful checkpoint may include:
- model weights
- optimizer state
- scheduler state
- epoch number
- random number generator states
In DDP, usually only rank 0 saves the checkpoint to avoid multiple processes writing the same file.
Example pattern:
When using DDP, the underlying model may be accessed as:
because DDP wraps the original model.
27. Straggler Problem
Because DDP is synchronous, all processes must reach synchronization points together.
If one GPU is slow, all other GPUs wait.
Causes of stragglers include:
- uneven data loading
- slow storage access
- overloaded CPU
- hardware differences
- network issues
- thermal throttling
sequenceDiagram
participant G0 as Fast GPU
participant G1 as Fast GPU
participant G2 as Slow GPU
G0->>G0: Finish backward early
G1->>G1: Finish backward early
G2->>G2: Still computing
G0->>G0: Wait
G1->>G1: Wait
G2->>G2: Finish backward
G0->>G2: Synchronize gradients
G1->>G2: Synchronize gradients
28. Common Bottlenecks
Multi-GPU training can be limited by several bottlenecks.
GPU compute bottleneck
The model is using the GPU heavily and training speed is limited by GPU math throughput.
GPU memory bottleneck
The model or batch size is too large for GPU memory.
Data loading bottleneck
The GPU waits for the CPU or storage system to prepare batches.
Communication bottleneck
GPUs spend too much time synchronizing gradients.
Network bottleneck
In multi-node training, inter-node communication may limit scaling.
29. Common Debugging Scenarios
Only one GPU is being used
Likely causes:
- launched with
python train.pyinstead oftorchrun - model not wrapped in DDP
- code hardcoded to one GPU
Multi-GPU job is slower than expected
Likely causes:
- data loading bottleneck
- too small batch size per GPU
- excessive communication
- slow storage
- imbalance across processes
DDP job hangs
Likely causes:
- one process crashed silently
- mismatch in number of processes
- all processes did not call the same collective operation
- networking or NCCL issue
- different code path on different ranks
Out of memory
Likely fixes:
- reduce batch size
- use mixed precision
- use gradient checkpointing
- use model/tensor parallelism
- reduce sequence length
30. Relationship to OpenMP, MPI, Dask, and Joblib
Traditional HPC often uses MPI and OpenMP.
OpenMP:
MPI:
PyTorch DDP is conceptually closer to MPI because it uses multiple processes and collective communication. However, users usually do not write MPI code directly.
Dask and joblib are higher-level Python tools for parallelism.
A useful mapping:
joblib → high-level local parallelism
dask → high-level distributed task parallelism
DDP → distributed synchronous deep learning training
MPI → lower-level distributed message passing
OpenMP → lower-level shared-memory threading
31. Training vs Inference on GPUs
Training requires:
- forward pass
- loss computation
- backward pass
- gradient synchronization
- optimizer update
Inference usually requires:
- forward pass only
Training is typically much more memory-intensive because it must store activations for backpropagation.
For decoder-only Transformers, inference generates one token at a time, often using a KV cache to avoid recomputing previous attention keys and values.
32. Large Transformer Training and GPUs
Large Transformer training may require more than data parallelism.
Reasons:
- model too large for one GPU
- optimizer states too large
- activation memory too large
- sequence length too long
Additional strategies include:
- mixed precision training
- gradient checkpointing
- tensor parallelism
- pipeline parallelism
- ZeRO optimizer sharding
- fully sharded data parallelism
These techniques reduce memory pressure and/or distribute model components across GPUs.
33. Model Parallelism
Model parallelism splits the model across GPUs.
Example:
This is useful when the model does not fit on one GPU.
flowchart LR
A[Input] --> B[Layers 1-12 on GPU0]
B --> C[Layers 13-24 on GPU1]
C --> D[Output]
The downside is that GPUs may wait on each other unless carefully pipelined.
34. Tensor Parallelism
Tensor parallelism splits individual matrix operations across GPUs.
Instead of putting different layers on different GPUs, a single large layer may be partitioned.
Example:
This is common in very large language model training.
35. Pipeline Parallelism
Pipeline parallelism splits model layers into stages and sends micro-batches through them.
flowchart LR
A[Microbatch 1] --> B[Stage 1 / GPU0]
B --> C[Stage 2 / GPU1]
C --> D[Stage 3 / GPU2]
D --> E[Stage 4 / GPU3]
It improves utilization compared with simple layer splitting, but introduces scheduling complexity and pipeline bubbles.
36. Hybrid Parallelism
Large-scale training often combines multiple methods:
For example:
- data parallelism across groups of nodes
- tensor parallelism within a node
- pipeline parallelism across layers
- optimizer states sharded across processes
This is common for very large Transformer models.
37. Practical Mental Models
Single GPU
Single Node Multi-GPU DDP
Multi-Node DDP
Data Parallelism
Model Parallelism
Pipeline Parallelism
38. Key Terms Glossary
GPU: Graphics Processing Unit; accelerates parallel tensor computation.
CUDA: NVIDIA programming platform for GPU computing.
CUDA_VISIBLE_DEVICES: Environment variable controlling which GPUs are visible to a process.
DDP: DistributedDataParallel; PyTorch's recommended multi-GPU training approach.
Rank: Global ID of a distributed process.
Local rank: Process ID within a node, usually used to select the local GPU.
World size: Total number of distributed processes.
NCCL: NVIDIA communication library used for GPU collectives.
AllReduce: Collective operation that combines values across processes and returns the result to all processes.
DistributedSampler: PyTorch sampler that partitions data across DDP processes.
Effective batch size: Batch size per GPU multiplied by number of GPUs.
Straggler: A slow process or GPU that causes others to wait.
Checkpoint: Saved training state used for recovery or resuming training.
Data parallelism: Replicate model, split data.
Model parallelism: Split model across GPUs.
Tensor parallelism: Split individual tensor operations across GPUs.
Pipeline parallelism: Split layers into stages and stream micro-batches.
Mixed precision: Use lower precision such as FP16/BF16 to reduce memory and improve speed.
39. Summary
GPU training starts with a simple idea: use GPUs for parallel tensor computation. Single-GPU training is straightforward, but scaling to multiple GPUs requires coordinating multiple devices.
The most common modern approach is PyTorch DistributedDataParallel, where each GPU is controlled by its own process, each process trains on different data, and gradients are synchronized every training step using NCCL collectives such as AllReduce.
Multi-GPU training can happen within one node or across many nodes. The mental model remains similar, but multi-node training introduces network communication and more failure modes. DDP is synchronous, so slow or failed workers affect the whole job. Practical robustness comes from good resource allocation, balanced data loading, monitoring, and checkpointing.
For very large models, data parallelism may not be enough. Techniques such as model parallelism, tensor parallelism, pipeline parallelism, mixed precision, and sharding are used to fit and train models at scale.
40. One-Line Summary
GPU training accelerates deep learning by parallelizing tensor operations, and multi-GPU training scales this by distributing work across GPUs while synchronizing model updates through mechanisms such as PyTorch DDP and NCCL AllReduce.