{ "cells": [ { "cell_type": "markdown", "id": "29e01e9a", "metadata": {}, "source": [ "# Scaling Up Training with GenTS\n", "\n", "GenTS is built on top of **PyTorch Lightning**, which means you can leverage all of Lightning's built-in features for scaling up training — including **mixed precision**, **Distributed Data Parallel (DDP)**, **gradient clipping**, **callbacks**, and more.\n", "\n", "This tutorial demonstrates how to use these features to speed up training and handle larger-scale experiments." ] }, { "cell_type": "code", "execution_count": 1, "id": "a04bbd08", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/wcx/anaconda3/envs/gents/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n", "CUDA extension for cauchy multiplication not found. Install by going to extensions/cauchy/ and running `python setup.py install`. This should speed up end-to-end training by 10-50%\n", "Falling back on slow Cauchy kernel. Install at least one of pykeops or the CUDA extension for efficiency.\n", "Falling back on slow Vandermonde kernel. Install pykeops for improved memory efficiency.\n" ] } ], "source": [ "import time\n", "import torch\n", "import pandas as pd\n", "from lightning import Trainer\n", "from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint, Timer\n", "\n", "from gents.dataset import SineND\n", "from gents.model import VanillaDDPM" ] }, { "cell_type": "markdown", "id": "a712d92c", "metadata": {}, "source": [ "## Setup\n", "\n", "We use a **SineND** dataset with moderate size for benchmarking. 
All experiments use the same model architecture (`VanillaDDPM`) so that we can focus on measuring the effect of different training strategies.\n", "\n", "> **Note:** To observe meaningful speedups from DDP and mixed precision, use a GPU machine. The relative improvements on CPU will be minimal." ] }, { "cell_type": "code", "execution_count": null, "id": "9c657709", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using GPU: NVIDIA GeForce RTX 3080 Ti, Available GPUs: 4\n" ] } ], "source": [ "# Common hyperparameters\n", "SEQ_LEN = 512 # Larger sequence length for more realistic time series data\n", "SEQ_DIM = 16 \n", "BATCH_SIZE = 64\n", "MAX_EPOCHS = 5\n", "NUM_SAMPLES = 10000 \n", "\n", "# Helper: create fresh dataset and model for each experiment\n", "def make_data_and_model():\n", " dm = SineND(\n", " seq_len=SEQ_LEN,\n", " seq_dim=SEQ_DIM,\n", " num_samples=NUM_SAMPLES,\n", " batch_size=BATCH_SIZE,\n", " data_dir=\"../data\",\n", " )\n", " model = VanillaDDPM(seq_len=SEQ_LEN, seq_dim=SEQ_DIM)\n", " return dm, model\n", "\n", "# Store results for comparison\n", "results = []\n", "\n", "# Detect available accelerator\n", "if torch.cuda.is_available():\n", " ACCELERATOR = \"gpu\"\n", " NUM_GPUS = torch.cuda.device_count()\n", " print(f\"Using GPU: {torch.cuda.get_device_name(0)}, Available GPUs: {NUM_GPUS}\")\n", "else:\n", " ACCELERATOR = \"cpu\"\n", " NUM_GPUS = 0\n", " print(\"Using CPU (GPU recommended for meaningful speedups)\")" ] }, { "cell_type": "markdown", "id": "b42cc2be", "metadata": {}, "source": [ "## 1. Baseline: Standard Training (FP32) \n", "\n", "This is the default training mode — single device, full 32-bit precision. We'll use this as the reference point for comparison." 
] }, { "cell_type": "code", "execution_count": 3, "id": "029a4579", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3080 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading SineND dataset in ../data/SineND.pt\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 45.24it/s, v_num=2, train_loss_step=0.0754, val_loss=0.0774, train_loss_epoch=0.0771]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "`Trainer.fit` stopped: `max_epochs=5` reached.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 44.44it/s, v_num=2, train_loss_step=0.0754, val_loss=0.0774, train_loss_epoch=0.0771]\n", "Baseline training time: 66.97s\n" ] } ], "source": [ "dm, model = make_data_and_model()\n", "\n", "trainer = Trainer(\n", " max_epochs=MAX_EPOCHS,\n", " accelerator=ACCELERATOR,\n", " devices=1, # Single device\n", " precision=\"32-true\", # Full FP32 precision (default)\n", " enable_progress_bar=True,\n", " enable_model_summary=False,\n", " default_root_dir=\"../lightning_logs/scale_up/baseline\",\n", ")\n", "\n", "start = time.time()\n", "trainer.fit(model, dm)\n", "elapsed_baseline = time.time() - start\n", "\n", "results.append({\n", " \"Strategy\": \"Baseline (FP32)\",\n", " \"Precision\": 
\"32-true\",\n", " \"Devices\": 1,\n", " \"Time (s)\": round(elapsed_baseline, 2),\n", " \"Speedup\": \"1.00x\",\n", "})\n", "print(f\"Baseline training time: {elapsed_baseline:.2f}s\")" ] }, { "cell_type": "markdown", "id": "19d31c6a", "metadata": {}, "source": [ "## 2. Mixed Precision Training \n", "\n", "Mixed precision uses **FP16** (or BF16) for forward/backward passes while keeping master weights in FP32. This reduces memory usage and can significantly speed up training on modern GPUs (Volta+).\n", "\n", "Lightning makes this a one-line change via the `precision` argument:\n", "\n", "| Precision | Description | Recommended Hardware |\n", "|-----------|-------------|---------------------|\n", "| `\"32-true\"` | Full FP32 (default) | Any |\n", "| `\"16-mixed\"` | FP16 mixed precision with dynamic loss scaling | NVIDIA Volta+ (V100, A100, RTX) |\n", "| `\"bf16-mixed\"` | BF16 mixed precision (no loss scaling needed) | NVIDIA Ampere+ (A100, RTX 30/40 series) |\n", "| `\"16-true\"` | Pure FP16 (less stable) | Not recommended for training |" ] }, { "cell_type": "code", "execution_count": 4, "id": "e584afd6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using 16bit Automatic Mixed Precision (AMP)\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 40.58it/s, v_num=1, train_loss_step=0.085, val_loss=0.0782, train_loss_epoch=0.0772] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "`Trainer.fit` stopped: `max_epochs=5` reached.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 40.03it/s, v_num=1, train_loss_step=0.085, val_loss=0.0782, train_loss_epoch=0.0772]\n", "FP16 mixed precision training time: 13.98s 
(Speedup: 4.79x)\n" ] } ], "source": [ "dm, model = make_data_and_model()\n", "\n", "# FP16 mixed precision — just change the `precision` argument\n", "trainer_fp16 = Trainer(\n", " max_epochs=MAX_EPOCHS,\n", " accelerator=ACCELERATOR,\n", " devices=1,\n", " precision=\"16-mixed\", # FP16 mixed precision\n", " enable_progress_bar=True,\n", " enable_model_summary=False,\n", " default_root_dir=\"../lightning_logs/scale_up/fp16\",\n", ")\n", "\n", "start = time.time()\n", "trainer_fp16.fit(model, dm)\n", "elapsed_fp16 = time.time() - start\n", "\n", "speedup = elapsed_baseline / elapsed_fp16 if elapsed_fp16 > 0 else float(\"inf\")\n", "results.append({\n", " \"Strategy\": \"FP16 Mixed Precision\",\n", " \"Precision\": \"16-mixed\",\n", " \"Devices\": 1,\n", " \"Time (s)\": round(elapsed_fp16, 2),\n", " \"Speedup\": f\"{speedup:.2f}x\",\n", "})\n", "print(f\"FP16 mixed precision training time: {elapsed_fp16:.2f}s (Speedup: {speedup:.2f}x)\")" ] }, { "cell_type": "markdown", "id": "4984acaf", "metadata": {}, "source": [ "### BF16 Mixed Precision (Ampere+ GPUs)\n", "\n", "If your GPU supports BF16 (NVIDIA A100, RTX 3090/4090, etc.), `bf16-mixed` is preferred over `16-mixed` because it has a wider dynamic range and doesn't require loss scaling." 
] }, { "cell_type": "code", "execution_count": 5, "id": "46f02250", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using bfloat16 Automatic Mixed Precision (AMP)\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 39.10it/s, v_num=1, train_loss_step=0.0777, val_loss=0.0762, train_loss_epoch=0.0768]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "`Trainer.fit` stopped: `max_epochs=5` reached.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 38.54it/s, v_num=1, train_loss_step=0.0777, val_loss=0.0762, train_loss_epoch=0.0768]\n", "BF16 mixed precision training time: 14.01s (Speedup: 4.78x)\n" ] } ], "source": [ "dm, model = make_data_and_model()\n", "\n", "# BF16 mixed precision — requires Ampere+ GPU\n", "trainer_bf16 = Trainer(\n", " max_epochs=MAX_EPOCHS,\n", " accelerator=ACCELERATOR,\n", " devices=1,\n", " precision=\"bf16-mixed\", # BF16 mixed precision\n", " enable_progress_bar=True,\n", " enable_model_summary=False,\n", " default_root_dir=\"../lightning_logs/scale_up/bf16\",\n", ")\n", "\n", "start = time.time()\n", "trainer_bf16.fit(model, dm)\n", "elapsed_bf16 = time.time() - start\n", "\n", "speedup = elapsed_baseline / elapsed_bf16 if elapsed_bf16 > 0 else float(\"inf\")\n", "results.append({\n", " \"Strategy\": \"BF16 Mixed Precision\",\n", " \"Precision\": \"bf16-mixed\",\n", " \"Devices\": 1,\n", " \"Time (s)\": round(elapsed_bf16, 2),\n", " \"Speedup\": f\"{speedup:.2f}x\",\n", "})\n", "print(f\"BF16 mixed precision training time: {elapsed_bf16:.2f}s (Speedup: {speedup:.2f}x)\")" ] }, { "cell_type": "markdown", "id": "6113eaea", "metadata": {}, "source": [ "## 3. 
Distributed Data Parallel (DDP) \n", "\n", "When you have multiple GPUs, **DDP (Distributed Data Parallel)** splits the dataset across GPUs, where each GPU processes a different mini-batch independently. Gradients are synchronized via all-reduce after each backward pass.\n", "\n", "Lightning handles DDP out of the box via the `strategy` and `devices` arguments:\n", "\n", "```python\n", "# Use all available GPUs with DDP\n", "trainer = Trainer(strategy=\"ddp\", devices=\"auto\")\n", "\n", "# Use specific GPUs\n", "trainer = Trainer(strategy=\"ddp\", devices=[0, 1])\n", "\n", "# Use N GPUs\n", "trainer = Trainer(strategy=\"ddp\", devices=2)\n", "```\n", "\n", "> **Important:** DDP spawns separate processes per GPU. In a Jupyter notebook, DDP may not work correctly because of how Python multiprocessing interacts with notebooks. **For DDP, use a standalone script** (see below). In this notebook we demonstrate the API, but you should run multi-GPU experiments via command line.\n", "\n", "### DDP Script Example\n", "\n", "Save the following as `train_ddp.py` and run with:\n", "```bash\n", "python train_ddp.py\n", "```\n", "\n", "Lightning will automatically handle process spawning for DDP." 
] }, { "cell_type": "code", "execution_count": 6, "id": "f5ff034d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "import time\n", "import torch\n", "from lightning import Trainer\n", "from gents.dataset import SineND\n", "from gents.model import VanillaDDPM\n", "\n", "SEQ_LEN = 48\n", "SEQ_DIM = 2\n", "BATCH_SIZE = 64\n", "MAX_EPOCHS = 50\n", "NUM_SAMPLES = 5000\n", "\n", "dm = SineND(\n", " seq_len=SEQ_LEN, seq_dim=SEQ_DIM,\n", " num_samples=NUM_SAMPLES, batch_size=BATCH_SIZE,\n", " data_dir=\"../data\",\n", ")\n", "model = VanillaDDPM(seq_len=SEQ_LEN, seq_dim=SEQ_DIM)\n", "\n", "trainer = Trainer(\n", " max_epochs=MAX_EPOCHS,\n", " accelerator=\"gpu\",\n", " devices=2, # Use 2 GPUs\n", " strategy=\"ddp\", # Distributed Data Parallel\n", " enable_progress_bar=True,\n", " enable_model_summary=False,\n", " default_root_dir=\"../lightning_logs/scale_up/ddp\",\n", ")\n", "\n", "start = time.time()\n", "trainer.fit(model, dm)\n", "elapsed = time.time() - start\n", "\n", "if trainer.global_rank == 0:\n", " print(f\"DDP (2 GPUs) training time: {elapsed:.2f}s\")\n", "\n" ] } ], "source": [ "ddp_script = '''\n", "import time\n", "import torch\n", "from lightning import Trainer\n", "from gents.dataset import SineND\n", "from gents.model import VanillaDDPM\n", "\n", "SEQ_LEN = 48\n", "SEQ_DIM = 2\n", "BATCH_SIZE = 64\n", "MAX_EPOCHS = 50\n", "NUM_SAMPLES = 5000\n", "\n", "dm = SineND(\n", " seq_len=SEQ_LEN, seq_dim=SEQ_DIM,\n", " num_samples=NUM_SAMPLES, batch_size=BATCH_SIZE,\n", " data_dir=\"../data\",\n", ")\n", "model = VanillaDDPM(seq_len=SEQ_LEN, seq_dim=SEQ_DIM)\n", "\n", "trainer = Trainer(\n", " max_epochs=MAX_EPOCHS,\n", " accelerator=\"gpu\",\n", " devices=2, # Use 2 GPUs\n", " strategy=\"ddp\", # Distributed Data Parallel\n", " enable_progress_bar=True,\n", " enable_model_summary=False,\n", " default_root_dir=\"../lightning_logs/scale_up/ddp\",\n", ")\n", "\n", "start = time.time()\n", "trainer.fit(model, dm)\n", 
"elapsed = time.time() - start\n", "\n", "if trainer.global_rank == 0:\n", " print(f\"DDP (2 GPUs) training time: {elapsed:.2f}s\")\n", "'''\n", "\n", "print(ddp_script)\n", "ddp_time = 10.97 # Replace with actual timing result from running the above script\n", "speedup = elapsed_baseline / ddp_time if ddp_time > 0 else float(\"inf\") # Replace with actual timing result\n", "results.append({\n", " \"Strategy\": \"DDP (2 GPUs)\",\n", " \"Precision\": \"32-bit\",\n", " \"Devices\": 2,\n", " \"Time (s)\": ddp_time, # actual timing result in py script\n", " \"Speedup\": f\"{speedup:.2f}x\",\n", "})" ] }, { "cell_type": "markdown", "id": "7f861dbb", "metadata": {}, "source": [ "## 4. Gradient Accumulation \n", "\n", "Gradient accumulation simulates a larger batch size without requiring more GPU memory. Instead of updating weights every step, gradients are accumulated over `K` steps before an optimizer update.\n", "\n", "**Effective batch size** = `batch_size × accumulate_grad_batches`\n", "\n", "This is useful when:\n", "- Your GPU memory can't handle a large batch size directly\n", "- You want to experiment with larger effective batch sizes for stability" ] }, { "cell_type": "code", "execution_count": 7, "id": "601262a0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 49.00it/s, v_num=1, train_loss_step=0.0766, val_loss=0.0798, train_loss_epoch=0.0811]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "`Trainer.fit` stopped: `max_epochs=5` reached.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: 100%|██████████| 110/110 [00:02<00:00, 48.16it/s, v_num=1, train_loss_step=0.0766, val_loss=0.0798, 
train_loss_epoch=0.0811]\n", "Gradient accumulation training time: 11.14s (vs baseline: 6.01x)\n" ] } ], "source": [ "dm, model = make_data_and_model()\n", "\n", "# Accumulate gradients over 4 batches → effective batch size = 64 × 4 = 256\n", "trainer_accum = Trainer(\n", " max_epochs=MAX_EPOCHS,\n", " accelerator=ACCELERATOR,\n", " devices=1,\n", " precision=\"32-true\",\n", " accumulate_grad_batches=4, # Gradient accumulation\n", " enable_progress_bar=True,\n", " enable_model_summary=False,\n", " default_root_dir=\"../lightning_logs/scale_up/grad_accum\",\n", ")\n", "\n", "start = time.time()\n", "trainer_accum.fit(model, dm)\n", "elapsed_accum = time.time() - start\n", "\n", "speedup = elapsed_baseline / elapsed_accum if elapsed_accum > 0 else float(\"inf\")\n", "results.append({\n", " \"Strategy\": \"Gradient Accumulation (4x)\",\n", " \"Precision\": \"32-true\",\n", " \"Devices\": 1,\n", " \"Time (s)\": round(elapsed_accum, 2),\n", " \"Speedup\": f\"{speedup:.2f}x\",\n", "})\n", "print(f\"Gradient accumulation training time: {elapsed_accum:.2f}s (vs baseline: {speedup:.2f}x)\")" ] }, { "cell_type": "markdown", "id": "52ce2ae6", "metadata": {}, "source": [ "## 5. Performance Comparison\n", "\n", "Let's compare the results from all experiments side by side." ] }, { "cell_type": "code", "execution_count": 8, "id": "31210083", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Strategy | Precision | Devices | Time (s) | Speedup |
|---|---|---|---|---|---|
| 0 | Baseline (FP32) | 32-true | 1 | 66.97 | 1.00x |
| 1 | FP16 Mixed Precision | 16-mixed | 1 | 13.98 | 4.79x |
| 2 | BF16 Mixed Precision | bf16-mixed | 1 | 14.01 | 4.78x |
| 3 | DDP (2 GPUs) | 32-bit | 2 | 10.97 | 6.11x |
| 4 | Gradient Accumulation (4x) | 32-true | 1 | 11.14 | 6.01x |