GPU Acceleration#
Overview#
linumpy supports GPU acceleration for compute-intensive operations using NVIDIA CUDA via CuPy. GPU acceleration is optional - all functions automatically fall back to CPU (NumPy/SciPy) if:
CuPy is not installed
No CUDA-capable GPU is available
GPU memory is insufficient
flowchart TD
CALL[GPU-aware function called<br/>backend='auto' default] --> CHK1{CuPy installed?}
CHK1 -->|no| CPU[Run on CPU<br/>NumPy / SciPy / SimpleITK]
CHK1 -->|yes| CHK2{CUDA device available?}
CHK2 -->|no| CPU
CHK2 -->|yes| PICK[Auto-select least-loaded GPU]
PICK --> RUN[Run CuPy kernel on device]
RUN -->|OOM / runtime error| CPU
RUN -->|success| OUT([Result])
CPU --> OUT
backend selection is per-call (backend="cpu" | "gpu" | "auto"); the auto path is the safe default and is what the Nextflow workflows use when use_gpu=true.
Quick Start#
# Check your CUDA version
nvidia-smi | grep "CUDA Version"
# Install linumpy with GPU support (choose your CUDA version)
uv pip install 'linumpy[gpu]' # CUDA 12.x (default)
uv pip install 'linumpy[gpu-cuda13]' # CUDA 13.x
# Verify GPU
linum_gpu_info.py
linum_diagnose_pipeline.py --benchmark
Installation#
Requirements#
NVIDIA GPU with CUDA Compute Capability 3.0+
CUDA Toolkit 11.x, 12.x, or 13.x
CuPy matching your CUDA version
Recommended GPU: NVIDIA A6000 (48GB) or similar professional GPU.
CuPy Version Reference#
CUDA Version |
CuPy Package |
linumpy extra |
|---|---|---|
CUDA 12.x |
|
|
CUDA 13.x |
|
|
BaSiCPy (fix_illumination)#
The fix_illumination step uses BaSiCPy 2.x, which now ships with a
PyTorch backend (no JAX). BaSiCPy will use a CUDA-enabled PyTorch wheel
automatically when one is installed; otherwise it runs on CPU.
If you only need linumpy’s CuPy paths (resampling, FFT, morphology, N4),
no extra steps beyond pip install 'linumpy[gpu]' are required. To enable
GPU acceleration of BaSiCPy as well, install a CUDA build of PyTorch:
# Pick the index URL that matches your CUDA toolkit
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
Verify:
linum_gpu_info.py
linum_diagnose_pipeline.py --benchmark
GPU-Accelerated Scripts#
All scripts now accept --use_gpu / --no-use_gpu (default: --use_gpu).
GPU acceleration is enabled automatically when a CUDA device is detected; no
separate _gpu.py variant is needed.
Script |
GPU-accelerated operation |
Typical Speedup |
|---|---|---|
|
FFT / phase correlation |
8-47x |
|
Volume resize |
5-12x |
|
Volume resize |
5-12x |
|
Gaussian filter, Otsu threshold |
4-10x |
|
BaSiCPy via PyTorch/CUDA |
2-5x |
|
SSIM, morphology |
3-8x |
|
Mean projection |
≤1x |
|
Mean projection |
≤1x |
|
N4 bias field estimation |
varies |
|
Phase correlation |
8-16x |
GPU-Accelerated Operations#
Major Improvements (7-70x speedup)#
Operation |
Function |
Typical Speedup |
|---|---|---|
Binary Morphology |
|
7-67x |
FFT/iFFT |
|
9-47x |
Gaussian Filter |
|
7-20x |
Phase Correlation |
|
8-16x |
Resampling |
|
5-12x |
Medium Improvements (4-10x speedup)#
Operation |
Typical Speedup |
|---|---|
Normalization |
4-10x |
Percentile Clipping |
4-10x |
Interpolation |
5-10x |
Mask Creation |
2-4x |
No GPU Benefit (use CPU)#
Operation |
Reason |
|---|---|
Mean/Max Projection |
Simple reduction, transfer overhead dominates |
Galvo Detection |
Simple computation |
Multi-GPU Systems#
# Show status of all GPUs
linum_gpu_info.py --status
# Select best GPU (most free memory)
linum_gpu_info.py --select-best
# Use specific GPU via environment
CUDA_VISIBLE_DEVICES=1 nextflow run pipeline.nf --use_gpu true
Memory Management#
The NVIDIA A6000’s 48GB VRAM typically holds entire mosaic grids. For larger volumes:
import cupy as cp
# Clear GPU memory cache
cp.get_default_memory_pool().free_all_blocks()
# Check memory usage
mempool = cp.get_default_memory_pool()
print(f"GPU memory used: {mempool.used_bytes() / 1e9:.2f} GB")
Troubleshooting#
GPU Not Detected#
nvidia-smi
python -c "import cupy; print(cupy.cuda.runtime.getDeviceCount())"
linum_gpu_info.py
BaSiCPy / PyTorch CUDA Issues#
If linum_fix_illumination_3d.py falls back to CPU unexpectedly, verify the
PyTorch CUDA build is installed and visible:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
linum_diagnose_pipeline.py --debug-cuda
Reinstall PyTorch from the matching CUDA index URL (see the BaSiCPy section
above) if torch.cuda.is_available() returns False.
Out of Memory#
Reduce batch size / chunk size
Use
cp.get_default_memory_pool().free_all_blocks()between operationsSet
use_gpu=Falsefor specific operations
Reference Benchmarks#
Actual benchmarks on NVIDIA RTX A6000 (48GB):
Operation |
Image Size |
CPU Time |
GPU Time |
Speedup |
|---|---|---|---|---|
FFT2 |
2048×2048 |
207ms |
4.4ms |
47x |
Phase Correlation |
2048×2048 |
3876ms |
240ms |
16x |
Gaussian Filter |
2048×2048 |
79ms |
4.1ms |
20x |
Binary Closing |
2048×2048 |
99ms |
1.5ms |
67x |
Resize |
2048→1024 |
29ms |
3.4ms |
9x |
Mean Projection |
100×2048×2048 |
71ms |
146ms |
0.5x |
Module Reference#
from linumpy.gpu import (
GPU_AVAILABLE, # bool: Is GPU available?
CUPY_AVAILABLE, # bool: Is CuPy installed?
GPU_DEVICE_NAME, # str: GPU name
GPU_MEMORY_GB, # float: GPU memory in GB
to_gpu, to_cpu, # Transfer functions
print_gpu_info, # Print info
)
# Submodules
from linumpy.gpu.fft_ops import fft2, ifft2, phase_correlation
from linumpy.gpu.interpolation import resize, affine_transform
from linumpy.gpu.morphology import binary_closing, gaussian_filter
from linumpy.gpu.array_ops import normalize_percentile, clip_percentile